python – MATLAB矩阵乘法性能比NumPy快5倍
我在MATLAB& amp;中设置了两个相同的测试.关于矩阵乘法与广播的
Python.对于
Python,我使用了NumPy,对于MATLAB,我使用了使用BLAS的
mtimesx库.
MATLAB close all; clear; N = 1000 + 100; % a few initial runs to be trimmed off at the end a = 100; b = 30; c = 40; d = 50; A = rand(b,c,a); B = rand(c,d,a); C = zeros(b,a); times = zeros(1,N); for ii = 1:N tic C = mtimesx(A,B); times(ii) = toc; end times = times(101:end) * 1e3; plot(times); grid on; title(median(times)); Python import timeit import numpy as np import matplotlib.pyplot as plt N = 1000 + 100 # a few initial runs to be trimmed off at the end a = 100 b = 30 c = 40 d = 50 A = np.arange(a * b * c).reshape([a,b,c]) B = np.arange(a * c * d).reshape([a,d]) C = np.empty(a * b * d).reshape([a,d]) times = np.empty(N) for i in range(N): start = timeit.default_timer() C = A @ B times[i] = timeit.default_timer() - start times = times[101:] * 1e3 plt.plot(times,linewidth=0.5) plt.grid() plt.title(np.median(times)) plt.show() >我的Python是使用OpenBLAS从pip安装的默认Python. MATLAB代码在1ms运行,而Python在5.8ms运行,我无法弄清楚为什么,因为它们似乎都在使用BLAS. 编辑 来自Anaconda: In [7]: np.__config__.show() mkl_info: libraries = ['mkl_rt'] library_dirs = [...] define_macros = [('SCIPY_MKL_H',None),('HAVE_CBLAS',None)] include_dirs = [...] blas_mkl_info: libraries = ['mkl_rt'] library_dirs = [...] define_macros = [('SCIPY_MKL_H',None)] include_dirs = [...] blas_opt_info: libraries = ['mkl_rt'] library_dirs = [...] define_macros = [('SCIPY_MKL_H',None)] include_dirs = [...] lapack_mkl_info: libraries = ['mkl_rt'] library_dirs = [...] define_macros = [('SCIPY_MKL_H',None)] include_dirs = [...] lapack_opt_info: libraries = ['mkl_rt'] library_dirs = [...] define_macros = [('SCIPY_MKL_H',None)] include_dirs = [...] 从numpy使用pip In [2]: np.__config__.show() blas_mkl_info: NOT AVAILABLE blis_info: NOT AVAILABLE openblas_info: library_dirs = [...] libraries = ['openblas'] language = f77 define_macros = [('HAVE_CBLAS',None)] blas_opt_info: library_dirs = [...] libraries = ['openblas'] language = f77 define_macros = [('HAVE_CBLAS',None)] lapack_mkl_info: NOT AVAILABLE openblas_lapack_info: library_dirs = [...] libraries = ['openblas'] language = f77 define_macros = [('HAVE_CBLAS',None)] lapack_opt_info: library_dirs = [...] libraries = ['openblas'] language = f77 define_macros = [('HAVE_CBLAS',None)] 编辑2 解决方法
您的MATLAB代码使用浮点数组,但NumPy代码使用整数数组.这使得时间有显着差异.对于MATLAB和NumPy之间的“苹果到苹果”比较,Python / NumPy代码也必须使用浮点数组.
然而,这不是唯一重要的问题.在NumPy github网站上的issue 7569(以及issue 8957年)中讨论的NumPy存在缺陷. “堆叠”数组的矩阵乘法不使用快速BLAS例程来执行乘法.这意味着具有两个以上维度的数组的乘法可能比预期慢得多. 2-d数组的乘法确实使用快速例程,因此您可以通过在循环中将各个2-d数组相乘来解决此问题.令人惊讶的是,尽管有Python循环的开销,但在很多情况下,它比@,matmul或einsum更快地应用于完整堆叠阵列. 这是NumPy问题中显示的函数的变体,它在Python循环中执行矩阵乘法: def xmul(A,B): """ Multiply stacked matrices A (with shape (s,m,n)) by stacked matrices B (with shape (s,n,p)) to produce an array with shape (s,p). Mathematically equivalent to A @ B,but faster in many cases. The arguments are not validated. The code assumes that A and B are numpy arrays with the same data type and with shapes described above. """ out = np.empty((a.shape[0],a.shape[1],b.shape[2]),dtype=a.dtype) for j in range(a.shape[0]): np.matmul(a[j],b[j],out=out[j]) return out 我的NumPy安装也使用MKL(它是Anaconda发行版的一部分).下面是A @ B和xmul(A,B)的时序比较,使用浮点值数组: In [204]: A = np.random.rand(100,30,40) In [205]: B = np.random.rand(100,40,50) In [206]: %timeit A @ B 4.76 ms ± 6.37 μs per loop (mean ± std. dev. of 7 runs,100 loops each) In [207]: %timeit xmul(A,B) 582 μs ± 35.9 μs per loop (mean ± std. dev. of 7 runs,1000 loops each) 即使xmul使用Python循环,它也只需要A @ B的1/8. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |