How to call sgemm in a custom TensorFlow C++ op

Published: 2020-12-16 06:53:02 | Category: Encyclopedia | Source: compiled from the web

I am following the tutorial on how to define my own op for TensorFlow in C++.

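I want to call sgemm from my custom TensorFlow C++ op. I am writing two kernels, one for CUDA and one for CPU. What would the sgemm call look like in each case? Or is there a generic approach that works for both?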
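One approach that can serve both cases is a functor templated on the device type, with one specialization per device; TensorFlow's own kernels commonly follow this shape. The sketch below is a minimal illustration only: `CPUDevice`, `GPUDevice`, `GemmFunctor` and `RunGemm` are hypothetical names, the tag structs stand in for `Eigen::ThreadPoolDevice` and `Eigen::GpuDevice`, and the naive loop is a placeholder for a real BLAS call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tag types standing in for Eigen::ThreadPoolDevice and
// Eigen::GpuDevice, so the sketch compiles without TensorFlow.
struct CPUDevice {};
struct GPUDevice {};

// Primary template: one functor interface shared by both kernels.
// Computes C = A * B for row-major A (m x k), B (k x n), C (m x n).
template <typename Device>
struct GemmFunctor;

// CPU specialization: a naive triple loop as a placeholder for a BLAS sgemm.
template <>
struct GemmFunctor<CPUDevice> {
  void operator()(const CPUDevice&, std::size_t m, std::size_t n, std::size_t k,
                  const float* a, const float* b, float* c) const {
    for (std::size_t i = 0; i < m; ++i) {
      for (std::size_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (std::size_t p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
        c[i * n + j] = acc;
      }
    }
  }
};

// A GemmFunctor<GPUDevice> specialization would live in the *.cu.cc file and
// call cuBLAS there; it is omitted because it needs the CUDA toolkit to build.

// Roughly what an OpKernel's Compute() would do: select the functor by device.
template <typename Device>
std::vector<float> RunGemm(const Device& d, std::size_t m, std::size_t n,
                           std::size_t k, const std::vector<float>& a,
                           const std::vector<float>& b) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  GemmFunctor<Device>()(d, m, n, k, a.data(), b.data(), c.data());
  return c;
}
```

Calling `RunGemm(CPUDevice(), 2, 2, 3, a, b)` exercises the CPU path; the op registration would instantiate the same kernel template once per device.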

I tried the following snippet, but I could not get it to work because of missing include files (see here):

auto dev_ctx = context->op_device_context();
auto* dev_stream = dev_ctx->stream();
OP_REQUIRES(context, dev_stream, errors::Internal("No stream available."));

bool blas_launch_status =
    dev_stream
         ->ThenBlasGemm(...

Also, I am not sure whether this is generic or only works for CUDA.

Is this documented anywhere?

How do I call cublasSgemm in my GPU/CUDA implementation?
Or, more precisely, how do I get a cublasHandle_t?

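I searched the TF code a bit, and there is a class CUDABlas that seems to provide wrappers around the cuBLAS functions. Do I need to use it, or can I call cublasSgemm directly?
I assume I need the wrapper, because it keeps the CUDA stream executor in a sane state? How do I use the wrapper?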

I also found contrib/rnn/kernels/blas_gemm.cc and core/kernels/matmul_op.cc, which seem to do what I want. The code looks like this:

#define EIGEN_USE_THREADS

#if GOOGLE_CUDA
#include "tensorflow/core/platform/stream_executor.h"
#endif  // GOOGLE_CUDA

#include "tensorflow/contrib/rnn/kernels/blas_gemm.h"
#include "tensorflow/core/framework/op_kernel.h"
namespace tensorflow {

#if GOOGLE_CUDA
namespace {
template <typename T>
perftools::gputools::DeviceMemory<T> AsDeviceMemory(const T* cuda_memory) {
  perftools::gputools::DeviceMemoryBase wrapped(const_cast<T*>(cuda_memory));
  perftools::gputools::DeviceMemory<T> typed(wrapped);
  return typed;
}
}  // namespace
#endif  // GOOGLE_CUDA

namespace functor {
template <typename T>
void TensorCuBlasGemm<T>::operator()(OpKernelContext* ctx, bool transa,
                                     bool transb, uint64 m, uint64 n, uint64 k,
                                     T alpha, const T* a, int lda, const T* b,
                                     int ldb, T beta, T* c, int ldc) {
#if GOOGLE_CUDA
  perftools::gputools::blas::Transpose trans[] = {
      perftools::gputools::blas::Transpose::kNoTranspose,
      perftools::gputools::blas::Transpose::kTranspose};

  auto a_ptr = AsDeviceMemory(a);
  auto b_ptr = AsDeviceMemory(b);
  auto c_ptr = AsDeviceMemory(c);

  bool blas_launch_status =
      ctx->op_device_context()
          ->stream()
          ->ThenBlasGemm(trans[transa], trans[transb], m, n, k, alpha, a_ptr,
                         lda, b_ptr, ldb, beta, &c_ptr, ldc)
          .ok();
  OP_REQUIRES(ctx,blas_launch_status,errors::Aborted("CuBlasGemm failed!"));
#else
  ctx->SetStatus(errors::InvalidArgument("CuBlasGemm needs CUDA."));
#endif
}

That is, in my Compute(OpKernelContext* ctx), I would call

ctx->op_device_context()
      ->stream()
      ->ThenBlasGemm(...)

I tried that, but some include headers seem to be missing (TensorFlow 0.12.0 with GPU for Linux). I get fatal error: tensorflow/stream_executor/lib/status.h: No such file or directory. I reported this upstream here.

Is there any documentation on all of this, i.e. how to deal with cuBLAS, this DeviceStream interface, the stream executor logic, and so on?

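My current solution is something of a hack. For CPU, I try to link against whatever BLAS library is available on the system and use sgemm from there. For CUDA, I link against tensorflow/contrib/rnn/python/ops/_lstm_ops.so, because that is where I found TensorCuBlasGemm, which I can use. See here. Basically, that contrib module faces the same problem and came up with this. But this partly depends on include files that are generally not available; see the issue above.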
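For the CPU side, it may help to spell out what a BLAS sgemm computes before linking one in. The sketch below is a hypothetical stand-in, `ReferenceSgemm`, that mirrors the standard BLAS argument order and column-major convention so it runs without any BLAS library. A real build would call the linked library's routine instead (for example `cblas_sgemm`, which additionally takes a storage-order argument).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a BLAS sgemm, written out so the example runs
// without linking a BLAS library. It follows the BLAS convention:
//   C = alpha * op(A) * op(B) + beta * C, all matrices column-major,
// where op(X) is X or X^T, op(A) is m x k, op(B) is k x n, and C is m x n.
void ReferenceSgemm(bool transa, bool transb, int m, int n, int k, float alpha,
                    const float* a, int lda, const float* b, int ldb,
                    float beta, float* c, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        // Column-major: element (row, col) of X is x[row + col * ldx].
        const float a_ip = transa ? a[p + i * lda] : a[i + p * lda];
        const float b_pj = transb ? b[j + p * ldb] : b[p + j * ldb];
        acc += a_ip * b_pj;
      }
      c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
    }
  }
}

// Example: A = [[1,2,3],[4,5,6]] (2x3) times B = [[7,8],[9,10],[11,12]] (3x2),
// both stored column-major. Returns the 2x2 product, also column-major.
std::vector<float> MultiplyExample() {
  const std::vector<float> a = {1, 4, 2, 5, 3, 6};     // lda = 2
  const std::vector<float> b = {7, 9, 11, 8, 10, 12};  // ldb = 3
  std::vector<float> c(4, 0.0f);                       // ldc = 2
  ReferenceSgemm(false, false, 2, 2, 3, 1.0f, a.data(), 2, b.data(), 3, 0.0f,
                 c.data(), 2);
  assert(c.size() == 4);
  return c;
}
```

The leading dimensions (`lda`, `ldb`, `ldc`) are the strides between consecutive columns, which is why they equal the row counts of the stored matrices here.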

Solution

You can try the following, which works for me.
At the top of the *.cu.cc file:

#include <cassert>
#include <cublas_v2.h>
cublasHandle_t cublas_handle = NULL;

In the functor implementation, in the same *.cu.cc file:

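if (cublas_handle == NULL)
{
    // Make the calls outside assert(): with NDEBUG defined, assert() drops
    // its argument, so the handle would never be created.
    cublasStatus_t status = cublasCreate(&cublas_handle);
    assert(status == CUBLAS_STATUS_SUCCESS);
    status = cublasSetStream(cublas_handle, d.stream());
    assert(status == CUBLAS_STATUS_SUCCESS);
}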

where d is passed into the functor as an argument from the *.cc file, with the value ctx->eigen_device<Eigen::GpuDevice>().
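With the handle set up this way, the sgemm call itself might look like the sketch below. TensorFlow tensors are row-major while cuBLAS expects column-major data, so a common trick is to compute C^T = B^T * A^T by swapping the operands. `SgemmArgs` and `RowMajorToCublasArgs` are hypothetical helpers that only exist to make the argument mapping checkable without the CUDA toolkit; the actual `cublasSgemm` call is shown in a comment and assumes device pointers.

```cpp
#include <cassert>

// Hypothetical plain description of the arguments handed to cublasSgemm.
struct SgemmArgs {
  int m, n, k;     // dimensions as cuBLAS (column-major) sees them
  const float* a;  // first operand as seen by cuBLAS
  int lda;
  const float* b;  // second operand as seen by cuBLAS
  int ldb;
  float* c;
  int ldc;
};

// Row-major C (rows x cols) = A (rows x inner) * B (inner x cols).
// A row-major buffer read as column-major is the transposed matrix, so we ask
// cuBLAS for C^T = B^T * A^T: swap the operands and swap the m/n dimensions.
SgemmArgs RowMajorToCublasArgs(int rows, int cols, int inner, const float* a,
                               const float* b, float* c) {
  assert(rows > 0 && cols > 0 && inner > 0);
  return SgemmArgs{cols, rows, inner, b, cols, a, inner, c, cols};
}

// Inside the GPU functor, the call would then look like this:
//   const float alpha = 1.0f, beta = 0.0f;
//   SgemmArgs g = RowMajorToCublasArgs(rows, cols, inner, a, b, c);
//   cublasStatus_t status =
//       cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, g.m, g.n, g.k,
//                   &alpha, g.a, g.lda, g.b, g.ldb, &beta, g.c, g.ldc);
```

No transpose flags are needed with this mapping, because the swap already accounts for the difference in storage order.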

Hope this helps, cheers!

(Editor: 李大同)
