
Setting Up a Mamba Environment on Windows

I have recently been studying AI and ran into trouble installing the currently popular Mamba model. After consulting various guides online and some effort, I finally managed to compile and install the mamba package under Anaconda on Windows 11.

I used Visual Studio 2019.

1. Create the Mamba build environment on Windows:

conda create -n mamba python=3.10

conda activate mamba

Note: Mamba requires triton, but triton has no official Windows release. Someone has compiled a Windows build of triton 2.0.0, which targets Python 3.10 only, so the environment must be created with Python 3.10.

2. Install torch

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

The torch wheels are large and download slowly; I fetched them with a download manager (Xunlei) and installed from the local files.
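Once installed, it is worth confirming that the CUDA build (not a CPU-only wheel) was actually picked up. A minimal sketch, guarded so it also runs on machines where torch is absent (check_torch_cuda is my own helper, not part of any package):

```python
# Sanity check: is torch installed, and is it a CUDA build?
# Guarded with find_spec so the script also runs where torch is absent.
import importlib.util

def check_torch_cuda():
    """Return a short status string describing the local torch install."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if torch.version.cuda is None:
        return "torch installed, but CPU-only build"
    return f"torch {torch.__version__} built against CUDA {torch.version.cuda}"

print(check_torch_cuda())
```

If this reports a CPU-only build, the `--index-url` above was probably omitted and pip pulled the default CPU wheel.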

3. Install cmake

pip install cmake

4. Install triton

Download the Windows binary wheel of triton 2.0.0 from https://hf-mirror.com/r4ziel/xformers_pre_built/blob/main/triton-2.0.0-cp310-cp310-win_amd64.whl, then install it:

pip install D:\Downloads\triton-2.0.0-cp310-cp310-win_amd64.whl
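The cp310 in the wheel filename is the reason step 1 pinned Python 3.10: pip refuses to install a wheel whose tag does not match the running interpreter. A tiny illustrative check (wheel_tag_matches is a hypothetical helper, and the substring test is deliberately rough):

```python
# A wheel's cpXY tag must match the interpreter that installs it.
import sys

def wheel_tag_matches(wheel_name: str) -> bool:
    """Rough check: does the wheel's cpXY tag match this interpreter?"""
    tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return tag in wheel_name

# True only when run under Python 3.10, matching the triton wheel above.
print(wheel_tag_matches("triton-2.0.0-cp310-cp310-win_amd64.whl"))
```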

5. Get the mamba source and install it

git clone https://github.com/state-spaces/mamba

Switch into the mamba directory:

cd mamba

Modify setup.py to add the compile flag -DM_LOG2E=1.44269504.

Note: selective_scan_fwd_kernel.cuh and selective_scan_bwd_kernel.cuh use this predefined constant, but it is never defined in those files.
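The exact layout of setup.py differs between mamba versions, so here is only a sketch of the kind of change meant above: a hypothetical helper (add_log2e_flag is my own name, not part of the repo) that appends the missing definition to the nvcc flag list passed to the CUDA extension:

```python
# Sketch: append the missing M_LOG2E definition to the nvcc compile
# flags in setup.py. The surrounding setup.py structure varies by
# mamba version; only the flag itself comes from the steps above.
def add_log2e_flag(nvcc_flags):
    """Append -DM_LOG2E=1.44269504 if it is not already present."""
    flag = "-DM_LOG2E=1.44269504"
    if flag not in nvcc_flags:
        nvcc_flags.append(flag)
    return nvcc_flags

print(add_log2e_flag(["-O3", "-std=c++17"]))
# → ['-O3', '-std=c++17', '-DM_LOG2E=1.44269504']
```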

Then extract the body of the innermost lambda in selective_scan_fwd_launch (csrc\selective_scan\selective_scan_fwd_kernel.cuh) and in selective_scan_bwd_launch (csrc\selective_scan\selective_scan_bwd_kernel.cuh) into a standalone function.

selective_scan_fwd_kernel.cuh:

template<int kNThreads, int kNItems, int kNRows, bool kIsEvenLen, bool kIsVariableB, bool kIsVariableC, bool kHasZ, typename input_t, typename weight_t>
void selective_scan_fwd_call(SSMParamsBase &params, cudaStream_t stream) {
    using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
    // constexpr int kSmemSize = Ktraits::kSmemSize;
    constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
    // printf("smem_size = %d\n", kSmemSize);
    dim3 grid(params.batch, params.dim / kNRows);
    auto kernel = &selective_scan_fwd_kernel<Ktraits>;
    if (kSmemSize >= 48 * 1024) {
        C10_CUDA_CHECK(cudaFuncSetAttribute(
            kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
    }
    kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
}

template<int kNThreads, int kNItems, typename input_t, typename weight_t>
void selective_scan_fwd_launch(SSMParamsBase &params, cudaStream_t stream) {
    // Only kNRows == 1 is tested for now, which ofc doesn't differ from previously when we had each block
    // processing 1 row.
    BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
        BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
            BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
                BOOL_SWITCH(params.z_ptr != nullptr, kHasZ, [&] {
                    selective_scan_fwd_call<kNThreads, kNItems, 1, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>(params, stream);
                });
            });
        });
    });
}

selective_scan_bwd_kernel.cuh:

template<int kNThreads, int kNItems, bool kIsEvenLen, bool kIsVariableB, bool kIsVariableC, bool kDeltaSoftplus, bool kHasZ, typename input_t, typename weight_t>
void selective_scan_bwd_call(SSMParamsBwd &params, cudaStream_t stream) {
    using Ktraits = Selective_Scan_bwd_kernel_traits<kNThreads, kNItems, kIsEvenLen, kIsVariableB, kIsVariableC, kDeltaSoftplus, kHasZ, input_t, weight_t>;
    // using Ktraits = Selective_Scan_bwd_kernel_traits<kNThreads, kNItems, true, kIsVariableB, kIsVariableC, kDeltaSoftplus, kHasZ, input_t, weight_t>;
    // TODO: check this
    constexpr int kSmemSize = Ktraits::kSmemSize + MAX_DSTATE * sizeof(typename Ktraits::scan_t) + (kNThreads + 4 * MAX_DSTATE) * sizeof(typename Ktraits::weight_t);
    // printf("smem_size = %d\n", kSmemSize);
    dim3 grid(params.batch, params.dim);
    auto kernel = &selective_scan_bwd_kernel<Ktraits>;
    if (kSmemSize >= 48 * 1024) {
        C10_CUDA_CHECK(cudaFuncSetAttribute(
            kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
    }
    kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
}

template<int kNThreads, int kNItems, typename input_t, typename weight_t>
void selective_scan_bwd_launch(SSMParamsBwd &params, cudaStream_t stream) {
    BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
        BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
            BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
                BOOL_SWITCH(params.delta_softplus, kDeltaSoftplus, [&] {
                    BOOL_SWITCH(params.z_ptr != nullptr, kHasZ, [&] {
                        selective_scan_bwd_call<kNThreads, kNItems, kIsEvenLen, kIsVariableB, kIsVariableC, kDeltaSoftplus, kHasZ, input_t, weight_t>(params, stream);
                    });
                });
            });
        });
    });
}

Note: my guess is that the macros nest too deeply, so the expanded code exceeded a length limit in the compiler; hoisting the lambda body into a named function avoids the problem.

Set the environment variable, then build:

set MAMBA_FORCE_BUILD=TRUE

pip install .

The build and install succeed, and a new 422 MB file appears under the virtual environment's Lib\site-packages directory: selective_scan_cuda.cp310-win_amd64.pyd.
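A quick way to confirm the freshly built extension actually loads (a sketch; mamba_build_status is my own helper, guarded so it also runs where the package is absent):

```python
# Smoke test: can the compiled selective-scan extension be imported?
import importlib.util

def mamba_build_status():
    """Report whether the compiled selective_scan_cuda extension loads."""
    if importlib.util.find_spec("mamba_ssm") is None:
        return "mamba_ssm not installed"
    try:
        import selective_scan_cuda  # the .pyd built above
        return "selective_scan_cuda loaded"
    except ImportError as exc:
        return f"package present but extension failed to load: {exc}"

print(mamba_build_status())
```

An ImportError here usually means the .pyd was built against a different torch or CUDA version than the one installed.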

6. Install causal-conv1d (optional)

git clone https://github.com/Dao-AILab/causal-conv1d.git

cd causal-conv1d

set CAUSAL_CONV1D_FORCE_BUILD=TRUE

Open csrc\causal_conv1d.cpp:

On lines 159, 277, and 278, change and to && in the conditions (MSVC rejects the alternative operator spelling and unless conformance mode is enabled or <ciso646> is included).

pip install .

The build and install succeed, and a new 157 MB file appears under the virtual environment's Lib\site-packages directory: causal_conv1d_cuda.cp310-win_amd64.pyd.
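For intuition, causal_conv1d computes a 1-D convolution in which each output position sees only current and past inputs. A pure-Python sketch of the semantics (not the CUDA implementation, which runs depthwise over channels):

```python
def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] = sum_k w[k] * x[t - k],
    treating x as zero before t = 0, so no future input leaks in."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            if t - k >= 0:          # only current and past samples
                acc += wk * x[t - k]
        y.append(acc)
    return y

print(causal_conv1d([1.0, 2.0, 3.0], [1.0, 1.0]))
# → [1.0, 3.0, 5.0]  (running pairwise sums)
```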

Notes:

If the build fails with:

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

the real error actually occurred during the ninja compile step.

After a failed build, the build directory is deleted, and a full rebuild takes a long time. I commented out the call_subprocess call at line 285 of D:\anaconda3\envs\mamba\Lib\site-packages\pip\_internal\wheel_builder.py so that the build directory is preserved, then changed into the directory containing build.ninja:

cd build\temp.win-amd64-cpython-310\Release

and ran ninja directly, which reproduces the error much faster.
