
Deploying a Vehicle Detection Model on a Raspberry Pi 4B with YOLOv5-Lite (Part 3): Deploying the Model on the Raspberry Pi 4B

Contents

1. Introduction

2. Raspberry Pi environment setup

3. Building MNN and converting the model

4. Inference

References


1. Introduction

YOLOv5-Lite项目:https://github.com/ppogg/YOLOv5-Lite

Deploying a Vehicle Detection Model on a Raspberry Pi 4B with YOLOv5-Lite (Part 1): Processing the UA-DETRAC Vehicle Detection Dataset - CSDN blog

Deploying a Vehicle Detection Model on a Raspberry Pi 4B with YOLOv5-Lite (Part 2): Training the Model on the Dataset - CSDN blog

        In the previous two articles, we prepared the UA-DETRAC dataset and used it to train a v5Lite-e model. We now have the trained weights in .pt format, which we will call v5Lite-e.pt. The next step is to convert this weight file to another format and deploy it on the Raspberry Pi 4B.

2. Raspberry Pi environment setup

        For the OS image, the official Raspberry Pi Imager is recommended (download: https://www.raspberrypi.com/software/). Choose Raspberry Pi OS (raspbian) as the system; the major releases differ subtly, so the 64-bit Bullseye release is recommended. Official image downloads:

https://www.raspberrypi.com/software/operating-systems/#raspberry-pi-os-legacy-64-bit

        After flashing the image, run the following commands on the Raspberry Pi to install the dependencies:

sudo apt-get install libprotobuf-dev protobuf-compiler
sudo apt-get install cmake
sudo apt-get install libopencv-dev

        To use a camera, run sudo raspi-config to open the configuration tool and enable the camera.

3. Building MNN and converting the model

        MNN 2.7.1 is used here. In the MNN source directory, run the following commands to build it:

mkdir build
cd build
cmake .. -DMNN_BUILD_CONVERTER=ON -DMNN_BUILD_TOOL=ON -DMNN_BUILD_QUANTOOLS=ON -DMNN_EVALUATION=ON -DMNN_SUPPORT_BF16=ON -DMNN_ARM82=ON -DMNN_BUILD_OPENCV=ON -DMNN_USE_OPENCV=ON
make -j

       The cmake options are described in the MNN project wiki: https://github.com/alibaba/MNN/wiki/cmake#%E7%BC%96%E8%AF%91%E5%AE%8F%E4%BB%8B%E7%BB%8D

        The build may fail with the following error:

/home/pi/MNN-2.7.1/source/backend/cpu/arm/arm64/bf16/ARMV86_MNNPackedMatMulRemain_BF16.S:158: Fatal error: macros nested too deeply

        This happens because the macros in these assembly sources are nested too deeply for the assembler to expand at build time, so the FMAX and FMIN macro instructions in the failing files need to be expanded by hand.

        Replacing the contents of ARMV86_MNNPackedMatMul_BF16.S and ARMV86_MNNPackedMatMulRemain_BF16.S with the versions given below resolves the error.

        ARMV86_MNNPackedMatMul_BF16.S

//
// ARMV86_MNNPackedMatMul_BF16.S
// MNN
//
// Created by MNN on 2022/10/09.
// Copyright © 2018-2021 Alibaba Group Holding Limited
//
#ifdef __aarch64__
#include "MNNAsmGlobal.h"
.text
.align 5
.macro SET_ZERO d0, d1, d2, d3
movi \d0\().4s, #0
movi \d1\().4s, #0
movi \d2\().4s, #0
movi \d3\().4s, #0
.endm
.macro Float32ToBf16 d0, d1, d2, d3
shrn \d0\().4h, \d0\().4s, #16
shrn \d1\().4h, \d1\().4s, #16
shrn \d2\().4h, \d2\().4s, #16
shrn \d3\().4h, \d3\().4s, #16
.endm
.macro SET_BIAS s, d0, d1, d2, d3
mov \d0\().16b, \s\().16b
mov \d1\().16b, \s\().16b
mov \d2\().16b, \s\().16b
mov \d3\().16b, \s\().16b
.endm
// 12 * 8 * 4 MatMul
asm_function ARMV86_MNNPackedMatMul_BF16
//void ARMV86_MNNPackedMatMul_BF16(float* C, const float* A, const float* B, const size_t* parameter, const float* postParameters, const float* bias);
// x0: C, x1:A, x2:B, x3:parameter, x4: postParameters, x5:bias
stp d14, d15, [sp, #-64]!
stp d12, d13, [sp, #16]
stp d10, d11, [sp, #32]
stp d8, d9, [sp, #48]
//ldr x8, [x3, #0] // deprecated
ldr x9, [x3, #8] // l
ldr x10, [x3, #16] // h
mov x11, #64 // B_stride = LP * HP = 4 * 8 * sizeof(int16_t)
ldr x13, [x3, #24] // cStride
ldr x7, [x3, #40] // bExtraStride
add x10, x10, #3
lsr x10, x10, #2
add x9, x9, #3
lsr x9, x9, #2
cbz x4, Start
ld1 {v5.4s}, [x4]
mov w17, v5.s[2] // min value
mov w18, v5.s[3] // max value
Start:
cmp x10, #2
blt LH4
LH8:
sub x14, x13, #96 // cStride - 96
LoopH:
mov x15, x1
mov x12, x9
cbz x5, NoBiasH8
ld1 {v0.4h, v1.4h}, [x5], #16 // 8 * sizeof(int16_t)
shll v0.4s, v0.4h, #16
shll v1.4s, v1.4h, #16
mov v2.16b, v0.16b
mov v3.16b, v1.16b
uzp1 v18.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v19.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
uzp1 v30.2d, v1.2d, v3.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v31.2d, v1.2d, v3.2d // bias_2, bias_3, bias_2, bias_3
SET_BIAS v18, v8, v10, v12, v14
mov v16.16b, v18.16b
SET_BIAS v19, v9, v11, v13, v15
mov v17.16b, v19.16b
SET_BIAS v30, v20, v22, v24, v26
mov v28.16b, v30.16b
SET_BIAS v31, v21, v23, v25, v27
mov v29.16b, v31.16b
b LoopL
NoBiasH8:
SET_ZERO v8, v9, v10, v11
SET_ZERO v12, v13, v14, v15
SET_ZERO v16, v17, v18, v19
SET_ZERO v20, v21, v22, v23
SET_ZERO v24, v25, v26, v27
SET_ZERO v28, v29, v30, v31
LoopL:
// A [12, 4, bf16] : rn = 6 : v2 - v7
// B [ 8, 4, bf16] : rn = 2 : v0 - v1
// C [12, 8, fp32] : rn = 24 : v8 - v31
ld1 {v2.8h, v3.8h, v4.8h, v5.8h}, [x15], #64 // A: 8 * 4 * sizeof(int16_t)
ld1 {v6.8h, v7.8h}, [x15], #32 // A: 4 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h}, [x2], #32 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec48 // bfmmla v8.4s, v2.8h, v0.8h
.inst 0x6e41ec49 // bfmmla v9.4s, v2.8h, v1.8h
.inst 0x6e40ec6a // bfmmla v10.4s, v3.8h, v0.8h
.inst 0x6e41ec6b // bfmmla v11.4s, v3.8h, v1.8h
.inst 0x6e40ec8c // bfmmla v12.4s, v4.8h, v0.8h
.inst 0x6e41ec8d // bfmmla v13.4s, v4.8h, v1.8h
.inst 0x6e40ecae // bfmmla v14.4s, v5.8h, v0.8h
.inst 0x6e41ecaf // bfmmla v15.4s, v5.8h, v1.8h
.inst 0x6e40ecd0 // bfmmla v16.4s, v6.8h, v0.8h
.inst 0x6e41ecd1 // bfmmla v17.4s, v6.8h, v1.8h
.inst 0x6e40ecf2 // bfmmla v18.4s, v7.8h, v0.8h
.inst 0x6e41ecf3 // bfmmla v19.4s, v7.8h, v1.8h
ld1 {v0.8h, v1.8h}, [x2], #32 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec54 // bfmmla v20.4s, v2.8h, v0.8h
.inst 0x6e41ec55 // bfmmla v21.4s, v2.8h, v1.8h
.inst 0x6e40ec76 // bfmmla v22.4s, v3.8h, v0.8h
.inst 0x6e41ec77 // bfmmla v23.4s, v3.8h, v1.8h
.inst 0x6e40ec98 // bfmmla v24.4s, v4.8h, v0.8h
.inst 0x6e41ec99 // bfmmla v25.4s, v4.8h, v1.8h
.inst 0x6e40ecba // bfmmla v26.4s, v5.8h, v0.8h
.inst 0x6e41ecbb // bfmmla v27.4s, v5.8h, v1.8h
.inst 0x6e40ecdc // bfmmla v28.4s, v6.8h, v0.8h
.inst 0x6e41ecdd // bfmmla v29.4s, v6.8h, v1.8h
.inst 0x6e40ecfe // bfmmla v30.4s, v7.8h, v0.8h
.inst 0x6e41ecff // bfmmla v31.4s, v7.8h, v1.8h
subs x12, x12, #1
bgt LoopL
LoopLEnd:
uzp1 v7.2d, v8.2d, v9.2d
uzp2 v8.2d, v8.2d, v9.2d
uzp1 v9.2d, v10.2d, v11.2d
uzp2 v10.2d, v10.2d, v11.2d
uzp1 v11.2d, v12.2d, v13.2d
uzp2 v12.2d, v12.2d, v13.2d
uzp1 v13.2d, v14.2d, v15.2d
uzp2 v14.2d, v14.2d, v15.2d
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
uzp1 v19.2d, v20.2d, v21.2d
uzp2 v20.2d, v20.2d, v21.2d
uzp1 v21.2d, v22.2d, v23.2d
uzp2 v22.2d, v22.2d, v23.2d
uzp1 v23.2d, v24.2d, v25.2d
uzp2 v24.2d, v24.2d, v25.2d
uzp1 v25.2d, v26.2d, v27.2d
uzp2 v26.2d, v26.2d, v27.2d
uzp1 v27.2d, v28.2d, v29.2d
uzp2 v28.2d, v28.2d, v29.2d
uzp1 v29.2d, v30.2d, v31.2d
uzp2 v30.2d, v30.2d, v31.2d
cbz x4, StoreLH8
PostTreatLH8:
dup v5.4s, w17
dup v6.4s, w18
fmax v7.4s, v7.4s, v5.4s
fmax v8.4s, v8.4s, v5.4s
fmax v9.4s, v9.4s, v5.4s
fmax v10.4s, v10.4s, v5.4s
fmax v11.4s, v11.4s, v5.4s
fmax v12.4s, v12.4s, v5.4s
fmax v13.4s, v13.4s, v5.4s
fmax v14.4s, v14.4s, v5.4s
fmax v15.4s, v15.4s, v5.4s
fmax v16.4s, v16.4s, v5.4s
fmax v17.4s, v17.4s, v5.4s
fmax v18.4s, v18.4s, v5.4s
fmax v19.4s, v19.4s, v5.4s
fmax v20.4s, v20.4s, v5.4s
fmax v21.4s, v21.4s, v5.4s
fmax v22.4s, v22.4s, v5.4s
fmax v23.4s, v23.4s, v5.4s
fmax v24.4s, v24.4s, v5.4s
fmax v25.4s, v25.4s, v5.4s
fmax v26.4s, v26.4s, v5.4s
fmax v27.4s, v27.4s, v5.4s
fmax v28.4s, v28.4s, v5.4s
fmax v29.4s, v29.4s, v5.4s
fmax v30.4s, v30.4s, v5.4s
fmin v7.4s, v7.4s, v6.4s
fmin v8.4s, v8.4s, v6.4s
fmin v9.4s, v9.4s, v6.4s
fmin v10.4s, v10.4s, v6.4s
fmin v11.4s, v11.4s, v6.4s
fmin v12.4s, v12.4s, v6.4s
fmin v13.4s, v13.4s, v6.4s
fmin v14.4s, v14.4s, v6.4s
fmin v15.4s, v15.4s, v6.4s
fmin v16.4s, v16.4s, v6.4s
fmin v17.4s, v17.4s, v6.4s
fmin v18.4s, v18.4s, v6.4s
fmin v19.4s, v19.4s, v6.4s
fmin v20.4s, v20.4s, v6.4s
fmin v21.4s, v21.4s, v6.4s
fmin v22.4s, v22.4s, v6.4s
fmin v23.4s, v23.4s, v6.4s
fmin v24.4s, v24.4s, v6.4s
fmin v25.4s, v25.4s, v6.4s
fmin v26.4s, v26.4s, v6.4s
fmin v27.4s, v27.4s, v6.4s
fmin v28.4s, v28.4s, v6.4s
fmin v29.4s, v29.4s, v6.4s
fmin v30.4s, v30.4s, v6.4s
StoreLH8:
Float32ToBf16 v7, v8, v9, v10
Float32ToBf16 v11, v12, v13, v14
Float32ToBf16 v15, v16, v17, v18
Float32ToBf16 v19, v20, v21, v22
Float32ToBf16 v23, v24, v25, v26
Float32ToBf16 v27, v28, v29, v30
st1 {v7.4h, v8.4h, v9.4h, v10.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v11.4h, v12.4h, v13.4h, v14.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v15.4h, v16.4h, v17.4h, v18.4h}, [x0], #32 // 16 * sizeof(int16_t)
add x0, x0, x14
st1 {v19.4h, v20.4h, v21.4h, v22.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v23.4h, v24.4h, v25.4h, v26.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v27.4h, v28.4h, v29.4h, v30.4h}, [x0], #32 // 16 * sizeof(int16_t)
add x0, x0, x14
add x2, x2, x7 // weight stride
sub x10, x10, #2
cmp x10, #2
bge LoopH
LH4:
cbz x10, End
LoopHR:
mov x15, x1
mov x12, x9
cbz x5, NoBiasH4
ld1 {v0.4h}, [x5], #8 // 4 * sizeof(int16_t)
shll v0.4s, v0.4h, #16
mov v2.16b, v0.16b
uzp1 v18.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v19.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
SET_BIAS v18, v8, v10, v12, v14
mov v16.16b, v18.16b
SET_BIAS v19, v9, v11, v13, v15
mov v17.16b, v19.16b
b LoopLR
NoBiasH4:
SET_ZERO v8, v9, v10, v11
SET_ZERO v12, v13, v14, v15
SET_ZERO v16, v17, v18, v19
LoopLR:
// A [12, 4, bf16] : rn = 6 : v2 - v7
// B [ 4, 4, bf16] : rn = 2 : v0 - v1
// C [12, 4, fp32] : rn = 12 : v8 - v19
ld1 {v2.8h, v3.8h, v4.8h, v5.8h}, [x15], #64 // A: 8 * 4 * sizeof(int16_t)
ld1 {v6.8h, v7.8h}, [x15], #32 // A: 4 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h}, [x2], x11 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec48 // bfmmla v8.4s, v2.8h, v0.8h
.inst 0x6e41ec49 // bfmmla v9.4s, v2.8h, v1.8h
.inst 0x6e40ec6a // bfmmla v10.4s, v3.8h, v0.8h
.inst 0x6e41ec6b // bfmmla v11.4s, v3.8h, v1.8h
.inst 0x6e40ec8c // bfmmla v12.4s, v4.8h, v0.8h
.inst 0x6e41ec8d // bfmmla v13.4s, v4.8h, v1.8h
.inst 0x6e40ecae // bfmmla v14.4s, v5.8h, v0.8h
.inst 0x6e41ecaf // bfmmla v15.4s, v5.8h, v1.8h
.inst 0x6e40ecd0 // bfmmla v16.4s, v6.8h, v0.8h
.inst 0x6e41ecd1 // bfmmla v17.4s, v6.8h, v1.8h
.inst 0x6e40ecf2 // bfmmla v18.4s, v7.8h, v0.8h
.inst 0x6e41ecf3 // bfmmla v19.4s, v7.8h, v1.8h
subs x12, x12, #1
bgt LoopLR
LoopLREnd:
add x2, x2, x7 // weight stride
uzp1 v7.2d, v8.2d, v9.2d
uzp2 v8.2d, v8.2d, v9.2d
uzp1 v9.2d, v10.2d, v11.2d
uzp2 v10.2d, v10.2d, v11.2d
uzp1 v11.2d, v12.2d, v13.2d
uzp2 v12.2d, v12.2d, v13.2d
uzp1 v13.2d, v14.2d, v15.2d
uzp2 v14.2d, v14.2d, v15.2d
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
cbz x4, StoreLH4
PostTreatLH4:
dup v5.4s, w17
dup v6.4s, w18
fmax v7.4s, v7.4s, v5.4s
fmax v8.4s, v8.4s, v5.4s
fmax v9.4s, v9.4s, v5.4s
fmax v10.4s, v10.4s, v5.4s
fmax v11.4s, v11.4s, v5.4s
fmax v12.4s, v12.4s, v5.4s
fmax v13.4s, v13.4s, v5.4s
fmax v14.4s, v14.4s, v5.4s
fmax v15.4s, v15.4s, v5.4s
fmax v16.4s, v16.4s, v5.4s
fmax v17.4s, v17.4s, v5.4s
fmax v18.4s, v18.4s, v5.4s
fmin v7.4s, v7.4s, v6.4s
fmin v8.4s, v8.4s, v6.4s
fmin v9.4s, v9.4s, v6.4s
fmin v10.4s, v10.4s, v6.4s
fmin v11.4s, v11.4s, v6.4s
fmin v12.4s, v12.4s, v6.4s
fmin v13.4s, v13.4s, v6.4s
fmin v14.4s, v14.4s, v6.4s
fmin v15.4s, v15.4s, v6.4s
fmin v16.4s, v16.4s, v6.4s
fmin v17.4s, v17.4s, v6.4s
fmin v18.4s, v18.4s, v6.4s
StoreLH4:
Float32ToBf16 v7, v8, v9, v10
Float32ToBf16 v11, v12, v13, v14
Float32ToBf16 v15, v16, v17, v18
st1 {v7.4h, v8.4h, v9.4h, v10.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v11.4h, v12.4h, v13.4h, v14.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v15.4h, v16.4h, v17.4h, v18.4h}, [x0], #32 // 16 * sizeof(int16_t)
End:
ldp d8, d9, [sp, #48]
ldp d10, d11, [sp, #32]
ldp d12, d13, [sp, #16]
ldp d14, d15, [sp], #64
ret
#endif

        ARMV86_MNNPackedMatMulRemain_BF16.S

//
// ARMV86_MNNPackedMatMulRemain_BF16.S
// MNN
//
// Created by MNN on 2022/10/09.
// Copyright © 2018-2021 Alibaba Group Holding Limited
//
#ifdef __aarch64__
#include "MNNAsmGlobal.h"
.text
.align 5
.macro SET_ZERO d0, d1, d2, d3
movi \d0\().4s, #0
movi \d1\().4s, #0
movi \d2\().4s, #0
movi \d3\().4s, #0
.endm
.macro Float32ToBf16 d0, d1, d2, d3
shrn \d0\().4h, \d0\().4s, #16
shrn \d1\().4h, \d1\().4s, #16
shrn \d2\().4h, \d2\().4s, #16
shrn \d3\().4h, \d3\().4s, #16
.endm
.macro SET_BIAS s, d0, d1, d2
mov \d0\().16b, \s\().16b
mov \d1\().16b, \s\().16b
mov \d2\().16b, \s\().16b
.endm
// 12 * 8 * 4 MatMul
asm_function ARMV86_MNNPackedMatMulRemain_BF16
//void ARMV86_MNNPackedMatMulRemain_BF16(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias);
//Auto x0: C, x1:A, x2:B, x3:eSize, x4:parameter, x5:postParameters, x6:bias
sub sp, sp, #32
str x19, [sp, #0]
str x20, [sp, #8]
str x21, [sp, #16]
ldr x11, [x4, #0] // aStride
ldr x9, [x4, #8] // l
ldr x10, [x4, #16] // h
lsl x11, x11, #2 // aStride * 4
mov x16, #64 // B_stride = LP * HP = 4 * 8 * sizeof(int16_t)
ldr x7, [x4, #24] // cStride
ldr x19, [x4, #40] // bExtraStride
add x10, x10, #3
lsr x10, x10, #2
add x9, x9, #3
lsr x9, x9, #2
cbz x5, Start
ld1 {v5.4s}, [x5]
dup v9.4s, v5.s[2] // Min Value
dup v10.4s, v5.s[3] // Max Value
Start:
E8:
cmp x3, #8
blt E4
LoopE8: // e, TILE_BLOCK size is 8
mov x20, x6 // bias
mov x8, x10 // updiv(h, 4)
mov x21, x0 // dest, C
mov x13, x2 // weight, B
LH8:
cmp x8, #2 // h/4 > 2
blt LH4
sub x14, x7, #64 // cStride - 64
LoopH8x8:
mov x15, x1 // src, A
mov x12, x9 // l
cbz x5, NoBiasLH8
ld1 {v0.4h, v1.4h}, [x20], #16 // 8 * sizeof(int16_t)
shll v0.4s, v0.4h, #16
shll v1.4s, v1.4h, #16
mov v2.16b, v0.16b
mov v3.16b, v1.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
uzp1 v24.2d, v1.2d, v3.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v25.2d, v1.2d, v3.2d // bias_2, bias_3, bias_2, bias_3
SET_BIAS v16, v18, v20, v22
SET_BIAS v17, v19, v21, v23
SET_BIAS v24, v26, v28, v30
SET_BIAS v25, v27, v29, v31
b LoopL
NoBiasLH8:
SET_ZERO v16, v17, v18, v19
SET_ZERO v20, v21, v22, v23
SET_ZERO v24, v25, v26, v27
SET_ZERO v28, v29, v30, v31
LoopL:
// A [8, 4, bf16] : rn = 4 : v4 - v7
// B [8, 4, bf16] : rn = 4 : v0 - v3
// C [8, 8, fp32] : rn = 16 : v16 - v31
ld1 {v4.8h, v5.8h, v6.8h, v7.8h}, [x15], x11 // A: 8 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h, v2.8h, v3.8h}, [x13], x16 // B: 8 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
.inst 0x6e40ecb2 // bfmmla v18.4s, v5.8h, v0.8h
.inst 0x6e41ecb3 // bfmmla v19.4s, v5.8h, v1.8h
.inst 0x6e40ecd4 // bfmmla v20.4s, v6.8h, v0.8h
.inst 0x6e41ecd5 // bfmmla v21.4s, v6.8h, v1.8h
.inst 0x6e40ecf6 // bfmmla v22.4s, v7.8h, v0.8h
.inst 0x6e41ecf7 // bfmmla v23.4s, v7.8h, v1.8h
.inst 0x6e42ec98 // bfmmla v24.4s, v4.8h, v2.8h
.inst 0x6e43ec99 // bfmmla v25.4s, v4.8h, v3.8h
.inst 0x6e42ecba // bfmmla v26.4s, v5.8h, v2.8h
.inst 0x6e43ecbb // bfmmla v27.4s, v5.8h, v3.8h
.inst 0x6e42ecdc // bfmmla v28.4s, v6.8h, v2.8h
.inst 0x6e43ecdd // bfmmla v29.4s, v6.8h, v3.8h
.inst 0x6e42ecfe // bfmmla v30.4s, v7.8h, v2.8h
.inst 0x6e43ecff // bfmmla v31.4s, v7.8h, v3.8h
subs x12, x12, #1
bgt LoopL
LoopLEnd:
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
uzp1 v19.2d, v20.2d, v21.2d
uzp2 v20.2d, v20.2d, v21.2d
uzp1 v21.2d, v22.2d, v23.2d
uzp2 v22.2d, v22.2d, v23.2d
uzp1 v23.2d, v24.2d, v25.2d
uzp2 v24.2d, v24.2d, v25.2d
uzp1 v25.2d, v26.2d, v27.2d
uzp2 v26.2d, v26.2d, v27.2d
uzp1 v27.2d, v28.2d, v29.2d
uzp2 v28.2d, v28.2d, v29.2d
uzp1 v29.2d, v30.2d, v31.2d
uzp2 v30.2d, v30.2d, v31.2d
cbz x5, StoreLH8
PostTreatLH8:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmax v17.4s, v17.4s, v9.4s
fmax v18.4s, v18.4s, v9.4s
fmax v19.4s, v19.4s, v9.4s
fmax v20.4s, v20.4s, v9.4s
fmax v21.4s, v21.4s, v9.4s
fmax v22.4s, v22.4s, v9.4s
fmax v23.4s, v23.4s, v9.4s
fmax v24.4s, v24.4s, v9.4s
fmax v25.4s, v25.4s, v9.4s
fmax v26.4s, v26.4s, v9.4s
fmax v27.4s, v27.4s, v9.4s
fmax v28.4s, v28.4s, v9.4s
fmax v29.4s, v29.4s, v9.4s
fmax v30.4s, v30.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
fmin v17.4s, v17.4s, v10.4s
fmin v18.4s, v18.4s, v10.4s
fmin v19.4s, v19.4s, v10.4s
fmin v20.4s, v20.4s, v10.4s
fmin v21.4s, v21.4s, v10.4s
fmin v22.4s, v22.4s, v10.4s
fmin v23.4s, v23.4s, v10.4s
fmin v24.4s, v24.4s, v10.4s
fmin v25.4s, v25.4s, v10.4s
fmin v26.4s, v26.4s, v10.4s
fmin v27.4s, v27.4s, v10.4s
fmin v28.4s, v28.4s, v10.4s
fmin v29.4s, v29.4s, v10.4s
fmin v30.4s, v30.4s, v10.4s
StoreLH8:
Float32ToBf16 v15, v16, v17, v18
Float32ToBf16 v19, v20, v21, v22
Float32ToBf16 v23, v24, v25, v26
Float32ToBf16 v27, v28, v29, v30
st1 {v15.4h, v16.4h, v17.4h, v18.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v19.4h, v20.4h, v21.4h, v22.4h}, [x0], #32 // 16 * sizeof(int16_t)
add x0, x0, x14
st1 {v23.4h, v24.4h, v25.4h, v26.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v27.4h, v28.4h, v29.4h, v30.4h}, [x0], #32 // 16 * sizeof(int16_t)
add x0, x0, x14
add x13, x13, x19 // weight stride
sub x8, x8, #2
cmp x8, #2
bge LoopH8x8
LH4:
cbz x8, E8End
LoopHRemain:
mov x15, x1
mov x12, x9
cbz x5, NoBiasHRemain
ld1 {v0.4h}, [x20]
shll v0.4s, v0.4h, #16
mov v2.16b, v0.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
SET_BIAS v16, v18, v20, v22
SET_BIAS v17, v19, v21, v23
b LoopLR
NoBiasHRemain:
SET_ZERO v16, v17, v18, v19
SET_ZERO v20, v21, v22, v23
LoopLR:
// A [8, 4, bf16] : rn = 4 : v4 - v7
// B [4, 4, bf16] : rn = 2 : v0 - v1
// C [8, 4, fp32] : rn = 8 : v16 - v23
ld1 {v4.8h, v5.8h, v6.8h, v7.8h}, [x15], x11 // A: 8 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h}, [x13], x16 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
.inst 0x6e40ecb2 // bfmmla v18.4s, v5.8h, v0.8h
.inst 0x6e41ecb3 // bfmmla v19.4s, v5.8h, v1.8h
.inst 0x6e40ecd4 // bfmmla v20.4s, v6.8h, v0.8h
.inst 0x6e41ecd5 // bfmmla v21.4s, v6.8h, v1.8h
.inst 0x6e40ecf6 // bfmmla v22.4s, v7.8h, v0.8h
.inst 0x6e41ecf7 // bfmmla v23.4s, v7.8h, v1.8h
subs x12, x12, #1
bne LoopLR
LoopLREnd:
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
uzp1 v19.2d, v20.2d, v21.2d
uzp2 v20.2d, v20.2d, v21.2d
uzp1 v21.2d, v22.2d, v23.2d
uzp2 v22.2d, v22.2d, v23.2d
cbz x5, StoreLH8x4
PostTreatLH8x4:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmax v17.4s, v17.4s, v9.4s
fmax v18.4s, v18.4s, v9.4s
fmax v19.4s, v19.4s, v9.4s
fmax v20.4s, v20.4s, v9.4s
fmax v21.4s, v21.4s, v9.4s
fmax v22.4s, v22.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
fmin v17.4s, v17.4s, v10.4s
fmin v18.4s, v18.4s, v10.4s
fmin v19.4s, v19.4s, v10.4s
fmin v20.4s, v20.4s, v10.4s
fmin v21.4s, v21.4s, v10.4s
fmin v22.4s, v22.4s, v10.4s
StoreLH8x4:
Float32ToBf16 v15, v16, v17, v18
Float32ToBf16 v19, v20, v21, v22
st1 {v15.4h, v16.4h, v17.4h, v18.4h}, [x0], #32 // 16 * sizeof(int16_t)
st1 {v19.4h, v20.4h, v21.4h, v22.4h}, [x0], #32 // 16 * sizeof(int16_t)
E8End:
sub x3, x3, #8
cmp x3, #8
add x0, x21, #64 // move dest address of 8 * 4 * sizeof(int16_t)
add x1, x1, #64 // move A matrix address of 8 * 4 * sizeof(int16_t)
bge LoopE8
E4:
cmp x3, #4
mov x20, x6
blt E2
mov x8, x10
mov x21, x0
mov x13, x2
cmp x8, #2
blt E4LH4
E4LH8:
E4LoopH8:
mov x15, x1
mov x12, x9
cbz x5, NoBiasE4
ld1 {v0.4h, v1.4h}, [x20], #16 // 8 * sizeof(int16_t)
shll v0.4s, v0.4h, #16
shll v1.4s, v1.4h, #16
mov v2.16b, v0.16b
mov v3.16b, v1.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
uzp1 v20.2d, v1.2d, v3.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v21.2d, v1.2d, v3.2d // bias_2, bias_3, bias_2, bias_3
mov v18.16b, v16.16b
mov v19.16b, v17.16b
mov v22.16b, v20.16b
mov v23.16b, v21.16b
b E4LoopL
NoBiasE4:
SET_ZERO v16, v17, v18, v19
SET_ZERO v20, v21, v22, v23
E4LoopL:
// A [4, 4, bf16] : rn = 4 : v4 - v5
// B [8, 4, bf16] : rn = 4 : v0 - v3
// C [4, 8, fp32] : rn = 8 : v16 - v23
ld1 {v4.8h, v5.8h}, [x15], x11 // A: 4 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h, v2.8h, v3.8h}, [x13], x16 // B: 8 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
.inst 0x6e40ecb2 // bfmmla v18.4s, v5.8h, v0.8h
.inst 0x6e41ecb3 // bfmmla v19.4s, v5.8h, v1.8h
.inst 0x6e42ec94 // bfmmla v20.4s, v4.8h, v2.8h
.inst 0x6e43ec95 // bfmmla v21.4s, v4.8h, v3.8h
.inst 0x6e42ecb6 // bfmmla v22.4s, v5.8h, v2.8h
.inst 0x6e43ecb7 // bfmmla v23.4s, v5.8h, v3.8h
subs x12, x12, #1
bgt E4LoopL
E4LoopLEnd:
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
uzp1 v19.2d, v20.2d, v21.2d
uzp2 v20.2d, v20.2d, v21.2d
uzp1 v21.2d, v22.2d, v23.2d
uzp2 v22.2d, v22.2d, v23.2d
cbz x5, StoreLH4x8
PostTreatLH4x8:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmax v17.4s, v17.4s, v9.4s
fmax v18.4s, v18.4s, v9.4s
fmax v19.4s, v19.4s, v9.4s
fmax v20.4s, v20.4s, v9.4s
fmax v21.4s, v21.4s, v9.4s
fmax v22.4s, v22.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
fmin v17.4s, v17.4s, v10.4s
fmin v18.4s, v18.4s, v10.4s
fmin v19.4s, v19.4s, v10.4s
fmin v20.4s, v20.4s, v10.4s
fmin v21.4s, v21.4s, v10.4s
fmin v22.4s, v22.4s, v10.4s
StoreLH4x8:
Float32ToBf16 v15, v16, v17, v18
Float32ToBf16 v19, v20, v21, v22
st1 {v15.4h, v16.4h, v17.4h, v18.4h}, [x0], x7 // 16 * sizeof(int16_t)
st1 {v19.4h, v20.4h, v21.4h, v22.4h}, [x0], x7 // 16 * sizeof(int16_t)
add x13, x13, x19 // weight stride
sub x8, x8, #2
cmp x8, #2
bge E4LoopH8
E4LH4:
cbz x8, E4End
mov x15, x1
mov x12, x9
cbz x5, NoBiasE4R
ld1 {v0.4h}, [x20]
shll v0.4s, v0.4h, #16
mov v2.16b, v0.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
mov v18.16b, v16.16b
mov v19.16b, v17.16b
b E4LoopLR
NoBiasE4R:
SET_ZERO v16, v17, v18, v19
E4LoopLR:
// A [4, 4, bf16] : rn = 4 : v4 - v5
// B [4, 4, bf16] : rn = 4 : v0 - v1
// C [4, 4, fp32] : rn = 4 : v16 - v19
ld1 {v4.8h, v5.8h}, [x15], x11 // A: 4 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h}, [x13], x16 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
.inst 0x6e40ecb2 // bfmmla v18.4s, v5.8h, v0.8h
.inst 0x6e41ecb3 // bfmmla v19.4s, v5.8h, v1.8h
subs x12, x12, #1
bgt E4LoopLR
E4LoopLREnd:
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
cbz x5, StoreLH4x4
PostTreatLH4x4:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmax v17.4s, v17.4s, v9.4s
fmax v18.4s, v18.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
fmin v17.4s, v17.4s, v10.4s
fmin v18.4s, v18.4s, v10.4s
StoreLH4x4:
Float32ToBf16 v15, v16, v17, v18
st1 {v15.4h, v16.4h, v17.4h, v18.4h}, [x0] // 16 * sizeof(int16_t)
E4End:
sub x3, x3, #4
add x0, x21, #32 // move dest address of 4 * 4 * sizeof(int16_t)
add x1, x1, #32 // move A matrix address of 4 * 4 * sizeof(int16_t)
E2:
cmp x3, #2
mov x20, x6
blt E1
mov x8, x10
mov x21, x0
mov x13, x2
cmp x8, #2
blt E2LH4
E2LH8:
E2LoopH8:
mov x15, x1
mov x12, x9
cbz x5, NoBiasE2
ld1 {v0.4h, v1.4h}, [x20], #16
shll v0.4s, v0.4h, #16
shll v1.4s, v1.4h, #16
mov v2.16b, v0.16b
mov v3.16b, v1.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
uzp1 v18.2d, v1.2d, v3.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v19.2d, v1.2d, v3.2d // bias_2, bias_3, bias_2, bias_3
b E2LoopL
NoBiasE2:
SET_ZERO v16, v17, v18, v19
E2LoopL:
// A [2, 4, bf16] : rn = 1 : v4
// B [8, 4, bf16] : rn = 2 : v0 - v3
// C [2, 8, fp32] : rn = 4 : v16 - v19
ld1 {v4.8h}, [x15], x11 // A: 2 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h, v2.8h, v3.8h}, [x13], x16 // B: 8 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
.inst 0x6e42ec92 // bfmmla v18.4s, v4.8h, v2.8h
.inst 0x6e43ec93 // bfmmla v19.4s, v4.8h, v3.8h
subs x12, x12, #1
bgt E2LoopL
E2LoopLEnd:
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
uzp1 v17.2d, v18.2d, v19.2d
uzp2 v18.2d, v18.2d, v19.2d
cbz x5, StoreLH2x8
PostTreatLH2x8:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmax v17.4s, v17.4s, v9.4s
fmax v18.4s, v18.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
fmin v17.4s, v17.4s, v10.4s
fmin v18.4s, v18.4s, v10.4s
StoreLH2x8:
Float32ToBf16 v15, v16, v17, v18
st1 {v15.4h, v16.4h}, [x0], x7 // 8 * sizeof(int16_t)
st1 {v17.4h, v18.4h}, [x0], x7 // 8 * sizeof(int16_t)
add x13, x13, x19 // weight stride
sub x8, x8, #2
cmp x8, #2
bge E2LoopH8
E2LH4:
cbz x8, E2End
mov x15, x1
mov x12, x9
cbz x5, NoBiasE2R
ld1 {v0.4h}, [x20]
shll v0.4s, v0.4h, #16
mov v2.16b, v0.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
b E2LoopLR
NoBiasE2R:
movi v16.4s, #0
movi v17.4s, #0
E2LoopLR:
// A [2, 4, bf16] : rn = 1 : v4
// B [4, 4, bf16] : rn = 2 : v0 - v1
// C [2, 4, fp32] : rn = 2 : v16 - v17
ld1 {v4.8h}, [x15], x11 // A: 2 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h}, [x13], x16 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
subs x12, x12, #1
bgt E2LoopLR
E2LoopLREnd:
uzp1 v15.2d, v16.2d, v17.2d
uzp2 v16.2d, v16.2d, v17.2d
cbz x5, StoreLH2x4
PostTreatLH2x4:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
StoreLH2x4:
shrn v15.4h, v15.4s, #16
shrn v16.4h, v16.4s, #16
st1 {v15.4h, v16.4h}, [x0] // 8 * sizeof(int16_t)
E2End:
sub x3, x3, #2
add x0, x21, #16 // move dest address of 2 * 4 * sizeof(int16_t)
add x1, x1, #16 // move A matrix address of 2 * 4 * sizeof(int16_t)
E1:
cmp x3, #0
beq End
LoopE1:
mov x20, x6
mov x8, x10
mov x21, x0
mov x13, x2
cmp x8, #2
blt E1LH4
E1LH8:
E1LoopH8:
mov x15, x1
mov x12, x9
cbz x5, NoBiasE1
ld1 {v0.4h, v1.4h}, [x20], #16
shll v0.4s, v0.4h, #16
shll v1.4s, v1.4h, #16
mov v2.16b, v0.16b
mov v3.16b, v1.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
uzp1 v18.2d, v1.2d, v3.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v19.2d, v1.2d, v3.2d // bias_2, bias_3, bias_2, bias_3
b E1LoopL
NoBiasE1:
SET_ZERO v16, v17, v18, v19
E1LoopL:
// A [1, 4, bf16] : rn = 1 : v4
// B [8, 4, bf16] : rn = 4 : v0 - v3
// C [1, 8, fp32] : rn = 4 : v16 - v19
ld1 {v4.4h}, [x15], x11 // A: 1 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h, v2.8h, v3.8h}, [x13], x16 // B: 8 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
.inst 0x6e42ec92 // bfmmla v18.4s, v4.8h, v2.8h
.inst 0x6e43ec93 // bfmmla v19.4s, v4.8h, v3.8h
subs x12, x12, #1
bgt E1LoopL
E1LoopLEnd:
// v16-v19: [r0, r1, 0, 0]
uzp1 v15.2d, v16.2d, v17.2d
uzp1 v16.2d, v18.2d, v19.2d
cbz x5, StoreLH1x8
PostTreatLH1x8:
fmax v15.4s, v15.4s, v9.4s
fmax v16.4s, v16.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
fmin v16.4s, v16.4s, v10.4s
StoreLH1x8:
shrn v15.4h, v15.4s, #16
shrn v16.4h, v16.4s, #16
st1 {v15.4h}, [x0], x7
st1 {v16.4h}, [x0], x7
add x13, x13, x19
sub x8, x8, #2
cmp x8, #2
bge E1LoopH8
E1LH4:
cbz x8, E1End
mov x15, x1
mov x12, x9
cbz x5, NoBiasE1R
ld1 {v0.4h}, [x20]
shll v0.4s, v0.4h, #16
mov v2.16b, v0.16b
uzp1 v16.2d, v0.2d, v2.2d // bias_0, bias_1, bias_0, bias_1
uzp2 v17.2d, v0.2d, v2.2d // bias_2, bias_3, bias_2, bias_3
b E1LoopLR
NoBiasE1R:
movi v16.4s, #0
movi v17.4s, #0
E1LoopLR:
// A [1, 4, bf16] : rn = 1 : v4
// B [4, 4, bf16] : rn = 2 : v0 - v1
// C [1, 8, fp32] : rn = 4 : v16 - v17
ld1 {v4.4h}, [x15], x11 // A: 1 * 4 * sizeof(int16_t)
ld1 {v0.8h, v1.8h}, [x13], x16 // B: 4 * 4 * sizeof(int16_t)
.inst 0x6e40ec90 // bfmmla v16.4s, v4.8h, v0.8h
.inst 0x6e41ec91 // bfmmla v17.4s, v4.8h, v1.8h
subs x12, x12, #1
bgt E1LoopLR
E1LoopLREnd:
uzp1 v15.2d, v16.2d, v17.2d
cbz x5, StoreLH1x4
PostTreatLH1x4:
fmax v15.4s, v15.4s, v9.4s
fmin v15.4s, v15.4s, v10.4s
StoreLH1x4:
shrn v15.4h, v15.4s, #16
st1 {v15.4h}, [x0]
E1End:
subs x3, x3, #1
add x0, x21, #8
add x1, x1, #8
bne LoopE1
End:
ldr x19, [sp, #0]
ldr x20, [sp, #8]
ldr x21, [sp, #16]
add sp, sp, #32
ret
#endif

        The build produces three shared libraries, libMNN.so, libMNNOpenCV.so, and libMNN_Express.so, which will be needed later.

        We already have the trained weight file v5Lite-e.pt; it now has to be converted first to ONNX and then from ONNX to MNN format.

        The YOLOv5-Lite author provides an export.py script. Run:

python export.py --mnnd --weight weights/v5lite-e.pt
python -m onnxsim weights/v5lite-e.onnx weights/v5lite-e-mnnd_sim.onnx

        This yields the ONNX weight file v5lite-e-mnnd_sim.onnx. Copy it to the Raspberry Pi and, from the MNN directory, run:

python ./tools/script/testMNNFromOnnx.py YOLOv5-Lite/v5Lite-e-mnnd_sim.onnx

        If it prints TEST_SUCCESS with no errors, continue with:

./build/MNNConvert -f ONNX --modelFile YOLOv5-Lite/mnnd/v5lite-e-mnnd_sim.onnx --MNNModel YOLOv5-Lite/mnnd/v5lite-e-mnnd.mnn --optimizeLevel 1 --optimizePrefer 2 --bizCode MNN --saveStaticModel --testdir val_test

        When the conversion finishes, you have the model weights in MNN format.

4. Inference

        Use the detection code the YOLOv5-Lite author provides under ./cpp_demo/mnn in the project directory. Put the three shared libraries produced by the MNN build into its lib directory, then build the detection program for the Raspberry Pi with:

mkdir build && cd build
cmake ..
make
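The demo feeds camera frames to the network by letterboxing them: scale the frame uniformly to fit the square model input, then pad the remainder. As a quick language-neutral reference (not the demo's actual code), here is a minimal sketch of that scale/padding arithmetic; the 320×320 input size and the function name letterbox_params are illustrative assumptions, so check them against your own export settings.

```python
# Sketch of the letterbox math: fit a src_w x src_h frame into a dst x dst
# model input while preserving aspect ratio. dst=320 is an assumed default
# for v5Lite-e; verify against the input size your export actually used.
def letterbox_params(src_w, src_h, dst=320):
    scale = min(dst / src_w, dst / src_h)          # uniform scale factor
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_left = (dst - new_w) // 2                  # padding on the left edge
    pad_top = (dst - new_h) // 2                   # padding on the top edge
    return scale, (new_w, new_h), (pad_left, pad_top)

# A 640x480 frame scales by 0.5 to 320x240, leaving 40 px of padding
# above and 40 px below the resized image.
print(letterbox_params(640, 480))
```

The same scale and pad offsets are reused after inference to map the predicted boxes back onto the original frame.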

        MNN inference on the Raspberry Pi is reasonably fast: on my 8 GB Raspberry Pi 4B, the MNN build reaches about 16 FPS, while onnxruntime only manages 5 to 6 FPS.
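The detection program's post-processing follows the usual YOLO recipe: keep detections above a confidence threshold, then suppress overlapping boxes with non-maximum suppression. The demo does this in C++; purely as a sketch of the idea (not the demo's actual code), greedy NMS looks like:

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thr=0.45):
    # dets: list of (x1, y1, x2, y2, score). Greedy NMS: visit boxes in
    # descending score order and keep each one that does not overlap an
    # already-kept box by more than iou_thr.
    dets = sorted(dets, key=lambda d: d[4], reverse=True)
    keep = []
    for d in dets:
        if all(iou(d[:4], k[:4]) < iou_thr for k in keep):
            keep.append(d)
    return keep
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the 0.9 one, while a distant third box survives.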

References

https://zhuanlan.zhihu.com/p/672633849

Orange Pi: MNN build error caused by nested assembly macro expansion - CSDN blog
