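This article shares my personal experience learning and using OpenCL.
I used a desktop host with PCIe slots, an i7-6700 CPU, and 32 GB of RAM.
For accelerators, the host carries two devices: an NVIDIA GTX 1080 Ti GPU and, on the FPGA side, a Terasic Starter Platform for OpenVINO™ Toolkit (TSP) development board, whose FPGA is a Cyclone V.
The main hardware specifications of the FPGA board are listed below:
When the board is used as an OpenCL device, these are essentially the only items that matter: the FPGA core, the on-board memory (device memory), and the PCIe interface.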
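The GPU and the FPGA board go into PCIe slots on the motherboard. My motherboard is a Z170A KRAIT GAMING, and according to the official documentation it provides three PCIe x16 slots:
Because of the CPU's limits, the three slots cannot all run at full width. The documentation lists two modes: x16/x0/x4 and x8/x8/x4. Since the FPGA board only needs an x4 link, and I wanted the GPU to run at full speed, I installed the hardware according to mode 1: the GPU in slot 1 and the FPGA board in slot 3.
The result is shown in the picture above. In principle some setting should also put the motherboard into x8/x8/x4 mode, which would make the second PCIe slot usable as well, but I have not yet looked into how to configure it.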
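Download the OpenCL SDK from the Intel website. I chose the fairly old 17.1 release because it matches the version used in the board documentation, which reduces the chance of running into version-related bugs.
The SDK is roughly 20 GB. The installer sets up both Quartus and the OpenCL SDK, and you can choose to install into the system directory /opt. The directory layout after installation is shown below.
The software cannot be used right after installation: the environment has to be configured every time before Quartus is launched. It is therefore worth writing a setup script, so that running it once at startup configures everything. My script is shown in the figure below.
The script declares the locations of the software and the BSP, calls init_opencl.sh to configure the environment, and adds the bin and lib directories of Quartus and the OpenCL SDK to the system environment.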
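Since the original script survives only as an image, here is a minimal sketch of what such a setup script might look like. It is not the author's script. The install root /opt/intelFPGA/17.1, the BSP directory name tsp_bsp, and the library subdirectories are assumptions to adjust for your machine; the function name setup_opencl_env is hypothetical. The variable names follow what the 17.1 SDK documentation uses, as far as I can tell.

```shell
#!/bin/bash
# Sketch of an environment setup script. Source this file (do not execute it)
# so that the exported variables persist in the current shell.

setup_opencl_env() {
    # Both defaults are assumptions; pass your own paths as $1 and $2.
    local install_root="${1:-/opt/intelFPGA/17.1}"
    local bsp_root="${2:-$install_root/hld/board/tsp_bsp}"   # hypothetical BSP path

    # Declare where Quartus, the OpenCL SDK, and the BSP live.
    export QUARTUS_ROOTDIR="$install_root/quartus"
    export INTELFPGAOCLSDKROOT="$install_root/hld"
    export AOCL_BOARD_PACKAGE_ROOT="$bsp_root"

    # Put the tool and library directories on the search paths
    # (the lib subdirectories are assumed, check your install).
    export PATH="$QUARTUS_ROOTDIR/bin:$INTELFPGAOCLSDKROOT/bin:$PATH"
    export LD_LIBRARY_PATH="$INTELFPGAOCLSDKROOT/host/linux64/lib:$AOCL_BOARD_PACKAGE_ROOT/linux64/lib:${LD_LIBRARY_PATH:-}"

    # Let the SDK's own script finish the setup when it is present.
    if [ -f "$INTELFPGAOCLSDKROOT/init_opencl.sh" ]; then
        . "$INTELFPGAOCLSDKROOT/init_opencl.sh"
    else
        echo "init_opencl.sh not found under $INTELFPGAOCLSDKROOT" >&2
        return 1
    fi
}
```

To use it, source the file in a new terminal and call setup_opencl_env, optionally passing the install root and the BSP root as the two arguments.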
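Go to the bringup directory of the BSP folder and write the prebuilt hello_world.aocx bitstream to the board's flash. Note that the target is the flash, so that the image survives power-off.
aocl flash acl0 hello_world.aocx
If the following error appears during flashing: Programming hardware cable not detected, the USB Blaster driver may not be installed correctly, or the user may lack the required permissions. Try flashing again as root.
A successful flash looks like this: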
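At this point the computer generally needs a power cycle so that the system re-enumerates the FPGA as a PCIe device. Note that the machine must be powered off completely first. After it is back up, run:
lspci | grep Altera
to verify the connection. My output is as follows:
This output means the system has recognized the FPGA, and that it appears as a PCIe accelerator card.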
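The same check can be scripted. The sketch below is only an illustration: the helper names has_altera_device and link_width are hypothetical, and the width parser assumes the usual "LnkSta: ... Width xN" line that pciutils prints in verbose mode. It also reports the negotiated link width, which is useful for confirming that the board in slot 3 really trained at x4.

```shell
# Succeeds if lspci-style text on stdin mentions an Altera device.
has_altera_device() {
    grep -qi 'altera'
}

# Prints the negotiated link width (for example "x4") found in
# `lspci -vv` text on stdin; prints nothing if no LnkSta line is present.
link_width() {
    sed -n 's/.*LnkSta:.*Width \(x[0-9][0-9]*\).*/\1/p' | head -n 1
}

# On a real machine (1172 is Altera's PCI vendor ID; -vv usually needs root
# to show the link capabilities):
#   lspci | has_altera_device && echo "FPGA visible on PCIe"
#   sudo lspci -d 1172: -vv | link_width
```

Both helpers read from stdin rather than calling lspci themselves, so they can be tried on saved output as well as on a live system.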
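Once the device is recognized, the next step is to install the driver, which handles control of the FPGA (roughly comparable to installing a graphics card driver). As the output shows, it mainly installs a few PCI components such as dma, cvp, and queue. Note that this driver is still the one shipped in the OpenCL BSP from the board vendor. Each board needs its own driver, and they are not interchangeable.
Run:
aocl install
This requires root privileges and installs the device driver on the host.
After the driver is installed, a reboot may be needed. Then run the following command as a test:
aocl diagnose
If all goes well, the command line prints the board information: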
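The BSP ships some sample code that you can compile to check that the board computes correctly. The flow for compiling an FPGA image is as follows (source: aocl_getting_started-17-1-683188-704901):
Normally the CL kernel can be emulated in software before going to hardware, which corresponds to the "Compile kernel for emulation" step. Since this test is mainly about verifying that the FPGA computes correctly, I skipped straight to compiling the hardware bitstream. From the root directory of vector_add, start the compile with:
aoc device/vector_add.cl -o bin/vector_add.aocx -board=c5gt -v
This step compiles the .cl file into an OpenCL image, the .aocx file, which amounts to building a dedicated hardware accelerator.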
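For reference, the emulation step that was skipped here could be sketched as follows. This was not run in the original write-up. The -march=emulator flag comes from the aoc help text; the helper names run and compile_for_emulator are hypothetical, and the DRY_RUN switch is only there so the command can be inspected on a machine without the SDK.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Compile a kernel for the x86 emulator instead of the FPGA.
# $1 = kernel source (.cl), $2 = output image (.aocx), $3 = board name
compile_for_emulator() {
    run aoc -march=emulator "$1" -o "$2" -board="$3" -v
}

# Example (dry run, nothing is compiled):
#   DRY_RUN=1 compile_for_emulator device/vector_add.cl bin/vector_add.aocx c5gt
```

To my understanding of the 17.1 documentation, the host program is then started with the environment variable CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=1 so that the runtime exposes an emulated device; treat that variable name as an assumption and check it against your SDK version.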
For the .aoco and .aocx files, here is the official description:
- A .aoco object file is an intermediate object file that contains information for later stages of the compilation.
- A .aocx image file is the hardware configuration file and contains information necessary to program the FPGA at runtime.
- The .aocx file contains data that the host application uses to create program objects, a concept within the OpenCL runtime API, for the target FPGA. The host application first loads these program objects into memory. Then the host runtime uses these program objects to program the target FPGA, as required for kernel launch operations by the host program.
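Compilation takes a long time even for a simple design, because it goes through the whole FPGA compilation flow again, and it is heavy on CPU and memory.
The time also varies with the size of the FPGA on the board: the larger the device, the slower the compile.
*About the aoc command: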
aoc -- Intel(R) FPGA SDK for OpenCL(TM) Kernel Compiler

Usage: aoc <options> <file>.[cl|aoco]

Example:
    # First generate an <file>.aoco file
    aoc -c mykernels.cl
    # Now compile the project into a hardware programming file <file>.aocx.
    aoc mykernels.aoco
    # Or generate all at once
    aoc mykernels.cl

Outputs:
    <file>.aocx and/or <file>.aoco

Help Options:
    -version
        Print out version infomation and exit
    -v
        Verbose mode. Report progress of compilation
    -q
        Quiet mode. Progress of compilation is not reported
    -report
        Print area estimates to screen after intial compilation. The report is always written to the log file.
    -h
    -help
        Show this message

Overall Options:
    -c
        Stop after generating a <file>.aoco
    -o <output>
        Use <output> as the name for the output. If running with the '-c' option the output file extension should be '.aoco'. Otherwise the file extension should be '.aocx'. If neither extension is specified, the appropriate extension will be added automatically.
    -march=emulator
        Create kernels that can be executed on x86
    -g
        Add debug data to kernels. Also, makes it possible to symbolically debug kernels created for the emulator on an x86 machine (Linux only). This behavior is enabled by default. This flag may be used to override the -g0 flag.
    -g0
        Don't add debug data to kernels.
    -profile(=<all|autorun|enqueued>)
        Enable profile support when generating aocx file:
            all: profile all kernels.
            autorun: profile only autorun kernels.
            enqueued: profile only non-autorun kernels.
        If there is no argument provided, then the mode defaults to 'all'. Note that this does have a small performance penalty since profile counters will be instantiated and take some FPGA resources.
    -shared
        Compile OpenCL source file into an object file that can be included into a library. Implies -c.
    -I <directory>
        Add directory to header search path.
    -L <directory>
        Add directory to OpenCL library search path.
    -l <library.aoclib>
        Specify OpenCL library file.
    -D <name>
        Define macro, as name=value or just name.
    -W
        Suppress warning.
    -Werror
        Make all warnings into errors.
    -library-debug
        Generate debug output related to libraries.

Modifiers:
    -board=<board name>
        Compile for the specified board. Default is c5gt.
    -list-boards
        Print a list of available boards and exit.
    -bsp-flow=<flow name>
        Specify the bsp compilation flow by name. If none given, the board's default flow is used.

Optimization Control:
    -no-interleaving=<global memory name>
        Configure a global memory as separate address spaces for each DIMM/bank. User should then use the Altera specific cl_mem_flags (E.g. CL_CHANNEL_2_INTELFPGA) to allocate each buffer in one DIMM or the other. The argument 'default' can be used to configure the default global memory. Consult your board's documentation for the memory types available. See the Best Practices Guide for more details.
    -const-cache-bytes=<N>
        Configure the constant cache size (rounded up to closest 2^n). If none of the kernels use the __constant address space, this argument has no effect.
    -fp-relaxed
        Allow the compiler to relax the order of arithmetic operations, possibly affecting the precision
    -fpc
        Removes intermediary roundings and conversions when possible, and changes the rounding mode to round towards zero for multiplies and adds
    -fast-compile
        Compiles the design with reduced effort for a faster compile time but reduced fmax and lower power efficiency. Compiled aocx should only be used for internal development and not for deploying in final product.
    -high-effort
        Increases aocx compile effort to improve ability to fit kernel on the device.
    -emulator-channel-depth-model=<default|strict|ignore-depth>
        Controls the depths of channels used by the emulator:
            default: Channels with explicitly-specified depths will use the specified depths. Channels with unspecified depths will use a depth >10000.
            strict: As default except channels of unspecified depth will use a depth of 1.
            ignore-depth: All channels will use a depth >10000.
    -cl-single-precision-constant
    -cl-denorms-are-zero
    -cl-opt-disable
    -cl-strict-aliasing
    -cl-mad-enable
    -cl-no-signed-zeros
    -cl-unsafe-math-optimizations
    -cl-finite-math-only
    -cl-fast-relaxed-math
        OpenCL required options. See OpenCL specification for details
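If you are in a hurry, you should be able to add the -fast-compile option. Compilation will probably be faster, at the cost of a less favorable clock frequency and power efficiency, which should matter little during intermediate development.
After compilation finishes, go to the bin folder to see the generated files:
Program the .aocx file with:
aocl program acl0 vector_add.aocx
aocl program differs from aocl flash in two ways. First, program does not require a power cycle, so the host test can run as soon as programming finishes. Second, an image loaded with program is not retained after power-off, so it is best to program it again before use.
Once programming is reported as successful, run the host program directly. The result looks like this:
The total time is about 21 ms and the kernel time about 8 ms. The remaining time is mainly spent moving data. This completes the full OpenCL test flow on the FPGA.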
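The split between kernel time and everything else can be worked out with a few lines of shell arithmetic. This is only a sketch using the two rounded figures quoted above; the function name time_outside_kernel is hypothetical, and attributing all of the remainder to data transfer is the article's estimate, not a measurement.

```shell
# Split a total host-side time into kernel time and everything else.
# $1 = total time in ms, $2 = kernel time in ms
time_outside_kernel() {
    local total="$1" kernel="$2"
    local other=$(( total - kernel ))
    # Integer percentage of the total, rounded to the nearest whole number.
    local pct=$(( (other * 100 + total / 2) / total ))
    echo "${other} ms outside the kernel (${pct}% of total)"
}

time_outside_kernel 21 8   # prints: 13 ms outside the kernel (62% of total)
```

With these numbers, well over half of the run is spent outside the kernel, which is why transfer overhead dominates for a workload as small as vector_add.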