Intel Integrated Graphics – OpenCL Environment – Linux

Machine: Intel Core i7-6700, Intel HD Graphics 530 (Skylake), Linux (CentOS 7.2 or 7.3; mine is 7.3), Eclipse (Neon). The goal is to install an OpenCL environment in which the Intel i7-6700 acts as the host and the HD Graphics 530 as the device.

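It took me a while to realize that there is no separate HD Graphics 530 driver to download for Linux: https://downloadcenter.intel.com/product/88345/Intel-HD-Graphics-530 only lists Windows drivers (on Linux the graphics driver, i915, is part of the kernel). So I skipped the graphics driver step.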

I followed https://software.intel.com/en-us/articles/sdk-for-opencl-gsg for the installation.

1. Installing the Intel OpenCL driver

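Intel provides an install script: https://software.intel.com/sites/default/files/managed/f6/77/install_OCL_driver.sh_.txt  At first I created a new txt file and pasted the script text into it, and the installation failed every time. Do not copy and paste; download the file directly. I also uploaded a copy to http://download.csdn.net/download/wd1603926823/10219725 for direct download. Once it is downloaded, follow the official steps: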

$ mv install_OCL_driver.sh_.txt install_OCL_driver.sh
$ chmod 755 install_OCL_driver.sh
$ sudo su
$ ./install_OCL_driver.sh install

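Answer y (yes) at every prompt.

The installer then reported an error: my Linux kernel is 3.10.0-514.el7.x86_64, which does not meet the installation requirements.

When this message appeared, I checked that I had enough disk space and ran the install a second time. This time there was no error. After a long wait, the installation finished.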

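Following recommendation 1 in the installer output, I added the root user to the video group. I could not satisfy item 2, so I ignored it, and then rebooted.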
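
A minimal sketch of checking the video group membership mentioned above. The helper name `in_group` is my own; `id -nG` and `usermod -a -G` are the standard tools, and the script only prints the `usermod` command rather than running it.

```shell
#!/bin/sh
# in_group USER GROUP: succeed if USER is a member of GROUP.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Report whether the current user is already in the video group.
me=$(id -un)
if in_group "$me" video; then
    echo "$me is in the video group"
else
    echo "$me is not in the video group; as root run: usermod -a -G video $me"
fi
```

A new group membership only takes effect after logging in again, which the reboot above takes care of.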

Alternatively, open the web page named in the message, download linux-4.7.tar.xz manually, and follow the second half of https://linoxide.com/linux-how-to/upgrade-linux-kernel-stable-4-7-centos-7-x/ to install it.
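
To see in advance whether the running kernel is older than the one the installer points to, something like the following could be used. This is a sketch: the threshold 4.7 is an assumption taken from the linux-4.7 download above (the installer's exact requirement is not quoted here), `version_at_least` is a hypothetical helper name, and `sort -V` requires GNU coreutils.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed if version HAVE >= WANT (GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required=4.7            # assumption: taken from the linux-4.7 tarball
running=$(uname -r)     # e.g. 3.10.0-514.el7.x86_64 on stock CentOS 7.3

if version_at_least "$running" "$required"; then
    echo "kernel $running is at least $required"
else
    echo "kernel $running is older than $required"
fi
```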

2. Intel OpenCL SDK

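https://software.intel.com/en-us/articles/sdk-for-opencl-gsg only provides a prerequisites script for Ubuntu, not for CentOS. I wanted to try it anyway, so I downloaded the script (download it directly, do not copy and paste) and ran: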

$ mv install_SDK_prereq_ubuntu.sh_.txt install_SDK_prereq_ubuntu.sh
$ sudo su
$ ./install_SDK_prereq_ubuntu.sh

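The result was: bash: ./install_SDK_prereq_ubuntu.sh: Permission denied. But I was logged in as root! Typing clinfo also failed with "command not found". (Unlike step 1, the commands above have no chmod 755, so the script has no execute bit; that is most likely why even root gets "Permission denied".)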
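
A small sketch of how the missing execute bit could be detected before running the script. The helper name `can_execute` is hypothetical; the file name is the one used above.

```shell
#!/bin/sh
# can_execute FILE: succeed only if FILE is a regular file the caller may execute.
can_execute() {
    [ -f "$1" ] && [ -x "$1" ]
}

script=install_SDK_prereq_ubuntu.sh
if [ -f "$script" ] && ! can_execute "$script"; then
    # Either add the execute bit, or hand the file to the shell explicitly.
    echo "$script is not executable; run: chmod 755 $script (or: bash $script)"
fi
```

Running a script as `./name` needs at least one execute bit on the file even for root, whereas `bash name` only needs read permission.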
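
After thinking it over, I decided to install the SDK manually. I downloaded a suitable SDK from https://registrationcenter.intel.com/en/products/postregistration/?sn=c69g-kd7n5h3j&[email protected]&Sequence=2151318&dnld=t . To find out which version is suitable, first read the release notes at https://software.intel.com/en-us/articles/opencl-code-builder-release-notes .
The SDK can be installed with either install.sh or install_GUI.sh; I used sh install_GUI.sh. During installation it complained that the dkms package was missing, so I ran yum install epel-release dkms. It took two runs before dkms was actually installed (probably because dkms comes from the EPEL repository, which only becomes available once epel-release itself is installed). I then resumed the interrupted installation. When it finishes, the installer reports success.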

So the SDK is installed.

After installation, /opt/intel/ contains four entries: ism, opencl, opencl-sdk and one other file. The opencl-sdk folder contains include and lib64 folders, so it really does seem to have worked.
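
The dkms step above could be done in two separate yum calls so that it succeeds on the first attempt. This is a sketch under the assumption that dkms is resolved from EPEL; the `RUN` variable is my own dry-run convention, so by default the script only prints the commands.

```shell
#!/bin/sh
# RUN defaults to "echo", so the commands are only printed.
# Set RUN="" (as root) to really execute them.
RUN=${RUN-echo}

install_dkms() {
    # Step 1: enable the EPEL repository.
    $RUN yum install -y epel-release || return 1
    # Step 2: dkms can now be resolved from EPEL.
    $RUN yum install -y dkms
}

install_dkms
```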

3. Checking OpenCL device support

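Running clinfo still failed with a permissions error, so I downloaded the clinfo rpm package manually and installed it with rpm -ivh clinfo-2.1.17.02.09-1.el7.x86_64.rpm, which succeeded. After that, clinfo works: at last, 2 platforms and 2 devices show up!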

Platform 1: Intel(R) OpenCL (devices: Intel(R) HD Graphics and Intel(R) Core(TM) i7-6700 CPU);

Platform 2: Experimental OpenCL 2.1 CPU Only Platform (device: Intel(R) Core(TM) i7-6700 CPU);

Number of platforms                               2
Platform Name                                   Intel(R) OpenCL
Platform Vendor                                 Intel(R) Corporation
Platform Version                                OpenCL 2.0 
Platform Profile                                FULL_PROFILE
Platform Extensions                             cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
Platform Extensions function suffix             INTEL
Platform Name                                   Experimental OpenCL 2.1 CPU Only Platform
Platform Vendor                                 Intel(R) Corporation
Platform Version                                OpenCL 2.1 LINUX
Platform Profile                                FULL_PROFILE
Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer 
Platform Host timer resolution                  1ns
Platform Extensions function suffix             INTEL
Platform Name                                   Intel(R) OpenCL
Number of devices                                 2
Device Name                                     Intel(R) HD Graphics
Device Vendor                                   Intel(R) Corporation
Device Vendor ID                                0x8086
Device Version                                  OpenCL 2.0 
Driver Version                                  r5.0.63503
Device OpenCL C Version                         OpenCL C 2.0 
Device Type                                     GPU
Device Available                                Yes
Device Profile                                  FULL_PROFILE
Max compute units                               24
Max clock frequency                             1150MHz
Device Partition                                (core)
Max number of sub-devices                     0
Supported partition types                     None
Max work item dimensions                        3
Max work item sizes                             256x256x256
Max work group size                             256
Compiler Available                              Yes
Linker Available                                Yes
Preferred work group size multiple              32
Sub-group sizes (Intel)                         8x16x32
Preferred / native vector sizes                 
char                                                16 / 16      
short                                                8 / 8       
int                                                  4 / 4       
long                                                 1 / 1       
half                                                 8 / 8        (cl_khr_fp16)
float                                                1 / 1       
double                                               1 / 1        (cl_khr_fp64)
Half-precision Floating-point support           (cl_khr_fp16)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 Yes
Round to infinity                             Yes
IEEE754-2008 fused multiply-add               Yes
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  No
Single-precision Floating-point support         (core)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 Yes
Round to infinity                             Yes
IEEE754-2008 fused multiply-add               Yes
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  Yes
Double-precision Floating-point support         (cl_khr_fp64)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 Yes
Round to infinity                             Yes
IEEE754-2008 fused multiply-add               Yes
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  No
Address bits                                    64, Little-Endian
Global memory size                              13238812672 (12.33GiB)
Error Correction support                        No
Max memory allocation                           4294959103 (4GiB)
Unified memory for Host and Device              Yes
Shared Virtual Memory (SVM) capabilities        (core)
Coarse-grained buffer sharing                 Yes
Fine-grained buffer sharing                   No
Fine-grained system sharing                   No
Atomics                                       No
Minimum alignment for any data type             128 bytes
Alignment of base address                       1024 bits (128 bytes)
Preferred alignment for atomics                 
SVM                                           64 bytes
Global                                        64 bytes
Local                                         64 bytes
Max size for global variable                    65536 (64KiB)
Preferred total size of global vars             4294959103 (4GiB)
Global Memory cache type                        Read/Write
Global Memory cache size                        524288 (512KiB)
Global Memory cache line                        64 bytes
Image support                                   Yes
Max number of samplers per kernel             16
Max size for 1D images from buffer            268434943 pixels
Max 1D or 2D image array size                 2048 images
Base address alignment for 2D image buffers   4 bytes
Pitch alignment for 2D image buffers          4 bytes
Max 2D image size                             16384x16384 pixels
Max planar YUV image size                     16384x16380 pixels
Max 3D image size                             16384x16384x2048 pixels
Max number of read image args                 128
Max number of write image args                128
Max number of read/write image args           128
Max number of pipe args                         16
Max active pipe reservations                    1
Max pipe packet size                            1024
Local memory type                               Local
Local memory size                               65536 (64KiB)
Max constant buffer size                        4294959103 (4GiB)
Max number of constant args                     8
Max size of kernel argument                     1024
Queue properties (on host)                      
Out-of-order execution                        Yes
Profiling                                     Yes
Queue properties (on device)                    
Out-of-order execution                        Yes
Profiling                                     Yes
Preferred size                                131072 (128KiB)
Max size                                      67108864 (64MiB)
Max queues on device                            1
Max events on device                            1024
Prefer user sync for interop                    Yes
Profiling timer resolution                      83ns
Execution capabilities                          
Run OpenCL kernels                            Yes
Run native kernels                            No
SPIR versions                                 1.2 
printf() buffer size                            4194304 (4MiB)
Built-in kernels                                block_motion_estimate_intel;block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidirectional_check_intel
Motion Estimation accelerator version (Intel)   2
Device-side AVC Motion Estimation version     1
Supports texture sampler use                Yes
Supports preemption                         No
Device Extensions                               cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_device_side_avc_motion_estimation cl_intel_driver_diagnostics cl_intel_media_block_io cl_intel_motion_estimation cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_subgroups cl_intel_subgroups_short cl_intel_va_api_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_khr_spir cl_khr_subgroups 
Device Name                                     Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Device Vendor                                   Intel(R) Corporation
Device Vendor ID                                0x8086
Device Version                                  OpenCL 2.0 (Build 475)
Driver Version                                  1.2.0.475
Device OpenCL C Version                         OpenCL C 2.0 
Device Type                                     CPU
Device Available                                Yes
Device Profile                                  FULL_PROFILE
Max compute units                               8
Max clock frequency                             3400MHz
Device Partition                                (core)
Max number of sub-devices                     8
Supported partition types                     by counts, equally, by names (Intel)
Max work item dimensions                        3
Max work item sizes                             8192x8192x8192
Max work group size                             8192
Compiler Available                              Yes
Linker Available                                Yes
Preferred work group size multiple              128
Preferred / native vector sizes                 
char                                                 1 / 32      
short                                                1 / 16      
int                                                  1 / 8       
long                                                 1 / 4       
half                                                 0 / 0        (n/a)
float                                                1 / 8       
double                                               1 / 4        (cl_khr_fp64)
Half-precision Floating-point support           (n/a)
Single-precision Floating-point support         (core)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 No
Round to infinity                             No
IEEE754-2008 fused multiply-add               No
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  No
Double-precision Floating-point support         (cl_khr_fp64)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 Yes
Round to infinity                             Yes
IEEE754-2008 fused multiply-add               Yes
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  No
Address bits                                    64, Little-Endian
Global memory size                              16559001600 (15.42GiB)
Error Correction support                        No
Max memory allocation                           4139750400 (3.855GiB)
Unified memory for Host and Device              Yes
Shared Virtual Memory (SVM) capabilities        (core)
Coarse-grained buffer sharing                 Yes
Fine-grained buffer sharing                   No
Fine-grained system sharing                   No
Atomics                                       No
Minimum alignment for any data type             128 bytes
Alignment of base address                       1024 bits (128 bytes)
Preferred alignment for atomics                 
SVM                                           64 bytes
Global                                        64 bytes
Local                                         0 bytes
Max size for global variable                    65536 (64KiB)
Preferred total size of global vars             65536 (64KiB)
Global Memory cache type                        Read/Write
Global Memory cache size                        262144 (256KiB)
Global Memory cache line                        64 bytes
Image support                                   Yes
Max number of samplers per kernel             480
Max size for 1D images from buffer            258734400 pixels
Max 1D or 2D image array size                 2048 images
Base address alignment for 2D image buffers   64 bytes
Pitch alignment for 2D image buffers          64 bytes
Max 2D image size                             16384x16384 pixels
Max 3D image size                             2048x2048x2048 pixels
Max number of read image args                 480
Max number of write image args                480
Max number of read/write image args           480
Max number of pipe args                         16
Max active pipe reservations                    32767
Max pipe packet size                            1024
Local memory type                               Global
Local memory size                               32768 (32KiB)
Max constant buffer size                        131072 (128KiB)
Max number of constant args                     480
Max size of kernel argument                     3840 (3.75KiB)
Queue properties (on host)                      
Out-of-order execution                        Yes
Profiling                                     Yes
Local thread execution (Intel)                Yes
Queue properties (on device)                    
Out-of-order execution                        Yes
Profiling                                     Yes
Preferred size                                4294967295 (4GiB)
Max size                                      4294967295 (4GiB)
Max queues on device                            4294967295
Max events on device                            4294967295
Prefer user sync for interop                    No
Profiling timer resolution                      1ns
Execution capabilities                          
Run OpenCL kernels                            Yes
Run native kernels                            Yes
SPIR versions                                 1.2
printf() buffer size                            1048576 (1024KiB)
Built-in kernels                                
Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer 
Platform Name                                   Experimental OpenCL 2.1 CPU Only Platform
Number of devices                                 1
Device Name                                     Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Device Vendor                                   Intel(R) Corporation
Device Vendor ID                                0x8086
Device Version                                  OpenCL 2.1 (Build 10)
Driver Version                                  1.2.0.10
Device OpenCL C Version                         OpenCL C 2.0 
Device Type                                     CPU
Device Available                                Yes
Device Profile                                  FULL_PROFILE
Max compute units                               8
Max clock frequency                             3400MHz
Device Partition                                (core)
Max number of sub-devices                     8
Supported partition types                     by counts, equally, by names (Intel)
Max work item dimensions                        3
Max work item sizes                             8192x8192x8192
Max work group size                             8192
Compiler Available                              Yes
Linker Available                                Yes
Preferred work group size multiple              128
Max sub-groups per work group                   1
Preferred / native vector sizes                 
char                                                 1 / 32      
short                                                1 / 16      
int                                                  1 / 8       
long                                                 1 / 4       
half                                                 0 / 0        (n/a)
float                                                1 / 8       
double                                               1 / 4        (cl_khr_fp64)
Half-precision Floating-point support           (n/a)
Single-precision Floating-point support         (core)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 No
Round to infinity                             No
IEEE754-2008 fused multiply-add               No
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  No
Double-precision Floating-point support         (cl_khr_fp64)
Denormals                                     Yes
Infinity and NANs                             Yes
Round to nearest                              Yes
Round to zero                                 Yes
Round to infinity                             Yes
IEEE754-2008 fused multiply-add               Yes
Support is emulated in software               No
Correctly-rounded divide and sqrt operations  No
Address bits                                    64, Little-Endian
Global memory size                              16559001600 (15.42GiB)
Error Correction support                        No
Max memory allocation                           4139750400 (3.855GiB)
Unified memory for Host and Device              Yes
Shared Virtual Memory (SVM) capabilities        (core)
Coarse-grained buffer sharing                 Yes
Fine-grained buffer sharing                   Yes
Fine-grained system sharing                   Yes
Atomics                                       Yes
Minimum alignment for any data type             128 bytes
Alignment of base address                       1024 bits (128 bytes)
Preferred alignment for atomics                 
SVM                                           64 bytes
Global                                        64 bytes
Local                                         0 bytes
Max size for global variable                    65536 (64KiB)
Preferred total size of global vars             65536 (64KiB)
Global Memory cache type                        Read/Write
Global Memory cache size                        262144 (256KiB)
Global Memory cache line                        64 bytes
Image support                                   Yes
Max number of samplers per kernel             480
Max size for 1D images from buffer            258734400 pixels
Max 1D or 2D image array size                 2048 images
Base address alignment for 2D image buffers   64 bytes
Pitch alignment for 2D image buffers          64 bytes
Max 2D image size                             16384x16384 pixels
Max 3D image size                             2048x2048x2048 pixels
Max number of read image args                 480
Max number of write image args                480
Max number of read/write image args           480
Max number of pipe args                         16
Max active pipe reservations                    32767
Max pipe packet size                            1024
Local memory type                               Global
Local memory size                               32768 (32KiB)
Max constant buffer size                        131072 (128KiB)
Max number of constant args                     480
Max size of kernel argument                     3840 (3.75KiB)
Queue properties (on host)                      
Out-of-order execution                        Yes
Profiling                                     Yes
Local thread execution (Intel)                Yes
Queue properties (on device)                    
Out-of-order execution                        Yes
Profiling                                     Yes
Preferred size                                4294967295 (4GiB)
Max size                                      4294967295 (4GiB)
Max queues on device                            4294967295
Max events on device                            4294967295
Prefer user sync for interop                    No
Profiling timer resolution                      1ns
Execution capabilities                          
Run OpenCL kernels                            Yes
Run native kernels                            Yes
Sub-group independent forward progress        No
IL version                                    SPIR-V_1.0
SPIR versions                                 1.2
printf() buffer size                            1048576 (1024KiB)
Built-in kernels                                
Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer 
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
clCreateContext(NULL, ...) [default]            No platform
clCreateContext(NULL, ...) [other]              Success [INTEL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
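
The listing above is long. A sketch like the following condenses it to the platform and device lines; `summarize_clinfo` is a hypothetical helper that only filters the text clinfo prints, and the field names are the ones visible in the output above.

```shell
#!/bin/sh
# summarize_clinfo: read clinfo output on stdin and keep only the
# platform count, platform names, device names and device types.
summarize_clinfo() {
    grep -E '^[[:space:]]*(Number of platforms|Platform Name|Device Name|Device Type)[[:space:]]'
}

# Only call clinfo when it is actually installed.
if command -v clinfo >/dev/null 2>&1; then
    clinfo | summarize_clinfo
fi
```

Each platform name shows up twice in the summary, because clinfo prints it once in the platform section and again above that platform's device list.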

So happy! The OpenCL environment for Intel integrated graphics on Linux is now set up!

4. Creating an OpenCL test project in Eclipse

In the project's build settings, add /opt/intel/opencl-sdk/include to the include paths and …/lib64 to the library paths. That is all.
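
Outside Eclipse, the same settings correspond roughly to the gcc command line printed below. This is a sketch: the lib64 location is assumed to be the one seen under opencl-sdk after installation, `main.c` is a hypothetical source file, and `opencl_build_cmd` only prints the command instead of running it.

```shell
#!/bin/sh
# Assumed SDK location, based on the folders seen after installation.
SDK=${SDK:-/opt/intel/opencl-sdk}

# opencl_build_cmd SRC OUT: print a gcc command that compiles SRC and links libOpenCL.
opencl_build_cmd() {
    printf 'gcc %s -o %s -I%s/include -L%s/lib64 -lOpenCL\n' "$1" "$2" "$SDK" "$SDK"
}

opencl_build_cmd main.c main    # main.c is a hypothetical file name
```

In Eclipse terms, -I is the include path, -L the library search path, and -lOpenCL the library name (OpenCL) in the linker's library list.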

5. Intel OpenCL samples

See https://software.intel.com/en-us/intel-opencl-support/code-samples and https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics . I am worried about accidentally turning what should be a zero-copy transfer into a memory-to-memory copy, in which case using the integrated GPU would be pointless.
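
As far as I recall from the Intel article linked above, a buffer created with CL_MEM_USE_HOST_PTR can avoid the copy when the host pointer is aligned to 4096 bytes and the size is a multiple of 64 bytes. Treat both numbers as assumptions and check them against the article. The actual allocation happens in the C host code; the sketch below only shows the arithmetic, with hypothetical helper names.

```shell
#!/bin/sh
PAGE=4096    # assumed base-address alignment for zero copy
LINE=64      # assumed size granularity (one cache line)

# round_up N M: print the smallest multiple of M that is >= N.
round_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# zero_copy_ok ADDR SIZE: succeed if ADDR and SIZE meet both conditions.
zero_copy_ok() {
    [ $(( $1 % PAGE )) -eq 0 ] && [ $(( $2 % LINE )) -eq 0 ]
}

round_up 1000 "$LINE"    # padded allocation size for a 1000-byte payload
```

For a 1000-byte payload this prints 1024, the size to request from an aligned allocator before handing the pointer to clCreateBuffer.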

https://software.intel.com/en-us/forums/opencl/topic/708049 

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics  

https://arrayfire.com/zero-copy-on-integrated-gpus/

http://blog.csdn.net/wcj0626/article/details/46360605  

http://www.cnblogs.com/lifan3a/articles/4613858.html

6. Intel debugging and performance analysis tools for Linux

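On the official site I saw that Amplifier XE is a paid product. I then found Intel VTune Performance Analyzer and downloaded it from http://download.csdn.net/download/morre/8557459 ; there is another download link at http://intel.brothersoft.com/intel-vtune-performance-analyzer-for-linux.html?fromsearch .

Related links:

http://qcd.phys.cmu.edu/QCDcluster/intel/VTLRELEASENOTES.htm compares the different versions.

http://blog.csdn.net/yangsc1984/article/details/3908823 is someone else's introduction in Chinese.

https://www.viva64.com/en/t/0091/ is someone else's introduction in English.

http://blog.csdn.net/honey_yyang/article/details/7848923 describes the powerful features available when the tool is combined with Eclipse.

https://software.intel.com/en-us/code-builder-user-manual-installing-opencl-kernel-debugger-for-linux : does the official site mean that the debugger is already included in the SDK?

http://opencl38.rssing.com/chan-13188063/all_p8.html is a forum thread about problems someone else ran into while debugging.

I unpacked the archive and started the installation, but it asked for a serial number, which I do not have. Intel's site also no longer offers Intel VTune Performance Analyzer; the only Linux tools listed now are those at https://software.intel.com/en-us/intel-devtools-by-os/linux . Without a serial number I cannot install the free OpenCL tool, so the only option left is the 30-day trial of https://software.intel.com/en-us/intel-vtune-amplifier-xe , installed by following https://software.intel.com/en-us/vtune-amplifier-install-guide-linux-user-interface-install .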

7. An example compared with my earlier AMD setup

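Previously, with an AMD 560 graphics card: the CPU version took about 19 s and the OpenCL version about 4.5 s (the same for single-threaded and multi-threaded).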

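Now, with the Intel Skylake integrated GPU: the CPU version takes about 17.3 s and the single-threaded OpenCL version about 5 s, but the multi-threaded OpenCL version takes a surprising 38 s?! (Something probably went wrong when I changed the OpenCL version from single-threaded to multi-threaded.)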
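
The timings above can be turned into speedup factors (CPU time divided by OpenCL time) with a few lines of shell. This is only arithmetic on the approximate numbers quoted in this section; `speedup` is a hypothetical helper name.

```shell
#!/bin/sh
# speedup CPU_SECONDS OPENCL_SECONDS: print CPU time / OpenCL time, two decimals.
speedup() {
    awk -v cpu="$1" -v ocl="$2" 'BEGIN { printf "%.2f\n", cpu / ocl }'
}

echo "AMD 560, OpenCL:                   $(speedup 19 4.5)x"
echo "Skylake iGPU, single-threaded:     $(speedup 17.3 5)x"
echo "Skylake iGPU, multi-threaded:      $(speedup 17.3 38)x"
```

This gives roughly 4.22x on the AMD card and 3.46x for the single-threaded Skylake run, while the multi-threaded run comes out at 0.46x, that is, slower than the plain CPU version.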