How to share a big array, like a lookup table, among various kernel calls #55

rajxabc opened this issue Jun 26, 2019 · 6 comments

rajxabc commented Jun 26, 2019

Hi Tugrul,
Is it possible to share some read-only (big) array between kernel calls, something like cudaMemcpyToSymbol, in your OpenCL 1.2 API?

Or is it possible to use one kernel to load the big array and a second kernel to access it?

Kindly look at the code below:

public int nRun()
{
    double[] sintble = new double[1024]; // constant
    double[] costble = new double[1024]; // constant
    // for (int i = 0; i < 1024; i++)
    // {
    //     fill sin/cos tables
    // }
    double[] BigTable = new double[16777216]; // big, constant
    // { fill big table }

    double[] myArr = new double[1024]; // variant

    ClNumberCruncher gpu = new ClNumberCruncher(AcceleratorType.GPU | AcceleratorType.CPU, @"
        __kernel void myCalc(
            __global double * v1,
            __global double * v2,
            __global double * v3)
        {
            int i = get_global_id(0);

            for(int kk = 0; kk < 16777216; kk++) // BigTable length, to be passed as an argument
            {
                if(v2[i] + v3[i] > 0.0)    // some sin/cos condition
                    v1[i] += BigTable[kk]; // how to access the shared BigTable here?
            }
        }
        ");

    ClArray<double> ddf = BigTable;
    ddf.readOnly = true;
    // { load to kernel as a shared lookup table for multiple kernel calls } ???

    ClArray<double> bbf = sintble;
    ClArray<double> ccf = costble;

    for (int i = 0; i < 100; i++)
    {
        myArr = new double[1024];
        // fill myArr
        ClArray<double> aaf = myArr;
        aaf.nextParam(bbf, ccf).compute(gpu, 1, "myCalc", 1024);
        // do something with myArr
    }
    return 0;
}

tugrul512bit commented Jun 26, 2019

Arrays with the __constant prefix can hold at most 64 kB (so 64k char elements) and they are not shared between different kernels anyway:

                __constant char arr[65536] = {0};       // at most 64 kB
                __kernel void hello(__global char * a)  { /* sees arr */ }
                __kernel void hello2(__global char * a) { /* sees another copy of arr, but a is the same data */ }

Every kernel has its own compilation unit. So to avoid duplicating big data, use it as a kernel parameter backed by a ClArray. Since it is non-changing data, you should copy it only once, or even initialize it on the GPU side, then use it with read/write disabled:

                array.read = false;   // no more host-to-device copies before kernel runs
                array.write = false;  // no device-to-host copies after kernel runs

but make sure the data is initialized before the array is used with these flags (after this state change).
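
For instance, a minimal sketch of that ordering (the dummy loader kernel name here is hypothetical):

                ClArray<double> ddf = BigTable;
                // the first compute call uploads the data (read is true by default)
                ddf.compute(gpu, 1, "dummyLoader", 1024);
                // afterwards, disable copies; subsequent kernels reuse the same buffer
                ddf.read = false;
                ddf.write = false;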

This makes all kernels use the same parameter data without duplicating it for each kernel, but it will still be duplicated for each GPU, which is a hardware issue. To overcome this, you can use the zero-copy flag:

array.zeroCopy = true;

As you can guess, this is for "streaming"-type calculations where each element is read only once (you are free to read/write multiple times, but performance is bad for that) and data moves to the GPU only when it is needed. This is direct RAM access by all GPUs using it. Multiple GPUs reading from the same cell is legal, but one GPU writing while another reads concurrently is illegal.

Sorry for the late reply.


tugrul512bit commented Jun 26, 2019

Using the readOnly flag enables an optimization for cases like:

  • on every iteration, the CPU writes data and the GPU reads it
  • transfers still go through PCI-E, but on a more optimized path than a normal array

This is for hardware performance, not software; hence the "flag" setting.
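
For instance, with the readOnly flag from the question's code:

                ClArray<double> bbf = sintble;
                bbf.readOnly = true; // hardware-side hint: the host writes, the device only reads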

On the software side, decorating the parameter with const (and maybe restrict too)

                __kernel void test(const __global char * arr)
                {
                    int i = get_global_id(0);
                    // ... reads of arr[i] can take the read-only/constant data path
                }

should enable Nvidia's fast data path optimizations or AMD's equivalent for the kernel-side data loading.

tugrul512bit commented Jun 26, 2019

No, there is no equivalent of setting constant arrays from a CPU command (like cudaMemcpyToSymbol), I'm sorry.


tugrul512bit commented Jun 26, 2019

If there is "initialize once, use always" scenario, then I'd do this:

  • set the read flag to true (default) and the write flag to false
  • initialize its data
  • run your real kernel (you can also load it with a dummy kernel if the array size and kernel workitems don't match)
    • if kernel workitems and array elements (for loading) don't match, set partialRead to false to force it to load the whole array at once regardless of workitem size
  • set the read flag to false
  • run your real kernel

That for loop of yours could have a flag change in the first step; maybe that's all that's needed. But unique shared variables have to be used as a "parameter" of the kernel, as sketched below.
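
A hedged sketch of that flow, adapted from the loop in the question (passing ddf as an extra nextParam argument is an assumption about the wiring):

                ClArray<double> ddf = BigTable;
                ddf.write = false;       // results are never copied back for this parameter
                ddf.partialRead = false; // load the whole table at once, independent of workitem count

                for (int i = 0; i < 100; i++)
                {
                    ClArray<double> aaf = myArr;
                    // the table travels as a kernel parameter; the first iteration uploads it
                    aaf.nextParam(bbf, ccf, ddf).compute(gpu, 1, "myCalc", 1024);
                    if (i == 0) ddf.read = false; // after the first run, skip re-uploading the table
                }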

If there is "initialize frequently, load it always" scenario:

  • set the readOnly flag
  • initialize the data on the CPU
  • run the kernel on the GPU
  • initialize the data on the CPU
  • run the kernel on the GPU
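
A minimal sketch of that second scenario, reusing the names from the question:

                ClArray<double> aaf = myArr;
                aaf.readOnly = true; // optimized transfer path; data is still re-sent on every run
                for (int i = 0; i < 100; i++)
                {
                    for (int j = 0; j < 1024; j++)
                        myArr[j] = i + j * 0.5;                          // initialize data on the CPU
                    aaf.nextParam(bbf, ccf).compute(gpu, 1, "myCalc", 1024); // run the kernel on the GPU
                }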

If GPU data duplication is an issue (because of the shared-distributed architecture going on in the background):

  • set the zeroCopy flag
  • every element access will go through PCI-E lanes (or at least causes page faults), and "concurrent" writing/reading between different kernels is not supported by all GPU architectures; this is not thoroughly tested
  • if it is a "streaming" workload, then this has the best performance


tugrul512bit commented Jun 26, 2019

Lastly, only OpenCL 2.0 supports static variables at global (program) scope, and that was not tested here. I guess it works, but only for the same kernel (a kernel2 would still see its own copy), and OpenCL 2.0 still limits it to a "const initializer expression". I think CUDA is much more advanced than OpenCL in this case, as you can even change GPU constant arrays from the host.
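
For reference, a program-scope variable in OpenCL C 2.0 looks like this (a sketch; untested, as noted):

                // OpenCL C 2.0 only: program scope, global address space,
                // and the initializer must be a constant expression
                global float table[4] = { 0.0f, 1.0f, 2.0f, 3.0f };

                __kernel void useTable(__global float * outp)
                {
                    int i = get_global_id(0);
                    outp[i] = table[i % 4];
                }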

You don't need to worry about an equivalent of cudaMemcpyToSymbol when there is no equivalent of __device__. There are only cl-buffer copies, and cl-buffers are used as kernel parameters; they are a kind of GPU memory handle carrier for the host side. Cekirdekler just adds another layer over them to treat them as C# arrays, so just as C# arrays can be shared between methods, the GPU buffers they point to can be shared too, without being duplicated inside the same GPU. Also, the 64 kB limit on constant arrays at global scope is the GPU's limitation, and being stuck with the __constant type (for program scope) is OpenCL 1.2's limitation. I wish I had written this using CUDA, but then there would be no possibility of a CPU/FPGA/GPU mixture.


rajxabc commented Jun 27, 2019

Thank you for your time.
This is a lot to understand; I will get back to you after working through your explanation.
Thanks
