I'm developing my own implementation of Bitonic Sort in OpenCL. I would like to have concurrent kernel execution (CKE) because that would allow me to fully utilize the hardware. If I have a few kernels, one LDS heavy, one Global Memory heavy, one ALU heavy, etc., then interleaving them would lead to much better utilization of resources than running them serially - if one is stalled on LDS access, another can do ALU work or Global Memory transfers.

I did some research and it turned out that AMD hasn't yet implemented CKE in their APP SDK. On the other hand, nVidia does not officially support OpenCL 1.1 yet (OpenCL 1.1 brings Out-of-Order queues, events, some extensions, etc.). nVidia provides OpenCL 1.1 enabled SDKs if you register somewhere. I don't have a GeForce, so I haven't googled that.

I've asked on the AMD Developer Forum about a timeline for CKE support but haven't received a satisfactory answer so far. My post is here: http://forums.amd.com/devforum/messa...&enterthread=y

I have prepared a program that tests whether the OpenCL implementation you have installed supports CKE. It's here: http://www28.zippyshare.com/v/17487664/file.html

It requires three parameters: <iterations> <multipleQueues> <threads>

iterations is a long value,
multipleQueues is a boolean value,
threads is an int value.

For example, the parameters could look like: 1234567 false 10
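To make the expected argument types concrete, here is a minimal sketch of how the three parameters could be parsed in Java. This is not the code from my test program: the class name CkeTestArgs, the parse method, and the validation rules (non-negative iterations, at least one thread, strict true/false) are assumptions made only for illustration.

```java
import java.util.Locale;

// Hypothetical sketch: names and validation rules are assumptions,
// not taken from the actual test program.
final class CkeTestArgs {
    final long iterations;
    final boolean multipleQueues;
    final int threads;

    CkeTestArgs(long iterations, boolean multipleQueues, int threads) {
        this.iterations = iterations;
        this.multipleQueues = multipleQueues;
        this.threads = threads;
    }

    // Parses <iterations> <multipleQueues> <threads>, in that order.
    static CkeTestArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "usage: <iterations> <multipleQueues> <threads>");
        }
        // Long.parseLong throws NumberFormatException on malformed input.
        long iterations = Long.parseLong(args[0]);

        // Accept only "true" or "false" (any case) so a typo is not
        // silently treated as false, which Boolean.parseBoolean would do.
        String flag = args[1].toLowerCase(Locale.ROOT);
        if (!flag.equals("true") && !flag.equals("false")) {
            throw new IllegalArgumentException(
                "multipleQueues must be true or false: " + args[1]);
        }
        boolean multipleQueues = Boolean.parseBoolean(flag);

        int threads = Integer.parseInt(args[2]);
        if (iterations < 0 || threads < 1) {
            throw new IllegalArgumentException(
                "iterations must be >= 0 and threads must be >= 1");
        }
        return new CkeTestArgs(iterations, multipleQueues, threads);
    }

    public static void main(String[] args) {
        // Fall back to the example values from the post when no arguments are given.
        String[] effective = args.length == 0
            ? new String[] {"1234567", "false", "10"}
            : args;
        CkeTestArgs a = parse(effective);
        System.out.println(a.iterations + " " + a.multipleQueues + " " + a.threads);
    }
}
```

The strict check on the boolean is a design choice of this sketch; the real program may well be more lenient.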

I have a Juniper XT (i.e. Radeon HD 5770), which has 10 compute units (10 SIMD arrays), so theoretically it should be able to run 10 different kernels at once if there were enough logic to manage them. Sadly, the current OpenCL implementation from AMD executes only one kernel at a time.

Here's my terminal log:
piotrek@piotrek-pc:~/Pulpit/cke-test$ ./lin64.sh 1234567 true 1
Total kernels execution time: 286
Computations results (should be identical for identical number of iterations): 

piotrek@piotrek-pc:~/Pulpit/cke-test$ ./lin64.sh 1234567 true 10
Total kernels execution time: 2666
Computations results (should be identical for identical number of iterations): 

As you can see, running 10 different kernels takes roughly 9.3 times as long as running one (2666 vs. 286), i.e. close to 10 times, so it clearly shows that the kernels are not running in parallel.
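The reasoning behind that conclusion can be written down as a small check. This is only a sketch: the class CkeTimingCheck, its methods, and the 20% tolerance are assumptions for illustration and are not part of my test program. The idea is that if N kernels take close to N times as long as one kernel, they were most likely executed one after another.

```java
// Hypothetical helper: names and the tolerance value are assumptions.
final class CkeTimingCheck {

    // How many times longer the multi-kernel run took than the single-kernel run.
    static double slowdown(long singleKernelTime, long multiKernelTime) {
        return (double) multiKernelTime / singleKernelTime;
    }

    // True when the slowdown is within `tolerance` (as a fraction) of the
    // kernel count, which suggests the kernels ran serially.
    static boolean looksSerialized(long singleKernelTime, long multiKernelTime,
                                   int kernels, double tolerance) {
        double ratio = slowdown(singleKernelTime, multiKernelTime);
        return ratio >= kernels * (1.0 - tolerance);
    }

    public static void main(String[] args) {
        // Timings taken from the terminal log in this post.
        long one = 286;
        long ten = 2666;
        System.out.println("slowdown: " + slowdown(one, ten));
        System.out.println("serialized: " + looksSerialized(one, ten, 10, 0.2));
    }
}
```

With the numbers from my log the slowdown is about 9.3 for 10 kernels, so the check reports serialized execution. On an implementation with working CKE I would expect the ratio to stay much closer to 1 for this test.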

All kernels run by my program consist of a single work-item, so each of them occupies only one SIMD array.

Currently I've coded Bitonic Sort for blocks 4096 items wide. It is heavily limited by LDS bandwidth. I don't know exactly by how much, but with a fast enough LDS my algorithm would probably perform four times faster. That bottleneck could be hidden if CKE were supported - while one (quarter-) wavefront waits for LDS, another could do an ALU heavy task (e.g. encoding) or Global Memory transfers.

Maybe someone here has fresher info than I do and can tell me something about CKE?

Anyway, I would be happy if someone played with my program and posted the results.

My program requires Java and, of course, an OpenCL driver. It has a better chance of running if your Java has the same bitness as your operating system, i.e. you should use 64-bit Java on a 64-bit OS. I developed this program on a computer with Catalyst 11.3 drivers and AMD APP SDK version 2.4.

Additionally, there's a NetBeans project containing the sources: http://www49.zippyshare.com/v/67517804/file.html