Recently I was playing around with CUDA and GPU acceleration of different algorithms. The only limitation of my setup was that, it was not a very new one and hence it didn't support dynamic parallelism with CUDA.
Now, if you know the quicksort algorithm, you will find that the algorithm by itself is "embarrassingly" parallel. That is, place the pivot in it's one position in one step and then sort the two side of the pivot in parallel. Now if you are working on a system which allows you create parallel threads from a section already running in parallel with other threads (Yes, I expanded Dynamic Parallelism in layman's terms, deal with it), then your work with quicksort becomes very simple. But lacking that is a handicap and I was handicapped in this case.
Now, if you lack dynamic parallelism, you need to call all the parallel thread from the parent process only. Writing a parallel quicksort without dynamic parallelism needs the iterative variant of the algorithms (it is pretty obvious, you cannot recursively call threads). Iterative Quicksort works using a stack. In each iteration, the high and low index of the next iterations is stored in a stack. The concept has been explained here. To accelerate this algorithm using GPU, we store the stack in the device memory, so that every thread can access it, and in every iteration, we create threads equal to the number of index pair in this stack. Each thread, then uses the index pair stored at the memory address equal to the thread ID and partitions that region of the array, storing the next index pair in the stack. The important point to be taken care is that the addition of index pair to the stack needs to be synchronised, which is achieved by using atomic add.
The working code for this can be found here.
Now, if you know the quicksort algorithm, you will find that the algorithm by itself is "embarrassingly" parallel. That is, place the pivot in it's one position in one step and then sort the two side of the pivot in parallel. Now if you are working on a system which allows you create parallel threads from a section already running in parallel with other threads (Yes, I expanded Dynamic Parallelism in layman's terms, deal with it), then your work with quicksort becomes very simple. But lacking that is a handicap and I was handicapped in this case.
Now, if you lack dynamic parallelism, you need to call all the parallel thread from the parent process only. Writing a parallel quicksort without dynamic parallelism needs the iterative variant of the algorithms (it is pretty obvious, you cannot recursively call threads). Iterative Quicksort works using a stack. In each iteration, the high and low index of the next iterations is stored in a stack. The concept has been explained here. To accelerate this algorithm using GPU, we store the stack in the device memory, so that every thread can access it, and in every iteration, we create threads equal to the number of index pair in this stack. Each thread, then uses the index pair stored at the memory address equal to the thread ID and partitions that region of the array, storing the next index pair in the stack. The important point to be taken care is that the addition of index pair to the stack needs to be synchronised, which is achieved by using atomic add.
The working code for this can be found here.
how did u test if ur system didn't support dynamic parallelism?
ReplyDeleteCUDA supports Dynamic Parallelism in sm_35 and higher, I was still using a sm_2x
Delete