Timothy Farrar's blog is always a very good source of information if you are looking for in-depth thoughts and experience on GPUs.
On Wednesday, Timothy published a post about his vision of the evolution of GPUs in the near future. In particular, he puts forward the idea that future GPUs could expose a new, highly flexible mechanism for job distribution based on generic hardware-managed queues (FIFOs) associated with kernels.
Current GPUs start threads by scheduling groups of independent jobs between dependent state changes from a master control stream (a command buffer filled by the CPU). OpenGL conditional rendering provides a starting point for modifying the task list in this stream on the fly, and DX11 seems to go further with the DispatchIndirect function, which lets DX Compute grid dimensions come directly from device memory. The idea is that future hardware may provide generic queues that could be filled by kernels and used proactively by the hardware scheduler to set up thread blocks and route data to an available core, starting new thread blocks with the kernel associated with the queue.
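To make the difference concrete, here is a minimal CUDA sketch (the kernel names and the filtering test are made up for illustration) of the CPU round trip that an indirect dispatch avoids: the number of work items for the second pass is computed on the device, so without something like DispatchIndirect the host has to read it back before it can launch anything.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical first pass: each thread tests its element and bumps a
// device-side counter.
__global__ void countSurvivors(const float* in, int n, unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.5f)
        atomicAdd(count, 1u);
}

// Hypothetical second pass whose grid size depends on the first one.
__global__ void processSurvivors(unsigned int numSurvivors)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numSurvivors) {
        // ... work on the surviving elements would go here ...
    }
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = (i % 3 == 0) ? 1.0f : 0.0f;

    float* d_in;           cudaMalloc(&d_in, n * sizeof(float));
    unsigned int* d_count; cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(unsigned int));

    countSurvivors<<<(n + 255) / 256, 256>>>(d_in, n, d_count);

    // The CPU round trip: the grid size of the next pass is only known on
    // the device, so the host must read it back before it can launch.
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    if (h_count > 0)
        processSurvivors<<<(h_count + 255) / 256, 256>>>(h_count);
    cudaDeviceSynchronize();

    printf("survivors: %u\n", h_count);
    cudaFree(d_in); cudaFree(d_count);
    return 0;
}
```

With an indirect dispatch, the cudaMemcpy readback and the CPU-side launch decision disappear: the hardware reads the grid dimensions straight from the buffer the first pass wrote.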
Much of the work in parallel processing is related to grouping, moving, and compacting or expanding data, and ends up being a data routing problem. This model seems to provide a very good way to handle grouping for data locality: kernels that reach a divergence point (branch divergence or data locality divergence, for instance) could output threads to new queues with new domain coordinates, ensuring good regrouping for continued computation. The data associated with a kernel would also live in its queue and be managed in hardware, providing very fast access to thread parameters.
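Today this regrouping has to be emulated in software. The following CUDA sketch (names and the "divergent" test are invented; the queue is just a global-memory buffer with an atomically incremented tail, standing in for the hardware-managed queue of the proposal) shows the idea: threads that hit the divergent case push a work item carrying a new domain coordinate into a queue, and a second kernel later processes all those items densely packed together.

```cuda
#include <cuda_runtime.h>

// Software stand-in for a hardware-managed queue entry.
struct WorkItem {
    int   coord;   // new domain coordinate for the regrouped work
    float value;
};

__global__ void producer(const float* in, int n,
                         WorkItem* queue, unsigned int* queueTail)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    if (v < 0.0f) {
        // Divergent case: instead of taking the slow path here (leaving the
        // rest of the warp idle), append a work item to the queue so a later
        // kernel can handle all such items together.
        unsigned int slot = atomicAdd(queueTail, 1u);
        WorkItem w; w.coord = i; w.value = v;
        queue[slot] = w;
    } else {
        // Fast, coherent path handled in place.
    }
}

__global__ void consumer(const WorkItem* queue, unsigned int numItems,
                         float* out)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numItems) {
        WorkItem w = queue[i];
        // Every thread in this launch takes the same (formerly divergent)
        // path, so the expensive work now runs with good SIMD utilization.
        out[w.coord] = -w.value;
    }
}
```

A host-side driver would launch producer, read back queueTail (the same round trip as in the previous sketch), and launch consumer over the packed items; in Timothy's proposal, the queue, the compaction, and the scheduling of the consumer would instead be handled directly by the hardware.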
This can be done using a CPU-like coherent cache with a large vector processor, as on Larrabee, but data routing becomes expensive with a coherent cache, which spends transistors on routing that could have been defined explicitly by the programmer. When you attempt to do all this routing manually with dedicated local memory and high-throughput global memory, it is still expensive, just less so. Timothy's idea is that this mechanism could be heavily hardware-accelerated and could give "traditional" GPUs a big advantage over more generic, Larrabee-like architectures. I really think this is the way for GPUs to keep delivering high performance on more generic graphics rendering pipelines.
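As an example of what "doing the routing manually with dedicated local memory" looks like on current hardware, here is a small CUDA sketch (illustrative only, assuming 256-thread blocks) of block-level stream compaction: each block stages its surviving elements in shared memory, reserves a contiguous range in the output, and writes it out coherently. All of this bookkeeping is exactly the kind of work a hardware-managed queue would absorb.

```cuda
// Assumes blockDim.x == 256.
__global__ void compactInBlock(const float* in, int n,
                               float* out, unsigned int* outCount)
{
    // "Dedicated local memory": a per-block staging buffer and counters in
    // shared memory, managed explicitly by the programmer.
    __shared__ float        staged[256];
    __shared__ unsigned int blockCount;
    __shared__ unsigned int blockBase;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    // Each thread routes its surviving element into the staging buffer.
    if (i < n && in[i] > 0.5f) {
        unsigned int slot = atomicAdd(&blockCount, 1u);
        staged[slot] = in[i];
    }
    __syncthreads();

    // One thread reserves a contiguous range in the global output...
    if (threadIdx.x == 0)
        blockBase = atomicAdd(outCount, blockCount);
    __syncthreads();

    // ...and the block copies its compacted elements out coherently.
    if (threadIdx.x < blockCount)
        out[blockBase + threadIdx.x] = staged[threadIdx.x];
}
```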
The same idea is developed in a TOG paper that will be presented at SIGGRAPH this year. This paper presents GRAMPS, a programming model that generalizes concepts from modern real-time graphics pipelines by exposing an execution model mixing task parallelism and data parallelism, with both fixed-function and application-programmable processing stages that exchange data via queues.