Friday, June 15, 2012

The future of AMD’s Fusion APUs: Kaveri will fully share memory between CPU and GPU


AMD is hosting its Fusion Developer Summit this week, and the overarching theme is heterogeneous computing and the convergence of the CPU and GPU. During the opening keynote yesterday, Dr. Lisa Su, AMD's Senior Vice President and General Manager of Global Business Units, took the stage to talk about the company's future with HSA (Heterogeneous System Architecture).
One of the slides she presented showed off the company’s Fusion APU roadmap which included the Trinity APU’s successor — known as Kaveri. Kaveri will be able to deliver up to 1TFLOPS of (likely single precision) compute performance, thanks to its Graphics Core Next (GCN) GPU and a Steamroller-based CPU. The really interesting reveal, though, is that Kaveri will feature fully shared memory between the GPU and CPU.


AMD has been moving in the direction of a unified CPU+GPU chip for a long time, starting with the Llano APU, and Kaveri is the next step toward that goal of true convergence. AMD declared at the keynote that it is "betting the company on APUs," and spent considerable time talking up the benefits of the heterogeneous processor. Trinity, the company's latest APU available to consumers, beefs up the GPU and CPU interconnects with the Radeon Memory Bus and the FCL (Fusion Compute Link). These give the GPU access to system memory and the CPU access to the GPU frame buffer through a 256-bit and a 128-bit wide bus (per channel, each direction) respectively, allowing the graphics core and the x86 processor modules to access the same memory areas and communicate with each other.
Kaveri will take that idea even further with shared memory and a unified address space. The company is not yet saying how it will achieve this in hardware, but a shared on-die cache is not out of the question; such a layer has been noticeably absent from AMD's APUs. Phil Rogers, AMD Corporate Fellow, did state that the CPU and GPU will be able to share data from a single unified address space. This removes the time-intensive step of copying data from CPU-addressable memory to GPU-addressable memory space, and should vastly improve performance as a result.
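To make the cost concrete, here is a minimal sketch of what that copy step looks like today with the OpenCL 1.x host API (one of the programming models AMD supports). The kernel itself is assumed to already exist; the point is the staging around the launch.

```cpp
// Minimal OpenCL 1.x host-side sketch: running a kernel over host data
// today requires staging copies on either side of the launch.
#include <CL/cl.h>
#include <vector>

void run_with_copies(cl_context ctx, cl_command_queue q,
                     cl_kernel kernel, std::vector<float>& data) {
    cl_int err;
    // Copy host data into a device-visible buffer. On a discrete GPU this
    // crosses PCIe and is often the dominant cost of the whole operation.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                data.size() * sizeof(float), data.data(), &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = data.size();
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global,
                           nullptr, 0, nullptr, nullptr);

    // Copy the results back so the CPU can see them again.
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, data.size() * sizeof(float),
                        data.data(), 0, nullptr, nullptr);
    clReleaseMemObject(buf);
}
```

With a fully shared, coherent address space, both copies disappear: the GPU would simply dereference the same pointer the CPU uses (data.data()), leaving the kernel launch itself as essentially the whole cost. That staging overhead is exactly what Rogers is describing.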
AMD gave two examples of programs and situations where the heterogeneous architecture can improve performance, and where shared memory can push performance even further. The first example involved face detection algorithms. The algorithm runs in multiple stages: at each stage the image is scaled down while the search square stays the same size. In each stage, the algorithm looks for facial features (eyes, chin, ears, nose, and so on). If it does not find facial features, it discards the image and continues searching the further scaled-down images.
[Chart: CPU vs. GPU on the face detection algorithm. Smaller numbers are better (shorter processing times).]
The first stage of the workload is highly parallel, so the GPU is well suited to the task. In the first few stages the GPU performs well, but as the stages advance (and the dead ends multiply), its performance falls until it is eventually much slower than the CPU at the task. It was at this point that Phil Rogers talked up the company's heterogeneous architecture and the benefits of a "unified, shared, coherent memory." By allowing each processor to play to its strengths, the company estimates 2.5 times the performance and up to a 40% reduction in power usage versus running the algorithm on the CPU or GPU alone. AMD achieved its best numbers by using the GPU for the first three stages and the CPU for the remaining stages, where it was more efficient. This was only practical because the data did not have to be copied back and forth between the CPU and GPU for processing; with those copies, the overhead would have slowed things down too much for HSA to be beneficial.
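A rough sketch of that scheduling pattern is below. Both stage functions are hypothetical stand-ins for the real detector (the stubs just pass candidates through so the sketch compiles and runs), and the three-stage crossover point is the one AMD quoted.

```cpp
// Sketch of the hand-off AMD described: run the wide, regular early stages
// on the GPU and the sparse, branchy late stages on the CPU.
#include <vector>

struct Window { int x, y, scale; };  // one candidate search square

std::vector<Window> run_stage_gpu(const std::vector<Window>& in, int stage) {
    return in;  // stub: real code would launch a kernel over every window
}
std::vector<Window> run_stage_cpu(const std::vector<Window>& in, int stage) {
    return in;  // stub: real code would score surviving windows serially
}

std::vector<Window> detect_faces(std::vector<Window> candidates, int num_stages) {
    const int gpu_stages = 3;  // crossover point from AMD's demo
    for (int stage = 0; stage < num_stages && !candidates.empty(); ++stage) {
        // With unified memory, 'candidates' is never copied when the work
        // migrates between the two processors; each reads it in place.
        candidates = (stage < gpu_stages)
                         ? run_stage_gpu(candidates, stage)   // wide, parallel
                         : run_stage_cpu(candidates, stage);  // branchy tail
    }
    return candidates;  // windows surviving every stage are detected faces
}
```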


The company's second HSA demo drilled down into the time-intensive data copying issue even more. To show how shared memory cuts down on execution time (and sidesteps the data-copying problem, of course), the company used a server application called Memcached (pronounced "mem-cache-D") as an example. Memcached keeps a key-value table in system memory (i.e., ECC DDR3) and serves up components of web pages through set() and get() calls, so the data never has to be pulled from (much slower) disk storage.
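In miniature, the pattern looks something like this toy, single-process version (illustrative only; the real Memcached is a networked daemon written in C):

```cpp
// Toy illustration of the Memcached pattern: page fragments cached in RAM
// behind get()/set(), so the web server only does expensive work on a miss.
#include <optional>
#include <string>
#include <unordered_map>

class ToyCache {
    std::unordered_map<std::string, std::string> table_;  // lives in DRAM
public:
    void set(const std::string& key, const std::string& value) {
        table_[key] = value;                 // cache a rendered page fragment
    }
    std::optional<std::string> get(const std::string& key) const {
        auto it = table_.find(key);          // hash lookup: the hot path
        if (it == table_.end()) return std::nullopt;  // miss: caller hits disk/DB
        return it->second;
    }
};
```

It is that hot get() path, repeated across huge batches of requests, that AMD ported to the GPU.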
When the get() function is ported to the GPU, performance improves greatly thanks to the GPU's proficiency at parallel work. However, the program then runs into a bottleneck: the data and instructions must be copied from the CPU to the GPU before they can be processed.
AMD demos HSA accelerating MEMCACHED at AFDS 2012
Interestingly, the discrete GPU is the fastest at processing the data, yet ends up the slowest overall because it spends the majority of its execution time moving data between the GPU and CPU memory areas. The hardware to accelerate workloads that use both CPU and GPU is already available, but a great deal of execution time is wasted shuttling data from the memory the CPU uses to the GPU's memory (especially with discrete GPUs).
Trinity improves upon this by putting the GPU on the same die as the CPU and providing a fast bus with direct access to system memory (the same system memory the CPU uses, though not necessarily the same address space). Kaveri will go further by giving both types of processor fast access to the same single set of data in memory. Cutting out the most time-intensive task will let programs like Memcached reach their performance potential and run as fast as the hardware allows. In that way, unified and shared memory is a good thing, and will open up avenues to performance gains beyond what Moore's law and additional CPU cores can deliver alone. Allowing the GPU and CPU to work simultaneously from the same data set opens a lot of interesting doors for programmers to speed up workloads and manipulate data.
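On Kaveri-class hardware, the Memcached hot path could in principle look like the sketch below: the CPU batches incoming requests, and a GPU kernel probes the table through the very same pointers. gpu_lookup_batch is a hypothetical stand-in for a real kernel dispatch (written as a plain function here so the sketch runs), and the hashing is deliberately simplistic.

```cpp
// Hypothetical zero-copy version of the Memcached hot path on shared-memory
// hardware: the CPU fills a request batch, and the "GPU" probes the same
// table in place. No copy-in, no copy-out.
#include <cstdint>
#include <vector>

struct Request {
    uint64_t key_hash;    // hash of the requested key
    uint32_t table_slot;  // filled in by the lookup
};

// Stand-in for a GPU kernel dispatch; on HSA hardware this body would run
// on the GPU, dereferencing the very pointers the CPU filled in.
void gpu_lookup_batch(Request* reqs, size_t n, size_t table_size) {
    for (size_t i = 0; i < n; ++i)
        reqs[i].table_slot = static_cast<uint32_t>(reqs[i].key_hash % table_size);
}

int main() {
    const size_t table_size = 1 << 20;   // size of the in-memory hash table
    std::vector<Request> batch;          // filled by the CPU as requests arrive
    batch.push_back({0xDEADBEEFULL, 0});

    // Unified address space: hand over raw pointers and launch. The staging
    // copies that dominate the discrete-GPU timeline simply never happen.
    gpu_lookup_batch(batch.data(), batch.size(), table_size);
    return 0;
}
```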

AMD Trinity APU die shot. Piledriver modules and caches are on the left.

While AMD and the newly formed HSA Foundation (currently AMD, ARM, Imagination Technologies, MediaTek, and Texas Instruments) are pushing heterogeneous computing the hardest, the technology will be beneficial to everyone. The industry is definitely moving towards a more blended processing environment, something that began with the rise of specialty GPGPU workstation programs and is now starting to reach consumer applications. Programming models like C++ AMP, OpenCL, and Nvidia's CUDA let software harness the graphics card for suitable tasks, and more and more programs are using the GPU in some capacity (even if it is just drawing and managing the UI); as developers jump on board, the software side should accelerate towards using every component to its fullest. On the hardware side of things, we are already seeing GPUs and specialty application processors integrated into the CPU die (at least in mobile SoCs). Such varied configurations are becoming common and are continuing to evolve in a combined-architecture direction.

The mobile industry is a good example of HSA-style design catching on, with new system-on-a-chip processors coming out continuously and mobile operating systems harnessing GPU horsepower to assist the ARM CPU cores. AMD isn't just looking at low-power devices, however; it's pushing for "one (HSA) chip to rule them all" solutions that combine GPU cores with CPU cores (and even ARM cores!), each processing what it is best at, to deliver the best user experiences.
The overall transition to hardware and software that fully takes advantage of both processing types is still a ways off, but we are getting closer every day. Heterogeneous computing is the future, and assuming most software developers can be persuaded to recognize the benefits and program to take advantage of the new chips, I'm all for it. When additional CPU cores and smaller process nodes stop making the cut, heterogeneous computing is where the industry will look for its performance gains.



