AMD is hosting its Fusion Developer Summit this
week, and the overarching theme is heterogeneous computing and the
convergence of the CPU and GPU. During the initial keynote yesterday,
AMD’s Senior Vice President and General Manager of Global Business Units,
Dr. Lisa Su, stepped on stage to talk about the company’s future
with HSA (Heterogeneous System Architecture).
One of the slides
she presented showed off the company’s Fusion APU roadmap which included
the Trinity APU’s successor, known as Kaveri. Kaveri will be able to
deliver up to 1 TFLOPS of (likely single precision) compute performance,
thanks to its Graphics Core Next (GCN) GPU and a Steamroller-based CPU.
The really interesting reveal, though, is that Kaveri will feature
fully shared memory between the GPU and CPU.
AMD has been moving in the direction of a
unified CPU+GPU chip for a long time — starting with the Llano APU — and
Kaveri is the next step in achieving that goal of true convergence. AMD
announced at the keynote that “we are betting the company on APUs,” and
spent considerable time talking up the benefits of
the heterogeneous processor. Trinity, the company’s latest APU available
to consumers, beefs up the GPU and CPU interconnects with the Radeon
Memory Bus and the FCL (Fusion Compute Link). These give the GPU access
to system memory over a 256-bit wide bus and the CPU access to the GPU
frame buffer over a 128-bit wide bus (per channel, each direction). This allows
the graphics core and x86 processor modules to access the same memory
areas and communicate with each other.
Kaveri will take that idea
even further with shared memory and a unified address space. The company
is not yet talking about how it will specifically achieve this in
hardware, but a shared on-die cache, a layer that has been noticeably
absent from AMD’s APUs, is not out of the question. Phil Rogers, AMD
Corporate Fellow, did state that the CPU and GPU would be able to share
data from a single unified address space. That removes the relatively
time-intensive step of copying data from CPU-addressable memory to
GPU-addressable memory space, and will vastly improve performance as a result.
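A minimal sketch makes the difference concrete. Nothing here reflects AMD’s actual implementation: gpu_kernel() is a stand-in for any GPU-side computation, and plain memcpy() calls stand in for the staging copies that separate address spaces force.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for any GPU-side computation (illustrative only).
void gpu_kernel(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= 2.0f;
}

// Split address spaces: data must be staged into a GPU-visible buffer,
// processed, then copied back. Both copies are pure overhead.
void run_with_copies(std::vector<float>& host_data) {
    std::vector<float> gpu_buffer(host_data.size());
    std::memcpy(gpu_buffer.data(), host_data.data(),
                host_data.size() * sizeof(float));    // host -> GPU copy
    gpu_kernel(gpu_buffer.data(), gpu_buffer.size());
    std::memcpy(host_data.data(), gpu_buffer.data(),
                host_data.size() * sizeof(float));    // GPU -> host copy
}

// Unified address space: the GPU works on the original data in place,
// so both staging copies simply disappear.
void run_unified(std::vector<float>& host_data) {
    gpu_kernel(host_data.data(), host_data.size());
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    run_with_copies(data);
    run_unified(data);
}
```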
AMD gave two examples of programs
and situations where the heterogeneous architecture can improve
performance — and how shared memory can push performance even further.
The first example involved a face detection algorithm. The algorithm
runs in multiple stages: at each stage the image is scaled down while
the search square remains the same size. In each stage, the algorithm
looks for facial features (eyes, chin, ears, nose, and so on). If it
does not find any, it discards that image and continues searching the
further scaled-down images.
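The staged search can be sketched in a few lines of C++. This is a rough, hypothetical rendering of the description above: the per-stage feature test and the scaling factor are made up for illustration, not taken from AMD’s detector.

```cpp
#include <cstdio>
#include <vector>

struct Image { int width, height; /* pixel data omitted */ };
struct Hit   { int x, y, stage; };

constexpr int   kWindow = 24;    // the search square never changes size
constexpr float kScale  = 0.8f;  // each stage works on a smaller image

// Placeholder for the real per-stage feature test (eyes, nose, chin, ears...).
bool window_has_features(const Image&, int /*x*/, int /*y*/, int /*stage*/) {
    return false;
}

std::vector<Hit> detect_face(Image img, int num_stages) {
    std::vector<Hit> hits;
    for (int stage = 0; stage < num_stages; ++stage) {
        // Slide the fixed-size search square across the current image.
        for (int y = 0; y + kWindow <= img.height; ++y)
            for (int x = 0; x + kWindow <= img.width; ++x)
                if (window_has_features(img, x, y, stage))
                    hits.push_back({x, y, stage});
        // Whether or not this scale was a dead end, it is discarded and
        // the search continues on a further scaled-down image.
        img = { int(img.width * kScale), int(img.height * kScale) };
        if (img.width < kWindow || img.height < kWindow) break;
    }
    return hits;  // a real detector would merge overlapping hits
}

int main() {
    auto hits = detect_face({640, 480}, 22);  // hypothetical 22-stage run
    std::printf("%zu candidate windows\n", hits.size());
}
```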
Smaller numbers are better (they represent shorter processing times).
The first stage of the workload is highly parallel, so the GPU is
well-suited to the task. In the first few stages the GPU performs well,
but as the stages advance (and the dead ends multiply), the GPU’s
performance falls until it is eventually much slower than the CPU at
the task. It was at this point that Phil Rogers talked up the
company’s heterogeneous architecture and the benefits of a “unified,
shared, coherent memory.” By allowing the individual parts to play to
their strengths, the company estimates 2.5 times the performance and up
to a 40% reduction in power usage versus running the algorithm on
either the CPU or the GPU alone. AMD achieved its best numbers by using
the GPU for the first three stages and the CPU for the remaining stages
(where it was more efficient). That split was only practical because
the data did not have to be copied to and from the CPU and GPU at the
handoff; that copy would have cost too much performance for HSA to
be beneficial.
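AMD did not show code for the split, but the scheduling idea can be sketched in C++. Everything here is assumed: the run_stage_* helpers are stand-ins for real kernels, and the three-stage crossover point comes from the demo’s measurements. With HSA-style shared, coherent memory, both helpers would operate on the same candidate buffer with no copy at the handoff.

```cpp
#include <vector>

struct Candidate { int x, y, scale; };

// Stand-in for a wide, data-parallel GPU kernel.
std::vector<Candidate> run_stage_on_gpu(std::vector<Candidate> in, int /*stage*/) {
    return in;  // real code would filter candidates on the GPU
}

// Stand-in for the branchy, late-stage work where the CPU wins.
std::vector<Candidate> run_stage_on_cpu(std::vector<Candidate> in, int /*stage*/) {
    return in;  // real code would filter candidates on the CPU
}

std::vector<Candidate> detect(std::vector<Candidate> candidates, int num_stages) {
    constexpr int kGpuStages = 3;  // crossover point from AMD's demo
    for (int stage = 0; stage < num_stages; ++stage) {
        candidates = (stage < kGpuStages)
                         ? run_stage_on_gpu(std::move(candidates), stage)
                         : run_stage_on_cpu(std::move(candidates), stage);
    }
    return candidates;
}

int main() {
    detect({{0, 0, 0}}, 22);  // hypothetical 22-stage cascade
}
```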
The company’s second HSA demo drilled down into the time-intensive data
copying issue even more. To show off how shared memory cuts down on
execution time, and solves the copy problem (of course), the company
presented a server application called Memcached (pronounced “mem cache
D”) as an example. Memcached is a table of data kept in system memory
(i.e., ECC DDR3) on which applications use store() and get() functions
to serve up components of web pages without needing to pull the data
from (much slower) disk storage.
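For readers unfamiliar with the tool, a toy version of the idea fits in a few lines of C++. This is only a sketch of the concept (real Memcached is a networked, multi-threaded server); store() and get() mirror the operation names used in the demo.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Minimal sketch of the Memcached idea: a key-value table held entirely
// in system memory so page fragments never have to come from disk.
class MemCache {
public:
    void store(const std::string& key, const std::string& value) {
        table_[key] = value;  // lives in RAM, not on disk
    }
    std::optional<std::string> get(const std::string& key) const {
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;  // miss: caller hits disk/db
        return it->second;                            // hit: served from memory
    }
private:
    std::unordered_map<std::string, std::string> table_;
};

int main() {
    MemCache cache;
    cache.store("header.html", "<header>...</header>");
    auto fragment = cache.get("header.html");  // served without touching disk
}
```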
When the get() function is ported to the GPU, the application’s
performance improves greatly thanks to the GPU’s proficiency at
parallel work. However, the program then runs into a performance
bottleneck: a data copy operation is needed to bring the data and
instructions from the CPU to the GPU for processing.
Interestingly, the discrete GPU is the fastest at processing the data,
yet ends up the slowest overall because it spends the majority of its
execution time moving data between the CPU and GPU memory areas. The
hardware to accelerate these mixed CPU-and-GPU workloads is already
available, but a great deal of execution time is spent moving data from
the memory the CPU uses to the GPU’s memory (especially for discrete
GPUs).
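Some made-up numbers show why: if a discrete card finishes the kernel in 2 ms but spends 10 ms on PCIe transfers, a slower integrated GPU that skips the copy entirely still wins. The figures below are purely illustrative, not AMD’s measurements.

```cpp
#include <cstdio>

// Illustrative arithmetic only; none of these figures are AMD's measurements.
int main() {
    const double discrete_kernel_ms = 2.0;   // fastest raw compute (assumed)
    const double discrete_copy_ms   = 10.0;  // host->GPU + GPU->host transfers (assumed)
    const double apu_kernel_ms      = 4.0;   // slower GPU, zero-copy shared memory (assumed)

    std::printf("discrete GPU total: %.1f ms\n", discrete_kernel_ms + discrete_copy_ms);
    std::printf("shared-memory APU:  %.1f ms\n", apu_kernel_ms);  // no copy term
}
```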
Trinity improves upon this by
having the GPU on the same die as the CPU and providing a fast bus with
direct access to system memory (the same system memory the CPU uses,
though not necessarily the same address spaces). Kaveri will further
improve upon this by giving both types of processors fast access to the
same (single) set of data in memory. Cutting out the most time-intensive
task will let programs like Memcached reach their performance potential
and run as fast as the hardware will allow. In that way, unified and
shared memory is a good thing, and will open up avenues to performance
gains beyond what Moore’s law and additional CPU cores can deliver
alone. Allowing the GPU and CPU to
simultaneously work from the same data set opens a lot of interesting
doors for programmers to speed up workloads and manipulate data.
AMD Trinity APU die shot. Piledriver modules and caches are on the left.
While
AMD and the newly formed HSA Foundation (currently AMD, ARM,
Imagination Technologies, MediaTek, and Texas Instruments) are pushing
heterogeneous computing the hardest, it is technology that will benefit
everyone. The industry is definitely moving towards a more blended
processing environment, something that began with the rise of specialty
GPGPU workstation programs and is now starting to make its way into
consumer applications. Programming models like C++ AMP, OpenCL, and
Nvidia’s CUDA harness the graphics card for general-purpose tasks.
More and more programs are using the GPU for certain tasks (even if
it’s just drawing and managing the UI), and as more developers jump on
board, the software side should accelerate further towards using each
component to its fullest. On the hardware side of things, we are
already seeing GPUs integrated into the CPU die alongside specialty
application processors (at least in mobile SoCs). Such varied
configurations are becoming common and are continuing to evolve towards
a combined architecture.
The mobile industry is a good example of HSA catching on, with new
system-on-a-chip processors coming out continuously and mobile
operating systems that harness GPU horsepower to assist the ARM CPU
cores. AMD isn’t just looking at low-power devices, however; it’s
pushing for “one (HSA) chip to rule them all” solutions that combine
GPU cores with CPU cores (and even ARM cores!) so that each processes
what it is best at to deliver the best user experience.
The overall transition to hardware and software that fully take
advantage of both processing types is still a ways off, but we are
getting closer every day.
Heterogeneous computing is the future, and assuming most software
developers can be made to recognize the benefits and program to take
advantage of the new chips, I’m all for it. When additional CPU cores
and smaller process nodes stop making the cut, heterogeneous computing is where the industry will look for performance gains.