This article is part of the Technology Insight series, made possible with funding from Intel.

As data stretches from the core of the network to the smart edge, it encounters increasingly diverse compute resources that balance power, performance, and response time. Historically, graphics processors (GPUs) were the preferred offload target for data processing. Today, field-programmable gate arrays (FPGAs), vision processing units (VPUs), and application-specific integrated circuits (ASICs) also bring unique strengths to the table. Intel refers to those accelerators (and anything else a CPU can send processing tasks to) as XPUs.

The challenge software developers face is determining which XPU is best for their workload; arriving at an answer often involves a lot of trial and error. Faced with a growing list of architecture-specific programming tools to support, Intel developed a standards-based programming model called oneAPI to unify code across XPU types. Simplifying software development for XPUs cannot happen soon enough. After all, the shift to heterogeneous computing (processing on the best XPU for a given application) seems inevitable, given the evolving use cases and the many devices competing to address them.


  • Intel considers heterogeneous computing (where a host device sends computing tasks to different accelerators) to be unavoidable.
  • An XPU can be any CPU-directed offload destination, built on any hardware vendor's architecture.
  • The oneAPI initiative is an open, standards-based programming model that enables developers to target multiple XPUs with a single code base.

Intel’s strategy faces headwinds from NVIDIA’s incumbent CUDA platform, which assumes you are using NVIDIA graphics processors exclusively. That walled garden may not be as impenetrable as it once was. Intel already has a design win for its upcoming Xe-HPC GPU, codenamed Ponte Vecchio: Argonne National Laboratory’s Aurora supercomputer, for example, will have more than 9,000 nodes, each with six Xe-HPC GPUs, for a total of more than 1 exaFLOP/s of sustained double-precision performance.

Time will tell if Intel can deliver on its promise of optimizing heterogeneous programming with a single API, lowering the barrier to entry for both hardware vendors and software developers. A compelling XPU roadmap certainly gives the industry reason to take a closer look.

Heterogeneous computing is the future, but it’s not easy

The total volume of data distributed among internal data centers, cloud repositories, third-party data centers, and remote locations is expected to increase by more than 42% from 2020 to 2022, according to the Seagate Rethink Data Survey. The value of that information depends on what you do with it, where, and when. Some data can be captured, classified, and stored to drive advancements in machine learning. Other applications require a real-time response.

The computing resources required to satisfy those use cases are nothing alike. GPUs optimized for server platforms draw hundreds of watts each, while a VPU in the one-watt range might power a smart camera or a computer vision-based AI device. In every case, a developer must decide which XPU will process data most efficiently. This is not a new phenomenon. Rather, it is an evolution of a decades-long trend toward heterogeneity, where applications run compute, data, and control tasks on the hardware architecture best suited to each specific workload.

Above: The pursuit of higher performance will make heterogeneous computing a necessity.

“The transition to heterogeneity is inevitable for the same reasons we went from single-core CPUs to multi-core CPUs,” says James Reinders, an Intel engineer specializing in parallel computing. “It’s making our computers more capable, able to solve more problems and do things they couldn’t do in the past, but within the limitations of the hardware we can design and build.”

As with the adoption of multicore processing, which forced developers to start thinking of their algorithms in terms of parallelism, the biggest obstacle to making computers more heterogeneous today is the complexity of programming them.

It used to be that developers programmed close to the hardware using low-level languages that provided very little abstraction. The code was often fast and efficient, but not portable. These days, higher-level languages extend support across a broader swath of hardware while hiding a lot of unnecessary detail. The compilers, runtimes, and libraries beneath the code make the hardware do what you want. It makes sense that those layers of abstraction are what will allow increasingly specialized architectures to deliver new functionality.

oneAPI aims to simplify software development for XPU

Even now, each new accelerator requires its own software stack, consuming the hardware vendor’s time and money. Developers then make their own investment, learning new tools just to determine the best architecture for their application.

Rather than wasting time rewriting and recompiling code against different libraries and SDKs, imagine an open, cross-architecture model that lets the same code move between architectures without leaving performance on the table. That’s what Intel is proposing with its oneAPI initiative.

Above: The oneAPI Base Toolkit includes everything you need to start writing applications that take advantage of Intel’s CPU and XPU architectures.

oneAPI comprises a high-level language (Data Parallel C++, or DPC++), a set of APIs and libraries, and a hardware abstraction layer for low-level XPU access. Beyond the open specification, Intel offers its own set of tools for various development tasks. The Base Toolkit, for example, includes the DPC++ compiler, a handful of libraries, a compatibility tool for migrating NVIDIA CUDA code to DPC++, the optimization-oriented VTune profiler, and the Advisor analysis tool, which helps identify the kernels best suited for offload. Other toolkits focus on more specific segments, such as HPC, AI and machine learning acceleration, IoT, rendering, and deep learning inference.

“When we talk about oneAPI at Intel, it’s a pretty simple concept,” says Reinders. “I want as much as possible to be the same. Not that there is one API for everything. Rather, if I want to do fast Fourier transforms, I want to learn the interface of an FFT library, then I want to use that same interface for all my XPUs.”

Intel is not throwing its weight behind oneAPI for purely altruistic reasons. The company already has a broad portfolio of XPUs that will benefit from a unified programming model (in addition to the host processors tasked with controlling them). If each XPU were treated as an island, the industry would end up back where it was before oneAPI: with separate software ecosystems, marketing resources, and training for each architecture. By making as much as possible common across architectures, developers can spend more time innovating and less time reinventing the wheel.

What will it take for the industry to start caring about Intel’s message?

A large share of FLOP/s, or floating-point operations per second, comes from GPUs. NVIDIA’s CUDA is the dominant platform for general-purpose GPU computing, and it assumes you’re using NVIDIA hardware. Because CUDA is the predominant technology, developers are reluctant to change software that already works, even if they would prefer more hardware options.

Above: Intel’s Xe-HPC GPU employs new architecture, high-bandwidth memory, and advanced packaging technologies to deliver unprecedented performance.

If Intel wants the community to look beyond proprietary lock-in, it needs to build a better mousetrap than its competition, and that starts with compelling GPU hardware. At its recent Architecture Day 2021, Intel revealed that a pre-production implementation of its Xe-HPC architecture is already producing more than 45 TFLOP/s of FP32 performance, more than 5 TB/s of fabric bandwidth, and more than 2 TB/s of memory bandwidth. On paper at least, that’s higher single-precision performance than NVIDIA’s fastest data center processor.

However, the world of XPUs is more than just GPUs, which is either exhilarating or scary, depending on whom you ask. Backed by an open, standards-based programming model, a panoply of architectures can deliver time-to-market benefits, dramatically lower power consumption, or workload-specific optimizations. But without oneAPI (or something like it), developers are stuck learning new tools for each accelerator, hampering innovation and overwhelming programmers.

Above: Fugaku, the world’s fastest supercomputer, uses optimized oneDNN code to maximize the performance of its Arm-based CPUs.

Fortunately, we are seeing signs of life beyond NVIDIA’s closed platform. As an example, the team behind RIKEN’s Fugaku supercomputer recently used Intel’s oneAPI Deep Neural Network Library (oneDNN) as a reference to develop its own deep learning library. Fugaku uses Fujitsu A64FX CPUs, based on Armv8-A with the Scalable Vector Extension (SVE) instruction set, which did not yet have a DL library. Optimizing the code for Armv8-A processors enabled up to a 400x speedup compared to simply recompiling oneDNN without modification. Merging those changes into the library’s main branch makes the team’s gains available to other developers.

Intel’s Reinders acknowledges that this all sounds a lot like open source. The XPU philosophy, however, goes one step further, affecting the way code is written so that it is ready for the different types of accelerators running underneath it. “I’m not worried that this is some kind of fad,” he says. “It is one of the next important steps in computing. It’s not a question of whether an idea like oneAPI will happen, but when.”


