Where are FPGAs stronger than traditional CPUs?

The servers in Microsoft's data centers are still dominated by traditional Intel CPUs, but according to our earlier reports, Microsoft now plans to use field-programmable gate arrays (FPGAs) to replace the original processor architecture. Because FPGAs are reprogrammable, Microsoft can tailor them with its own software to serve its own workloads. These FPGA chips are reportedly already on the market, and Microsoft is negotiating a purchase with a company called Altera.

1. Why use FPGA?

As we all know, Moore's Law is in its twilight years for general-purpose processors (CPUs), while the scale of machine learning and Web services has grown exponentially. People use custom hardware to accelerate common computing tasks, but a fast-changing industry demands that this custom hardware be reprogrammable for new kinds of computing tasks. The FPGA (Field-Programmable Gate Array) is a hardware-reconfigurable architecture. For many years it served mainly as a low-volume substitute for ASICs, but in recent years it has been deployed at scale in the data centers of companies such as Microsoft and Baidu, to provide powerful computing capability and sufficient flexibility at the same time.

Why are FPGAs fast? Largely because their peers set them off well: CPUs and GPUs both belong to the von Neumann architecture, where instructions are decoded and executed and memory is shared. The reason FPGAs are more energy-efficient than CPUs, and even GPUs, is essentially the benefit of an architecture with no instructions and no shared memory.

In the von Neumann architecture, because an execution unit (such as a CPU core) may execute arbitrary instructions, it needs instruction memory, a decoder, arithmetic units for the various instructions, and branch/jump handling logic. Since the control logic for an instruction stream is complex, there cannot be too many independent instruction streams; the GPU therefore uses SIMD (single instruction, multiple data) to let multiple execution units process different data in lockstep, and CPUs also support SIMD instructions. In an FPGA, the function of each logic unit is fixed when it is reprogrammed (burned), so no instructions are needed.
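The lockstep constraint of SIMD can be sketched in a toy Python model (purely illustrative, not how any real GPU is programmed): at a branch, a SIMD machine typically computes both paths for every lane and uses a per-lane mask to select results, so all lanes advance at the same pace.

```python
def simd_select(mask, then_vals, else_vals):
    # SIMD-style branch: both paths have already been computed for every
    # lane; the mask picks a result per lane, keeping all lanes in lockstep
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

xs = [1, -2, 3, -4]
mask = [x > 0 for x in xs]          # which lanes take the "then" path
then_vals = [x * 10 for x in xs]    # computed for ALL lanes
else_vals = [-x for x in xs]        # also computed for ALL lanes
print(simd_select(mask, then_vals, else_vals))  # [10, 2, 30, 4]
```

Note that both `then_vals` and `else_vals` are evaluated for every lane; divergent branches waste work but preserve the uniform pace.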

Memory plays two roles in the von Neumann architecture: saving state, and communicating between execution units. Because memory is shared, access arbitration is required; and to exploit locality of access, each execution unit has a private cache, whose consistency must be maintained across execution units. For saving state, the registers and on-chip memory (BRAM) in an FPGA belong to their respective control logic, so no unnecessary arbitration or caching is needed. For communication, the connections between each FPGA logic unit and its neighbors are fixed when the FPGA is reprogrammed (burned), so no communication through shared memory is needed.

After all this view from 30,000 feet, how do FPGAs actually perform? Let's look at compute-intensive tasks and communication-intensive tasks separately.

Examples of compute-intensive tasks include matrix operations, image processing, machine learning, compression, asymmetric encryption, Bing search ranking, and so on. These tasks are usually offloaded by the CPU to the FPGA. For this kind of task, the Altera (it seems I should say Intel now, but I am still used to calling it Altera...) Stratix V FPGA we currently use delivers integer multiplication performance roughly on par with a 20-core CPU, and floating-point multiplication performance roughly on par with an 8-core CPU, but an order of magnitude lower than a GPU. The next-generation FPGA we will use, Stratix 10, will be equipped with more multipliers and hardened floating-point units, and should in theory reach computing power comparable to today's top GPU compute cards.

In the data center, the core advantage of the FPGA over the GPU is latency. For a task like Bing search ranking, to return search results as quickly as possible, the latency of every step must be minimized. With GPU acceleration, the batch size cannot be too small if the GPU's computing power is to be fully used, so latency reaches the millisecond range. With FPGA acceleration, only microseconds of PCIe latency are needed (our current FPGAs are deployed as PCIe accelerator cards). In the future, once Intel introduces Xeon + FPGA connected via QPI, the latency between CPU and FPGA can drop below 100 nanoseconds, no different from accessing main memory.
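A rough latency-budget sketch makes the comparison concrete. The numbers below are illustrative placeholders, not measurements: a millisecond-class batch wait for the GPU, a microsecond-class PCIe link, and a sub-100 ns QPI-class link, with the compute time held equal to isolate the interconnect and batching effects.

```python
def offload_latency_us(link_one_way_us, compute_us, batch_wait_us=0.0):
    # total time for one item: wait for the batch to fill (if any),
    # cross the link, compute, cross the link back
    return batch_wait_us + 2 * link_one_way_us + compute_us

# illustrative numbers only (microseconds)
gpu  = offload_latency_us(link_one_way_us=1.0, compute_us=1.0, batch_wait_us=1000.0)
pcie = offload_latency_us(link_one_way_us=1.0, compute_us=1.0)  # FPGA as PCIe card
qpi  = offload_latency_us(link_one_way_us=0.1, compute_us=1.0)  # Xeon + FPGA via QPI
print(gpu, pcie, qpi)  # 1003.0 3.0 1.2
```

Under this toy model, batching dominates the GPU's latency, and shrinking the link from PCIe-class to QPI-class pushes the FPGA's latency close to the compute time itself.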

Why is FPGA latency so much lower than GPU latency? It is essentially an architectural difference: the FPGA has both pipeline parallelism and data parallelism, while the GPU has almost only data parallelism (its pipeline depth is limited). Suppose processing a data packet takes 10 steps. An FPGA can build a 10-stage pipeline in which different stages handle different packets; a packet is finished after passing through all 10 stages, and each finished packet can be output immediately. The GPU's data-parallel approach is to build 10 compute units, each also processing a different packet, but all compute units must do the same thing in lockstep (SIMD, Single Instruction, Multiple Data). This requires 10 packets to enter and leave together, which increases input/output latency. When tasks arrive one by one rather than in batches, pipeline parallelism achieves lower latency than data parallelism, so FPGAs have an inherent latency advantage over GPUs for streaming computing tasks.
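The 10-stage example above can be written down as a toy latency model (illustrative Python; the per-stage time and arrival gap are made-up parameters): a pipelined packet's latency is just its own transit time, while a batched packet must also wait for the batch to fill.

```python
STAGES = 10      # processing steps per packet, as in the example above
STAGE_US = 1.0   # hypothetical time per stage, in microseconds

def pipelined_latency_us():
    # FPGA-style pipeline: each packet flows through the 10 stages and
    # exits immediately; its latency does not depend on other packets
    return STAGES * STAGE_US

def batched_latency_us(arrival_gap_us, batch=10):
    # GPU/SIMD-style batching: the first packet must wait for the batch
    # to fill before all packets are processed and emitted together
    wait_for_batch = (batch - 1) * arrival_gap_us
    return wait_for_batch + STAGES * STAGE_US

print(pipelined_latency_us())   # 10.0
print(batched_latency_us(5.0))  # 55.0  (packets arriving every 5 us)
```

The gap widens as packets arrive more sparsely: the pipeline's latency stays fixed, while the batch-fill wait grows with the arrival gap.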

Purpose-built ASIC chips are beyond reproach in throughput, latency, and power consumption, but Microsoft did not adopt them, for two reasons:

1. Data-center computing tasks are flexible and ever-changing, while ASIC development is costly and slow. Suppose a batch of accelerator cards for one kind of neural network were deployed at scale, only for another kind of neural network to become more popular: the money would be wasted. An FPGA needs only a few hundred milliseconds to update its logic function. This flexibility protects the investment; in fact, how Microsoft uses FPGAs today is very different from its original plan.

2. The data center is leased to different tenants. If some machines had neural-network accelerator cards, some had Bing search accelerator cards, and some had network-virtualization accelerator cards, task scheduling and server operations and maintenance (O&M) would become troublesome. Using FPGAs preserves the homogeneity of the data center.

Next, consider communication-intensive tasks. Compared with compute-intensive tasks, communication-intensive tasks do little processing on each piece of input data: basically a simple computation followed by output, so communication often becomes the bottleneck. Symmetric encryption, firewalls, and network virtualization are all examples of communication-intensive tasks.

For communication-intensive tasks, the FPGA holds an even greater advantage over the CPU and GPU. In terms of throughput, the transceivers on an FPGA can connect directly to 40 Gbps or even 100 Gbps network cables and process packets of any size at line rate, whereas a CPU must first receive packets from a NIC before processing them, and many NICs cannot process 64-byte small packets at line rate. Although performance can be scaled up by adding NICs, the number of PCIe slots supported by the CPU and motherboard is often limited, and NICs and switches are themselves expensive.
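The line-rate claim can be made concrete with a little arithmetic. On Ethernet, each frame carries about 20 bytes of wire overhead (an 8-byte preamble plus a 12-byte inter-frame gap), so small packets translate into a very high packet rate; a back-of-the-envelope sketch:

```python
def packets_per_second(link_gbps, frame_bytes, overhead_bytes=20):
    # 8-byte preamble + 12-byte inter-frame gap accompany every
    # Ethernet frame on the wire, in addition to the frame itself
    wire_bits = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / wire_bits

rate = packets_per_second(40, 64)
print(f"{rate / 1e6:.1f} Mpps")  # ~59.5 million 64-byte packets per second
```

At nearly 60 Mpps, a CPU has on the order of tens of nanoseconds per packet, which is why line-rate small-packet processing is hard without dedicated hardware.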

In terms of latency, a round trip in which the NIC hands a packet to the CPU and the CPU sends it back to the NIC takes 4 to 5 microseconds even with a high-performance packet-processing framework such as DPDK. A more serious problem is that the latency of a general-purpose CPU is not stable enough: under high load, forwarding latency may rise to tens of microseconds or more (as shown in the figure below); clock interrupts and task scheduling in modern operating systems also add uncertainty to latency.

Although the GPU can also process packets with high performance, the GPU has no network port, which means packets must first be received by the NIC before the GPU can process them. Throughput is thus limited by the CPU and/or the NIC, to say nothing of the GPU's own latency.

So why not build these network functions into a NIC, or use a programmable switch? Because ASIC flexibility is still limited. Although programmable switch chips are growing ever more powerful, such as Tofino with its support for the P4 language, ASICs still cannot do complex stateful processing, such as a custom encryption algorithm.

In summary, the main advantage of FPGAs in the data center is stable and extremely low latency, which suits both streaming compute-intensive tasks and communication-intensive tasks.