What an NPU actually is
A Neural Processing Unit (NPU) is a dedicated hardware accelerator designed to execute the linear algebra that dominates neural network inference — primarily large matrix multiplications and convolutions at reduced numerical precision (INT8, INT4, FP16, BF16). Unlike a CPU, which is optimized for low-latency execution of branchy, control-heavy code, and unlike a GPU, which is optimized for massively parallel floating-point throughput across thousands of general-purpose shader cores, an NPU is built around a much narrower hardware contract: feed it tensors, get tensors back, do so at a few-watt power envelope.
Nearly every current consumer SoC platform ships one. Apple calls it the Neural Engine (introduced 2017 in the A11). Qualcomm calls it the Hexagon NPU. Intel calls it the AI Boost NPU (Meteor Lake, 2023). AMD calls it the Ryzen AI NPU (Phoenix, 2023). Google has the Tensor TPU block in Pixel SoCs. Samsung embeds NPUs in Exynos. MediaTek ships APU blocks in Dimensity. Microsoft drew a marketing line around the category, the "Copilot+ PC" tier, at 40 TOPS, which is now the de facto floor for premium Windows laptops.
The reason every silicon vendor converged is straightforward: AI inference workloads are becoming permanent fixtures on consumer devices. Always-on speech recognition, image segmentation in the camera ISP, neural noise suppression, on-device LLM responses, retrieval-augmented search, semantic photo indexing — these workloads are sustained, parallel, and tolerant of reduced precision. They are also wildly inefficient to run on a CPU or a discrete GPU. A purpose-built NPU completes the same inference in roughly one-fifth to one-tenth the energy.
NPU vs CPU vs GPU
The three are complementary, not competing, but the design constraints are different enough that the same workload can be 50x more efficient on one than another.
CPU. Optimized for low-latency, single-threaded performance. Branch prediction, out-of-order execution, deep cache hierarchies. A modern x86 or ARM core executes one to a few instructions per cycle on irregular control flow. It can do AI inference (especially with AVX-512 VNNI or ARM SVE2), but per-watt it is the worst of the three for sustained tensor math.
GPU. Optimized for parallel throughput across thousands of lightweight cores. Excellent at the dense matrix math at the heart of neural networks, which is why the bulk of training happens on GPUs and similar datacenter accelerators. Modern GPUs include tensor cores (NVIDIA) or matrix engines (AMD MI300, Apple GPU) that hit very high TOPS numbers at FP16/BF16. The cost is power: a discrete GPU pulls roughly 100-700 W under load, depending on its class. Even the integrated GPU in a laptop SoC pulls more than the NPU for the same inference.
NPU. Fixed-function or semi-fixed pipelines for tensor operations. No graphics workload to serve, no general-purpose shader programming model, no need to keep thousands of threads in flight for occupancy. The hardware is laid out for systolic-array matrix multiplies, fused activation functions, and direct DMA from memory into the compute fabric. Result: roughly 5-10x more TOPS per watt than the GPU on the same die for the workloads it is designed for. Modern NPUs hit 40-60 TOPS at 2-3 W sustained.
The practical division of labor on a current laptop or phone:
- Training, large-batch inference, GPU-resident pipelines → GPU.
- Latency-sensitive, low-batch, sustained AI features running while the device is on battery → NPU.
- Pre/post-processing, control flow, glue code → CPU.
A single Microsoft Teams call on a Copilot+ PC may use all three simultaneously: NPU runs background blur and noise suppression, GPU runs the video codec, CPU handles the call control plane.
What TOPS actually measure (and why the number lies a little)
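TOPS, short for tera (trillion) operations per second, is the standard NPU performance metric. It counts how many arithmetic operations the accelerator can execute per second at a stated numerical precision, almost always INT8. By the usual convention each multiply-accumulate (MAC) counts as two operations, one multiply and one add, so a peak TOPS figure is twice the peak MAC rate.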
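As a rough illustration of where a headline figure comes from, the sketch below derives peak TOPS from the size of a MAC array and its clock. It assumes the common convention of two operations per MAC and 100% utilization; the array size and clock speed are hypothetical values chosen for illustration, not any vendor's published specification.

```python
def peak_tops(mac_units: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak tera-operations per second for an idealized MAC array.

    Assumes every MAC unit retires one multiply-accumulate per cycle
    and that each MAC is counted as `ops_per_mac` operations.
    """
    return mac_units * ops_per_mac * clock_hz / 1e12


# Hypothetical NPU: 16,384 INT8 MAC units clocked at 1.4 GHz.
tops = peak_tops(mac_units=16_384, clock_hz=1.4e9)
print(f"{tops:.1f} peak INT8 TOPS")
```

A figure computed this way is an upper bound. It says nothing about how often the array is actually kept busy.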
Three caveats make TOPS comparisons across vendors slippery:
- Precision matters. A 40 TOPS rating at INT8 is not 40 TOPS at FP16. INT8 roughly doubles throughput relative to FP16 on the same hardware; INT4 doubles it again. Vendors quote whichever precision flatters them.
- Peak vs sustained. Peak TOPS assumes the systolic array is 100% utilized. Real workloads rarely hit that. Sustained TOPS at the rated power envelope is the number that predicts user-visible performance.
- TOPS ignores memory bandwidth. Inference is often memory-bound, not compute-bound, especially for LLMs. An NPU rated 50 TOPS with 50 GB/s memory bandwidth will be slower on a 7B-parameter model than a 30 TOPS NPU with 100 GB/s.
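The bandwidth caveat can be checked with a back-of-the-envelope roofline estimate. The sketch below makes several simplifying assumptions that are not from the text above: batch size 1 token generation, INT8 weights (one byte each), every weight read from memory once per token, about two operations per parameter per token, and no accounting for KV cache or activation traffic. The `Npu` class and function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Npu:
    name: str
    tops: float           # rated peak, tera-operations per second
    bandwidth_gbs: float  # memory bandwidth in GB/s


def decode_tokens_per_second(npu: Npu, params_billion: float,
                             bytes_per_weight: float = 1.0) -> float:
    """Upper bound on tokens/s: the lower of the compute and bandwidth limits."""
    ops_per_token = 2 * params_billion * 1e9          # ~2 ops per parameter
    compute_limit = npu.tops * 1e12 / ops_per_token
    bytes_per_token = params_billion * 1e9 * bytes_per_weight
    bandwidth_limit = npu.bandwidth_gbs * 1e9 / bytes_per_token
    return min(compute_limit, bandwidth_limit)


a = Npu("50 TOPS / 50 GB/s", tops=50, bandwidth_gbs=50)
b = Npu("30 TOPS / 100 GB/s", tops=30, bandwidth_gbs=100)
for npu in (a, b):
    rate = decode_tokens_per_second(npu, params_billion=7)
    print(f"{npu.name}: at most {rate:.1f} tokens/s on a 7B INT8 model")
```

Under these assumptions both parts are limited by bandwidth by a wide margin, so the part with the lower TOPS rating and the higher bandwidth comes out ahead.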
The 40 TOPS Copilot+ floor is a Microsoft go-to-market gate, not a hardware capability boundary. Below it, certain on-device Copilot features (Recall, Live Captions translation, Studio Effects) refuse to run. The cutoff was chosen to align with the first generation of Snapdragon X Elite, Intel Lunar Lake, and AMD Strix Point — i.e. the silicon Microsoft wanted to certify in 2024-2025.
The vendors and their NPU roadmaps
Apple Neural Engine. Shipped in every A-series chip since A11 (2017) and every M-series chip since M1 (2020). The M4 Neural Engine is rated at 38 TOPS. The Neural Engine is tightly coupled to Apple's Core ML framework and the operating system schedulers; third-party developers reach it through Core ML and the macOS/iOS APIs built on it, not via direct programming. (MLX, Apple's array framework, targets the CPU and GPU rather than the Neural Engine.) Apple does not publish low-level NPU documentation.
Qualcomm Hexagon NPU. Shipped in Snapdragon mobile and the Snapdragon X Elite / X Plus laptop chips. The X Elite NPU is rated at 45 TOPS, ahead of the Copilot+ floor. Developers access it through Qualcomm's QNN runtime, ONNX Runtime, or DirectML on Windows.
Intel AI Boost NPU. Debuted in Meteor Lake (2023, ~11 TOPS). Lunar Lake (2024) brought it to ~48 TOPS for Copilot+ compliance. Arrow Lake desktop chips brought a smaller NPU to mainline desktop x86 for the first time. Developers program it through OpenVINO or DirectML.
AMD Ryzen AI / XDNA. XDNA is AMD's Xilinx-derived NPU architecture. Phoenix (2023) shipped with ~10 TOPS; Strix Point (2024) hit 50 TOPS for Copilot+ tier; Strix Halo extended it to workstations. AMD exposes the NPU through ONNX Runtime and the Ryzen AI Software stack.
Google Tensor. Pixel-only. The TPU block in Tensor G4 powers on-device Gemini Nano features. Google does not productize the Tensor NPU outside Pixel hardware.
MediaTek APU and Samsung NPU. These ship in Dimensity and Exynos SoCs respectively, accessible through vendor SDKs and increasingly through cross-vendor runtimes like ONNX Runtime Mobile and TensorFlow Lite.
The trend is convergence on common runtimes. ONNX Runtime, TensorFlow Lite, DirectML on Windows, and Core ML on Apple platforms are doing the abstraction work that lets a single model target any NPU. The next two years will be about software maturity, not silicon — every flagship now has roughly the same TOPS budget within a factor of two.
What the NPU actually accelerates today
The honest answer for 2026: a narrower workload set than the marketing suggests, but it is growing fast.
Always shipping on NPU.
- Camera ISP neural pipelines (HDR fusion, denoise, scene segmentation, smart HDR, Cinematic mode).
- On-device speech recognition (Siri, Pixel Recorder transcription, Windows Live Captions, dictation).
- Real-time meeting effects (background blur, eye contact correction, noise suppression — Microsoft Studio Effects, Apple FaceTime effects).
- Biometric authentication (Face ID, Windows Hello with enhanced sign-in security).
- Photo indexing and semantic search.
Increasingly running on NPU in 2025-2026.
- On-device small LLMs (Phi-3.5 Mini, Gemma 2 2B, Llama 3.2 1B/3B, Apple Foundation Models) for completion, summarization, and Copilot/Apple Intelligence features.
- Image generation at small resolution (Stable Diffusion variants tuned for NPU).
- Document understanding (PDF semantic search, automatic alt text).
- Recall-style retrieval embeddings over local content.
Still on GPU, not NPU.
- Anything 13B+ parameters at usable speed.
- Image generation at native resolution.
- Most creative-app neural features (Adobe Firefly, DaVinci Resolve neural engine).
- Anything that benefits from large batch size.
The dividing line is shifting toward the NPU year-over-year as memory bandwidth improves and on-device model quality reaches usable thresholds at smaller parameter counts.
What developers can actually do with the NPU
For most engineers, "use the NPU" means: export your model to ONNX, run it through ONNX Runtime with the appropriate execution provider (QNN for Qualcomm, OpenVINO for Intel, Core ML for Apple, the Ryzen AI stack's provider for AMD, DirectML as the generic Windows path), and let the runtime place ops on the NPU where it can and fall back to GPU/CPU where it cannot.
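A minimal sketch of that fallback pattern follows. The provider strings are the names ONNX Runtime uses for these execution providers, to the best of my knowledge, with `VitisAIExecutionProvider` being the one AMD's Ryzen AI Software builds on. The preference order, the `pick_providers` helper, and the `model.onnx` path are assumptions made for illustration. Whether a given provider actually places ops on the NPU depends on the installed build, drivers, and provider options, all of which are omitted here.

```python
# Preference order is an assumption for this sketch, not a recommendation.
PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",       # Qualcomm Hexagon
    "OpenVINOExecutionProvider",  # Intel (NPU device selection not shown)
    "VitisAIExecutionProvider",   # AMD Ryzen AI
    "CoreMLExecutionProvider",    # Apple
    "DmlExecutionProvider",       # DirectML on Windows
    "CPUExecutionProvider",       # fallback
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep preferred providers that are installed, ending with the CPU fallback."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; pretend only the CPU provider exists.
    available = ["CPUExecutionProvider"]

providers = pick_providers(available)
print("Provider order:", providers)

# With onnxruntime installed and a real model file (hypothetical path):
# session = ort.InferenceSession("model.onnx", providers=providers)
```

ONNX Runtime tries the providers in list order and assigns each graph node to the first one that supports it, which is why the CPU provider goes last.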
Direct NPU programming is rare and vendor-specific. There is no NPU equivalent of CUDA: no single language, no portable kernel programming model, no broadly shared toolchain. This is currently the biggest gap and the place vendors are competing.
The practical workflow:
- Train on GPU (cloud or workstation), produce an ONNX or Core ML model.
- Quantize to INT8 or INT4 using vendor tooling (Qualcomm AI Hub, Intel Neural Compressor, ONNX Runtime quantization, Core ML Tools).
- Validate accuracy retention on a representative dataset.
- Deploy and let the runtime route to the NPU.
Quantization is the make-or-break step. NPUs are dramatically faster at INT8 than FP16 and faster again at INT4, but each precision drop costs some accuracy. The art is choosing the precision tier that preserves task-relevant accuracy on your specific workload.
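To make the precision trade-off concrete, here is a small sketch of symmetric per-tensor quantization, one of the simplest schemes. It is not how any of the vendor tools above necessarily work; they typically add calibration data, per-channel scales, and other refinements. The weight values are made up for illustration.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Worst-case round-trip error for one tensor at the given bit width."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.02, -0.75, 0.31, 1.20, -0.004, 0.66]  # made-up example tensor
for bits in (8, 4):
    print(f"INT{bits}: max round-trip error {max_abs_error(weights, bits):.4f}")
```

The round-trip error grows as the bit width shrinks. Weight error of this kind is only a proxy, though; what matters in practice is accuracy measured on a representative dataset for the actual task.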
Where this goes next
Three threads to watch through 2026-2027:
Memory bandwidth catches up to compute. LPDDR5X is giving way to LPDDR6 and on-package memory; Apple, AMD, and Intel have all signaled unified-memory architectures with NPU access at GPU-class bandwidth. This unlocks the 13B+ parameter on-device tier.
The software stack consolidates. ONNX Runtime is becoming the lowest-common-denominator deployment target across NPUs. Core ML stays Apple-only but is broadly mature. Microsoft's DirectML and the Windows Copilot Runtime are converging on a single Windows-side abstraction.
The workload boundary keeps shifting. Expect the on-device LLM tier to move from 1-3B parameters today to 7-13B parameters by late 2026, with retrieval-augmented generation over local data as the killer pattern. The cloud will keep the frontier models; the NPU takes everything that is good-enough on-device and benefits from privacy, latency, or offline use.
The NPU is not a passing fad and not the end of the GPU. It is the third leg of the consumer compute stack, and the next two years are about figuring out what it actually wants to do.