AI Infrastructure & Hardware

NPU Explained: What a Neural Processing Unit Is, How It Differs From a CPU and GPU

How NPUs work, why every laptop and phone now has one, and what they actually accelerate

By Deepak Gupta · May 21, 2026 · 9 min read

Key Findings

  • An NPU is a fixed-function accelerator for low-precision matrix and tensor operations — the math that dominates neural network inference. It is not a small GPU.
  • NPUs win on power efficiency for sustained on-device inference: typically 5-10x more TOPS per watt than running the same model on a GPU.
  • The 40-TOPS Copilot+ threshold is a marketing line drawn around Microsoft's on-device AI features, not a hardware capability boundary.
  • Apple, Qualcomm, Intel, AMD, MediaTek, and Google have all converged on shipping NPUs in consumer silicon; the differentiator is now software stacks and developer access, not raw TOPS.
  • Most workloads still run on GPU or CPU. The NPU's killer use cases — always-on transcription, on-device LLMs at 7B-class size, real-time camera effects — are still being defined.

What an NPU actually is

A Neural Processing Unit (NPU) is a dedicated hardware accelerator designed to execute the linear algebra that dominates neural network inference — primarily large matrix multiplications and convolutions at reduced numerical precision (INT8, INT4, FP16, BF16). Unlike a CPU, which is optimized for low-latency execution of branchy, control-heavy code, and unlike a GPU, which is optimized for massively parallel floating-point throughput across thousands of general-purpose shader cores, an NPU is built around a much narrower hardware contract: feed it tensors, get tensors back, do so at a few-watt power envelope.
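To make "matrix multiplication at reduced precision" concrete, here is a minimal pure-Python sketch of the arithmetic involved: INT8 inputs multiplied and summed into a wider accumulator. The function names `int8_dot` and `int8_matmul` are illustrative, not any vendor's API, and real NPUs perform this in parallel hardware rather than in loops.

```python
def int8_dot(a, b):
    """Dot product of two INT8 vectors using a wide accumulator."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0  # stands in for the INT32 accumulator commonly used in hardware
    for x, y in zip(a, b):
        if not (-128 <= x <= 127 and -128 <= y <= 127):
            raise ValueError("inputs must fit in INT8")
        acc += x * y  # one multiply-accumulate (MAC)
    return acc


def int8_matmul(A, B):
    """Multiply an (m x k) INT8 matrix by a (k x n) INT8 matrix."""
    cols = list(zip(*B))  # columns of B
    return [[int8_dot(row, col) for col in cols] for row in A]


A = [[1, -2, 3], [4, 5, -6]]
B = [[7, 8], [9, 10], [11, 12]]
print(int8_matmul(A, B))  # [[22, 24], [7, 10]]
```

The accumulator has to be wider than the inputs: four products of 127 × 127 already sum to 64,516, which overflows a 16-bit signed integer.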

Every modern consumer silicon platform ships one. Apple calls it the Neural Engine (introduced in 2017 with the A11). Qualcomm calls it the Hexagon NPU. Intel calls it the AI Boost NPU (Meteor Lake, 2023). AMD calls it the Ryzen AI NPU (Phoenix, 2023). Google has the TPU block in its Pixel Tensor SoCs. Samsung embeds NPUs in Exynos. MediaTek ships APU blocks in Dimensity. Microsoft drew a marketing line around the category, the "Copilot+ PC" tier, at 40 TOPS, which is now the de facto floor for premium Windows laptops, x86 and Arm alike.

The reason every silicon vendor converged is straightforward: AI inference workloads are becoming permanent fixtures on consumer devices. Always-on speech recognition, image segmentation in the camera ISP, neural noise suppression, on-device LLM responses, retrieval-augmented search, semantic photo indexing — these workloads are sustained, parallel, and tolerant of reduced precision. They are also wildly inefficient to run on a CPU or a discrete GPU. A purpose-built NPU completes the same inference in roughly one-fifth to one-tenth the energy.

NPU vs CPU vs GPU

The three are complementary, not competing, but the design constraints are different enough that the same workload can be 50x more efficient on one than another.

CPU. Optimized for low-latency, single-threaded performance. Branch prediction, out-of-order execution, deep cache hierarchies. A modern x86 or ARM core executes one to a few instructions per cycle on irregular control flow. It can do AI inference (especially with AVX-512 VNNI or ARM SVE2), but per-watt it is the worst of the three for sustained tensor math.

GPU. Optimized for parallel throughput across thousands of lightweight cores. Excellent at the dense matrix math at the heart of neural networks, which is why the overwhelming majority of training happens on GPUs. Modern GPUs include tensor cores (NVIDIA) or matrix engines (AMD MI300, Apple GPU) that reach very high throughput at FP16/BF16. The cost is power: a discrete GPU pulls 100-700 W under load. Even the integrated GPU in a laptop SoC pulls more than the NPU for the same inference.

NPU. Fixed-function or semi-fixed pipelines for tensor operations. No graphics workload to serve, no general-purpose shader programming model, no need to keep thousands of threads in flight for occupancy. The hardware is laid out for systolic-array matrix multiplies, fused activation functions, and direct DMA from memory into the compute fabric. Result: roughly 5-10x more TOPS per watt than the GPU on the same die for the workloads it is designed for. Modern NPUs hit 40-60 TOPS at 2-3 W sustained.

The practical division of labor on a current laptop or phone:

  • Training, large-batch inference, GPU-resident pipelines → GPU.
  • Latency-sensitive, low-batch, sustained AI features running while the device is on battery → NPU.
  • Pre/post-processing, control flow, glue code → CPU.

A single Microsoft Teams call on a Copilot+ PC may use all three simultaneously: NPU runs background blur and noise suppression, GPU runs the video codec, CPU handles the call control plane.

What TOPS actually measure (and why the number lies a little)

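TOPS (tera operations per second, i.e. trillions of operations per second) is the standard NPU performance metric. It is derived from how many multiply-accumulate (MAC) operations the accelerator can execute per second at a stated numerical precision, almost always INT8. Vendors typically count each MAC as two operations (one multiply, one add), so a headline TOPS figure is usually twice the MAC rate.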
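As a rough sketch of where a headline figure comes from, the calculation below assumes the common convention that one MAC counts as two operations. The MAC-unit count and clock speed are hypothetical values chosen for illustration, not the specification of any shipping NPU.

```python
def peak_tops(mac_units, clock_hz, ops_per_mac=2):
    """Theoretical peak TOPS: every MAC unit busy on every clock cycle."""
    return mac_units * clock_hz * ops_per_mac / 1e12


# Hypothetical NPU: 16,384 INT8 MAC units clocked at 1.4 GHz.
print(round(peak_tops(16_384, 1.4e9), 1))  # 45.9
```

This is a ceiling: it assumes 100% utilization and says nothing about whether memory can feed the MAC array fast enough.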

Three caveats make TOPS comparisons across vendors slippery:

  1. Precision matters. A 40 TOPS rating at INT8 is not 40 TOPS at FP16. On the same hardware, INT8 roughly doubles throughput relative to FP16, and INT4 roughly doubles it again where the hardware supports it. Vendors quote whichever precision flatters them.
  2. Peak vs sustained. Peak TOPS assumes the systolic array is 100% utilized. Real workloads rarely hit that. Sustained TOPS at the rated power envelope is the number that predicts user-visible performance.
  3. TOPS ignores memory bandwidth. Inference is often memory-bound, not compute-bound, especially for LLMs, because generating each token requires streaming the model's weights from memory. An NPU rated 50 TOPS with 50 GB/s memory bandwidth will typically be slower on a 7B-parameter model than a 30 TOPS NPU with 100 GB/s.
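The third caveat can be checked with a back-of-the-envelope estimate. The sketch below is a simplified roofline-style model for single-stream LLM decoding. It assumes every weight is read from memory once per generated token and that each parameter costs about two operations per token. It ignores caches, KV-cache traffic, and utilization losses, so treat the output as an upper bound rather than a benchmark. The function name is hypothetical.

```python
def decode_tokens_per_second(params, bytes_per_param, tops, bandwidth_gb_s):
    """Upper-bound tokens/s for batch-1 decoding under a simple roofline model."""
    memory_time = params * bytes_per_param / (bandwidth_gb_s * 1e9)  # stream weights
    compute_time = 2 * params / (tops * 1e12)                        # ~2 ops per weight
    return 1.0 / max(memory_time, compute_time)                      # slower one wins


# 7B-parameter model quantized to INT8 (1 byte per parameter).
npu_a = decode_tokens_per_second(7e9, 1, tops=50, bandwidth_gb_s=50)
npu_b = decode_tokens_per_second(7e9, 1, tops=30, bandwidth_gb_s=100)
print(f"50 TOPS, 50 GB/s:  {npu_a:.1f} tokens/s")
print(f"30 TOPS, 100 GB/s: {npu_b:.1f} tokens/s")
```

Under these assumptions both parts are memory-bound, so the lower-TOPS part with twice the bandwidth comes out roughly twice as fast (about 14 versus about 7 tokens per second).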

The 40 TOPS Copilot+ floor is a Microsoft go-to-market gate, not a hardware capability boundary. Below it, certain on-device Copilot features (Recall, Live Captions translation, Studio Effects) refuse to run. The cutoff was chosen to align with the first generation of Snapdragon X Elite, Intel Lunar Lake, and AMD Strix Point — i.e. the silicon Microsoft wanted to certify in 2024-2025.

The vendors and their NPU roadmaps

Apple Neural Engine. Shipped in every A-series chip since A11 (2017) and every M-series chip since M1 (2020). The M4 Neural Engine is rated at 38 TOPS. The Neural Engine is tightly coupled to Apple's Core ML framework and the operating system schedulers; third-party developers reach it through Core ML and the macOS/iOS APIs built on top of it, not via direct programming. (MLX, Apple's array framework, primarily targets the GPU and CPU rather than the Neural Engine.) Apple does not publish low-level NPU documentation.

Qualcomm Hexagon NPU. Shipped in Snapdragon mobile and the Snapdragon X Elite / X Plus laptop chips. The X Elite NPU is rated at 45 TOPS, ahead of the Copilot+ floor. Developers access it through Qualcomm's QNN runtime, ONNX Runtime, or DirectML on Windows.

Intel AI Boost NPU. Debuted in Meteor Lake (2023, ~11 TOPS). Lunar Lake (2024) brought it to ~48 TOPS for Copilot+ compliance. Arrow Lake desktop chips brought a smaller NPU to mainline desktop x86 for the first time. Programming via OpenVINO or DirectML.

AMD Ryzen AI / XDNA. XDNA is AMD's Xilinx-derived NPU architecture. Phoenix (2023) shipped with ~10 TOPS; Strix Point (2024) hit 50 TOPS for Copilot+ tier; Strix Halo extended it to workstations. AMD exposes the NPU through ONNX Runtime and the Ryzen AI Software stack.

Google Tensor. Pixel-only. The TPU block in Tensor G4 powers on-device Gemini Nano features. Google does not productize the Tensor NPU outside Pixel hardware.

MediaTek APU and Samsung NPU. Both ship in Dimensity and Exynos respectively, accessible through vendor SDKs and increasingly through cross-vendor runtimes like ONNX Runtime Mobile and TensorFlow Lite.

The trend is convergence on common runtimes. ONNX Runtime, TensorFlow Lite, DirectML on Windows, and Core ML on Apple platforms are doing the abstraction work that lets a single model target any NPU. The next two years will be about software maturity, not silicon — every flagship now has roughly the same TOPS budget within a factor of two.

What the NPU actually accelerates today

The honest answer for 2026: a narrower workload set than the marketing suggests, but it is growing fast.

Already shipping on NPU.

  • Camera ISP neural pipelines (HDR fusion, denoise, scene segmentation, smart HDR, Cinematic mode).
  • On-device speech recognition (Siri, Pixel Recorder transcription, Windows Live Captions, dictation).
  • Real-time meeting effects (background blur, eye contact correction, noise suppression — Microsoft Studio Effects, Apple FaceTime effects).
  • Biometric authentication (Face ID, Windows Hello with enhanced sign-in security).
  • Photo indexing and semantic search.

Increasingly running on NPU in 2025-2026.

  • On-device small LLMs (Phi-3.5 Mini, Gemma 2 2B, Llama 3.2 1B/3B, Apple Foundation Models) for completion, summarization, and Copilot/Apple Intelligence features.
  • Image generation at small resolution (Stable Diffusion variants tuned for NPU).
  • Document understanding (PDF semantic search, automatic alt text).
  • Recall-style retrieval embeddings over local content.

Still on GPU, not NPU.

  • Anything 13B+ parameters at usable speed.
  • Image generation at native resolution.
  • Most creative-app neural features (Adobe Firefly, DaVinci Resolve neural engine).
  • Anything that benefits from large batch size.

The dividing line is shifting toward the NPU year-over-year as memory bandwidth improves and on-device model quality reaches usable thresholds at smaller parameter counts.
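One reason the 13B+ tier stays off the NPU for now is simply how much memory the weights occupy. The sketch below computes weight storage only (parameters times bits per parameter). It leaves out activations, the KV cache, and runtime overhead, all of which add to the real footprint. The helper name is hypothetical.

```python
def weight_footprint_gb(params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9


for name, params in [("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    sizes = ", ".join(
        f"{bits}-bit: {weight_footprint_gb(params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{name:>3}  {sizes}")
```

A 13B model needs about 26 GB at 16-bit and about 6.5 GB even at 4-bit, all of which must be streamed per token on a device that shares its memory with the OS and other apps.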

What developers can actually do with the NPU

For most engineers, "use the NPU" means: export your model to ONNX, run it through ONNX Runtime with the appropriate execution provider (QNN for Qualcomm, DirectML for Windows generally, OpenVINO for Intel, Core ML for Apple, Vitis AI via the Ryzen AI Software stack for AMD), and let the runtime place ops on the NPU where it can and fall back to GPU/CPU where it cannot.
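A minimal sketch of that provider selection in Python follows. The provider strings are ONNX Runtime's registered execution provider names as I understand them. The preference order, the helper names `choose_providers` and `make_session`, and the model path are assumptions for illustration. Provider-specific options (for example which backend or device a provider should target) are omitted, and without them some providers may run on the CPU rather than the NPU, so check each vendor's documentation.

```python
# Assumed preference order: NPU-capable providers first, CPU as the last resort.
PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",       # Qualcomm
    "OpenVINOExecutionProvider",  # Intel
    "VitisAIExecutionProvider",   # AMD Ryzen AI
    "CoreMLExecutionProvider",    # Apple
    "DmlExecutionProvider",       # DirectML on Windows
    "CPUExecutionProvider",       # always-available fallback
]


def choose_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep the preferred providers this build offers, always ending with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def make_session(model_path):
    """Create an inference session; requires the onnxruntime package."""
    import onnxruntime as ort  # imported lazily so choose_providers works without it

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


# Example: a Windows build that only offers DirectML and CPU.
print(choose_providers(["CPUExecutionProvider", "DmlExecutionProvider"]))
```

ONNX Runtime tries providers in list order and assigns each op to the first one that supports it, which is why the CPU provider sits last as the catch-all.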

Direct NPU programming is rare and vendor-specific. There is no NPU equivalent of CUDA: no single language, no portable kernel programming model, no broadly shared toolchain. This is the biggest gap today and the place where vendors are competing.

The practical workflow:

  1. Train on GPU (cloud or workstation), produce an ONNX or Core ML model.
  2. Quantize to INT8 or INT4 using vendor tooling (Qualcomm AI Hub, Intel Neural Compressor, ONNX Runtime quantization, Core ML Tools).
  3. Validate accuracy retention on a representative dataset.
  4. Deploy and let the runtime route to the NPU.

Quantization is the make-or-break step. NPUs are dramatically faster at INT8 than FP16 and faster again at INT4, but each precision drop costs some accuracy. The art is choosing the precision tier that preserves task-relevant accuracy on your specific workload.
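To show what a precision drop does to the numbers, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python. It is one simple scheme among several. Production tools typically add per-channel scales, calibration data, and zero points, and the function names here are illustrative rather than any toolkit's API.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]


weights = [0.02, -0.753, 0.314, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

The round-trip error is bounded by half the scale, so one large outlier weight inflates the scale and degrades precision for every other value in the tensor. That is the kind of effect the accuracy validation in step 3 is meant to catch.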

Where this goes next

Three threads to watch through 2026-2027:

Memory bandwidth catches up to compute. LPDDR5X is giving way to LPDDR6 and on-package memory; Apple, AMD, and Intel have all signaled unified-memory architectures with NPU access at GPU-class bandwidth. Because LLM inference is usually memory-bound, this is what unlocks the 13B+ parameter on-device tier.

The software stack consolidates. ONNX Runtime is becoming the lowest-common-denominator deployment target across NPUs. Core ML stays Apple-only but is broadly mature. Microsoft's DirectML and the Windows Copilot Runtime are converging on a single Windows-side abstraction.

The workload boundary keeps shifting. Expect the on-device LLM tier to move from 1-3B parameters today to 7-13B parameters by late 2026, with retrieval-augmented generation over local data as the killer pattern. The cloud will keep the frontier models; the NPU takes everything that is good-enough on-device and benefits from privacy, latency, or offline use.

The NPU is not a passing fad and not the end of the GPU. It is the third leg of the consumer compute stack, and the next two years are about figuring out what it actually wants to do.
