TierInfer — Run 70B Models on Consumer GPUs
A patent-pending inference engine designed to run large dense transformer models on consumer GPU systems by streaming weights across VRAM, system RAM, and NVMe storage.
USPTO Provisional Patent Application #63/995,308, filed March 3, 2026.
For business leaders
Cap your inference spend by buying hardware once instead of paying per token forever — and keep regulated, IP-sensitive data inside your network.
For engineers
Three-tier weight streaming across VRAM → RAM → NVMe with prefetch scheduled ahead of execution — designed so a single 16 GB consumer GPU can target 70B-class dense transformers.
What makes TierInfer different
Five capabilities combined into one commercial on-prem deployment model
1. Targets 70B-class models on consumer hardware
Architected to run 70B-class dense transformers on a single 16 GB consumer GPU — no multi-GPU rig, no H100 cluster required.
2. Three-tier weight streaming (patent-pending approach)
VRAM, system RAM, and NVMe treated as one coherent memory hierarchy — covered by USPTO Provisional Patent Application #63/995,308.
3. Prefetch scheduled ahead of execution
Layer transfers are scheduled to overlap with active compute, designed to reduce GPU stalls while waiting on cold layers.
4. On-premise by default
Runs entirely on hardware you own. No telemetry, no per-token billing, no proprietary data leaving your network.
5. Drops in behind existing application code
Targets an OpenAI-compatible HTTP surface for general availability so existing inference clients keep working unchanged.
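As a sketch of what that drop-in integration could look like, the snippet below points the standard OpenAI Python client at a local endpoint. The base URL, port, and model name are illustrative placeholders, not confirmed TierInfer defaults.

```python
# Hypothetical integration sketch: the standard OpenAI Python client pointed
# at a local, OpenAI-compatible endpoint. The URL, port, and model name are
# placeholders, not confirmed TierInfer defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # assumed local on-prem endpoint
    api_key="not-needed-on-prem",          # no hosted API key required locally
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",        # placeholder model identifier
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the key risks in this clause."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Because the surface is OpenAI-compatible, existing client code only needs its base URL changed to move from a hosted provider to the local engine.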
The Problem
Frontier LLMs are gated by enterprise GPU clusters
Hardware capped
When the full checkpoint has to fit in VRAM, consumer configurations are typically constrained well below frontier scale. Anything larger has historically required enterprise H100 / H200 clusters most businesses cannot justify.
Cloud is expensive at scale
Hosted inference on frontier models can run six figures per year for high-volume workloads, with throughput that degrades unpredictably under load.
Proprietary data leaves the building
Every cloud inference call ships your data outside your security perimeter. For regulated, IP-sensitive, or contractually constrained workloads, that is a non-starter.
The Approach
Three-tier weight streaming
TierInfer treats local hardware as a three-tier inference memory system. Instead of requiring the full model to fit in GPU memory, the engine keeps the active working set in VRAM, stages upcoming layers in RAM, and stores the full checkpoint on NVMe — scheduling weight movement ahead of execution so transfers can overlap with GPU compute.
Diagram: layer blocks stream NVMe → RAM → VRAM while the GPU consumes the active layer in parallel (full 70B model weights at rest on NVMe, prefetched layers staged in RAM, active layer plus KV cache feeding the GPU).
Tier 1: VRAM
Holds the active layer, KV cache, and hot working set being consumed by the GPU.
Tier 2: System RAM
Stages upcoming model blocks using pinned memory and asynchronous transfers.
Tier 3: NVMe
Stores the full quantized checkpoint and streams colder blocks on demand.
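A minimal sketch of the overlap idea, using PyTorch pinned host buffers and a dedicated CUDA copy stream. It illustrates the general technique only; the storage read and per-layer compute below are stand-ins, not TierInfer's actual scheduler.

```python
# Illustrative three-tier prefetch loop (general technique, not TierInfer's
# actual scheduler): weights are read from storage into pinned system RAM,
# then copied to VRAM on a side stream while the GPU computes the current layer.
import torch

NUM_LAYERS, D = 8, 4096
copy_stream = torch.cuda.Stream()            # dedicated host-to-device stream

def read_layer(i: int) -> torch.Tensor:
    # Tier 3 -> Tier 2 stand-in: a real engine would read a checkpoint shard
    # from NVMe here. Pinned RAM lets the later H2D copy run asynchronously.
    return torch.randn(D, D).pin_memory()

def prefetch(i: int):
    # Tier 2 -> Tier 1: asynchronous copy issued on the side stream.
    pinned = read_layer(i)
    with torch.cuda.stream(copy_stream):
        on_gpu = pinned.to("cuda", non_blocking=True)
    return pinned, on_gpu                    # keep the pinned buffer referenced

hidden = torch.randn(1, D, device="cuda")
next_layer = prefetch(0)
for i in range(NUM_LAYERS):
    # Make the default stream wait until layer i's weights are resident in VRAM.
    torch.cuda.current_stream().wait_stream(copy_stream)
    _, w = next_layer
    if i + 1 < NUM_LAYERS:
        # Kick off the next layer's NVMe -> RAM -> VRAM transfer so it
        # overlaps with this layer's compute on the default stream.
        next_layer = prefetch(i + 1)
    hidden = hidden @ w                      # stand-in for the real layer math

torch.cuda.synchronize()
print(hidden.shape)
```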
Goal
Enable 70B-class dense models on a single 16 GB consumer GPU — for example an RTX 5080 paired with 32 GB DDR5 and an NVMe SSD — by streaming weights from NVMe through RAM into VRAM. This configuration is the reference platform for current development.
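As a rough back-of-envelope check on why the tiering is necessary (assuming roughly 4 bits per weight after quantization and ignoring KV cache and runtime overhead; actual footprints vary):

```python
# Back-of-envelope: why a 70B dense checkpoint cannot fit in 16 GB of VRAM.
# Assumes ~4 bits (0.5 bytes) per weight after quantization; real footprints
# vary with quantization scheme, KV cache size, and runtime overhead.
params = 70e9
bytes_per_weight = 0.5
checkpoint_gb = params * bytes_per_weight / 1e9
print(f"~{checkpoint_gb:.0f} GB quantized checkpoint vs. 16 GB of VRAM")
# -> ~35 GB, so the full model lives on NVMe/RAM and only the active
#    working set is resident in VRAM at any given time.
```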
Specifications (preview)
What runs, where it runs
Final compatibility matrix and integration surface will be confirmed before general availability. The targets below reflect current development scope.
Model architecture
Dense decoder-only transformers (LLaMA, Mistral, Qwen architecture families). MoE support evaluated separately.
Parameter scale
Three-tier loading demonstrated for 70B-class checkpoints on a single 16 GB consumer GPU. End-to-end throughput validation in progress. Larger checkpoints constrained primarily by NVMe capacity.
Hardware target
NVIDIA consumer GPUs with ≥8 GB VRAM (RTX 40-series, RTX 50-series). Reference platform: RTX 5080, 32 GB DDR5, NVMe SSD.
Operating system
Linux and Windows.
Integration surface
OpenAI-compatible HTTP API targeted for GA — existing inference clients keep working without code changes.
Deployment
On-premise only by design. Single-binary or container install on a workstation or server. No telemetry, no cloud dependency.
Licensing
Commercial license for production deployment. Patent-pending approach (USPTO Provisional #63/995,308).
Status
Pre-release research project. Enterprise design partners welcome — contact us to discuss workloads and timeline.
Benchmarks
Early measurements
Provisional numbers from internal runs. Updated end-to-end throughput and cloud cost-comparison data will be published before general availability.
3B-class model
Milestone reached: first coherent generation achieved end-to-end through the streaming pipeline. Throughput varies by checkpoint, quantization, and build configuration; final numbers will be published before general availability.
70B-class model
Three-tier load demonstrated: internal layer-distribution snapshot from a 70B-class checkpoint loaded across the three tiers on a 16 GB consumer GPU.
In progress
End-to-end throughput validation for 70B-class generation, plus a head-to-head comparison vs. cloud inference cost per million tokens, will be published before TierInfer enters general availability.
Why It Matters
Frontier inference, on hardware you already own
Lower inference cost
Replace recurring cloud inference spend with a one-time hardware investment for predictable workloads.
Data stays on-premise
Proprietary data, regulated workloads, and sensitive customer information never leave your network.
Deploy on existing fleets
Run modern LLMs on consumer GPUs already deployed in workstations and edge servers — no cluster rebuild required.
Built by PulseSpark AI
A Pittsburgh AI company building inference infrastructure
TierInfer is developed by PulseSpark AI, a Pittsburgh-based AI company building LLM inference infrastructure, open-source AI agents (Rivet), and deploying AI for business. We are members of the NVIDIA Inception Program and Microsoft for Startups Founders Hub.