Research Project · USPTO Provisional Patent Application #63/995,308

TierInfer — Run 70B Models on Consumer GPUs

A patent-pending inference engine designed to run large dense transformer models on consumer GPU systems by streaming weights across VRAM, system RAM, and NVMe storage.

USPTO Provisional Patent Application #63/995,308, filed March 3, 2026.

For business leaders

Cap your inference spend by buying hardware once instead of paying per token forever — and keep regulated, IP-sensitive data inside your network.

For engineers

Three-tier weight streaming across VRAM → RAM → NVMe with prefetch scheduled ahead of execution — designed so a single 16 GB consumer GPU can target 70B-class dense transformers.

What makes TierInfer different

Five capabilities combined into one commercial on-prem deployment model

  1. Targets 70B-class models on consumer hardware

    Architected to run 70B-class dense transformers on a single 16 GB consumer GPU — no multi-GPU rig, no H100 cluster required.

  2. Three-tier weight streaming (patent-pending approach)

    VRAM, system RAM, and NVMe treated as one coherent memory hierarchy — covered by USPTO Provisional Patent Application #63/995,308.

  3. Prefetch scheduled ahead of execution

    Layer transfers are scheduled to overlap with active compute, designed to reduce GPU stalls while waiting on cold layers.

  4. On-premise by default

    Runs entirely on hardware you own. No telemetry, no per-token billing, no proprietary data leaving your network.

  5. Drops in behind existing application code

    Targets an OpenAI-compatible HTTP surface for general availability so existing inference clients keep working unchanged.
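
If the GA surface ships as targeted, switching an existing application over should amount to changing the client's base URL. A minimal sketch using the standard openai Python SDK; the endpoint URL, port, model name, and API-key placeholder are all assumptions, not a published TierInfer interface:

```python
# Illustrative only: endpoint URL, port, and model name are assumptions,
# not a published TierInfer interface.
from openai import OpenAI

# Point the standard OpenAI client at a local TierInfer endpoint
# instead of api.openai.com; no other client changes are needed.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # placeholder; the server runs on-prem
)

response = client.chat.completions.create(
    model="llama-70b-q4",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)
print(response.choices[0].message.content)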

The Problem

Frontier LLMs are gated by enterprise GPU clusters

Hardware capped

When the full checkpoint has to fit in VRAM, consumer configurations top out well below frontier scale; a 4-bit 70B checkpoint alone is roughly 35 GB, more than double a 16 GB card. Models at that scale have historically required enterprise H100 / H200 clusters most businesses cannot justify.

Cloud is expensive at scale

Hosted inference on frontier models can run six figures per year for high-volume workloads, with throughput that degrades unpredictably under load.

Proprietary data leaves the building

Every cloud inference call ships your data outside your security perimeter. For regulated, IP-sensitive, or contractually constrained workloads, that is a non-starter.

The Approach

Three-tier weight streaming

TierInfer treats local hardware as a three-tier inference memory system. Instead of requiring the full model to fit in GPU memory, the engine keeps the active working set in VRAM, stages upcoming layers in RAM, and stores the full checkpoint on NVMe, scheduling weight movement ahead of execution so transfers can overlap with GPU compute. A minimal sketch of this loop follows the tier descriptions below.

Diagram: layer blocks stream NVMe → RAM → VRAM while the GPU consumes the active layer in parallel.

Tier 1

VRAM

Holds the active layer, KV cache, and hot working set being consumed by the GPU.

Tier 2

System RAM

Stages upcoming model blocks using pinned memory and asynchronous transfers.

Tier 3

NVMe

Stores the full quantized checkpoint and streams colder blocks on demand.
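
To make the hand-off between tiers concrete, here is a minimal double-buffered version of this loop in stock PyTorch: while the GPU consumes layer i, the weights for layer i+1 are already crossing PCIe from pinned RAM. File names, shapes, and the torch.load stand-in for the NVMe reader are illustrative assumptions; this sketches the general technique, not TierInfer's implementation.

```python
# Minimal sketch of the prefetch-ahead pattern described above, in stock
# PyTorch. File names, shapes, and the torch.load stand-in for the NVMe
# reader are illustrative; this is not TierInfer's implementation.
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream dedicated to host-to-device copies

def load_from_nvme(path: str) -> torch.Tensor:
    # Tier 3 -> Tier 2: read a layer from disk into pinned (page-locked)
    # host memory so the upcoming PCIe copy can run asynchronously.
    # A real engine would issue this read on a worker thread so the disk
    # I/O itself also overlaps with compute.
    return torch.load(path, map_location="cpu").pin_memory()

def prefetch_to_vram(host_weights: torch.Tensor) -> torch.Tensor:
    # Tier 2 -> Tier 1: enqueue the copy on the side stream so it overlaps
    # with kernels running on the default stream.
    with torch.cuda.stream(copy_stream):
        return host_weights.to(device, non_blocking=True)

layer_paths = [f"layer_{i:02d}.pt" for i in range(80)]  # 80 layers in a 70B LLaMA
next_weights = prefetch_to_vram(load_from_nvme(layer_paths[0]))
hidden = torch.randn(1, 4096, device=device)  # stand-in activations

for i in range(len(layer_paths)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # copy of layer i done
    weights = next_weights
    if i + 1 < len(layer_paths):  # kick off layer i+1 while layer i computes
        next_weights = prefetch_to_vram(load_from_nvme(layer_paths[i + 1]))
    hidden = hidden @ weights  # stand-in for the real transformer layer
```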

Goal

Enable 70B-class dense models on a single 16 GB consumer GPU by streaming weights from NVMe through RAM into VRAM. The reference platform for current development is an RTX 5080 paired with 32 GB DDR5 and an NVMe SSD.
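
A back-of-envelope budget shows why this is plausible: a 4-bit 70B checkpoint is roughly 35 GB, far too large for 16 GB of VRAM but comfortable on NVMe, while a single layer is small enough for its PCIe copy to hide behind compute. The figures below are illustrative assumptions, not TierInfer measurements.

```python
# Back-of-envelope budget for the reference platform; all numbers are
# assumptions for illustration, not measured TierInfer figures.
PARAMS = 70e9
BYTES_PER_PARAM = 0.5      # 4-bit quantization
N_LAYERS = 80              # LLaMA-70B depth

checkpoint_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~35 GB, lives on NVMe
layer_gb = checkpoint_gb / N_LAYERS             # ~0.44 GB per layer

pcie_gbps = 25             # rough effective PCIe 4.0 x16 bandwidth, GB/s
copy_ms = layer_gb / pcie_gbps * 1000           # ~17.5 ms per layer

print(f"checkpoint ~ {checkpoint_gb:.0f} GB, "
      f"layer ~ {layer_gb * 1000:.0f} MB, "
      f"copy ~ {copy_ms:.1f} ms")
```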

Specifications (preview)

What runs, where it runs

Final compatibility matrix and integration surface will be confirmed before general availability. The targets below reflect current development scope.

Model architecture

Dense decoder-only transformers (LLaMA, Mistral, and Qwen architecture families). Mixture-of-experts (MoE) support is being evaluated separately.

Parameter scale

Three-tier loading demonstrated for 70B-class checkpoints on a single 16 GB consumer GPU. End-to-end throughput validation in progress. Larger checkpoints constrained primarily by NVMe capacity.

Hardware target

NVIDIA consumer GPUs with ≥8 GB VRAM (RTX 40-series, RTX 50-series). Reference platform: RTX 5080, 32 GB DDR5, NVMe SSD.

Operating system

Linux and Windows.

Integration surface

OpenAI-compatible HTTP API targeted for GA — existing inference clients keep working without code changes.

Deployment

On-premise only by design. Single-binary or container install on a workstation or server. No telemetry, no cloud dependency.

Licensing

Commercial license for production deployment. Patent-pending approach (USPTO Provisional #63/995,308).

Status

Pre-release research project. Enterprise design partners welcome — contact us to discuss workloads and timeline.

Benchmarks

Early measurements

Provisional numbers from internal runs. Updated end-to-end throughput and cloud cost-comparison data will be published before general availability.

3B-class model: milestone reached

First coherent generation achieved end-to-end through the streaming pipeline. Throughput varies by checkpoint, quantization, and build configuration; final numbers will be published before general availability.

70B-class model: three-tier load demonstrated

Internal layer-distribution snapshot from a 70B-class checkpoint loaded across the three tiers on a 16 GB consumer GPU:

4 VRAM layers · 0 RAM layers · 76 NVMe layers
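
That split is what a simple capacity-driven placement pass would produce. Below is a hypothetical greedy sketch that reproduces the counts above under assumed budgets; TierInfer's actual placement logic is not public.

```python
# Sketch of a greedy placement pass consistent with the snapshot above
# (4 VRAM / 0 RAM / 76 NVMe layers for an 80-layer checkpoint). Budgets
# and layer sizes are assumptions, not TierInfer internals.
def assign_tiers(layer_gb: float, n_layers: int,
                 vram_budget_gb: float, ram_budget_gb: float) -> dict:
    placement = {"vram": 0, "ram": 0, "nvme": 0}
    for _ in range(n_layers):
        if (placement["vram"] + 1) * layer_gb <= vram_budget_gb:
            placement["vram"] += 1   # hottest layers stay resident
        elif (placement["ram"] + 1) * layer_gb <= ram_budget_gb:
            placement["ram"] += 1    # next-warmest staged in RAM
        else:
            placement["nvme"] += 1   # everything else streams on demand
    return placement

# ~0.44 GB per layer; ~2 GB of VRAM left after KV cache and activations;
# RAM budget consumed by the OS and pinned staging buffers in this scenario.
print(assign_tiers(layer_gb=0.44, n_layers=80,
                   vram_budget_gb=2.0, ram_budget_gb=0.0))
# -> {'vram': 4, 'ram': 0, 'nvme': 76}
```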

In progress

End-to-end throughput validation for 70B-class generation, plus a head-to-head comparison vs. cloud inference cost per million tokens, will be published before TierInfer enters general availability.

Why It Matters

Frontier inference, on hardware you already own

Lower inference cost

Replace recurring cloud inference spend with a one-time hardware investment for predictable workloads.

Data stays on-premise

Proprietary data, regulated workloads, and sensitive customer information never leave your network.

Deploy on existing fleets

Run modern LLMs on consumer GPUs already deployed in workstations and edge servers — no cluster rebuild required.

Built by PulseSpark AI

A Pittsburgh AI company building inference infrastructure

TierInfer is developed by PulseSpark AI, a Pittsburgh-based AI company that builds LLM inference infrastructure and open-source AI agents (Rivet) and deploys AI for business. We are members of the NVIDIA Inception Program and Microsoft for Startups Founders Hub.

Interested in TierInfer for enterprise deployment? Get in touch.