LLM Memory Calculator

VRAM Calculator for AI

Calculate GPU memory requirements for running and training Large Language Models. Get instant VRAM estimates for inference, training, and LoRA fine-tuning.

Accurate Formulas
GPU Recommendations
10+ Model Presets
All Quantizations
Calculate VRAM Now
Client-Side Only
Instant Results
180K+ Monthly Searches

Complete Guide to VRAM Requirements for Large Language Models

Calculate exact GPU memory needed to run or train any LLM instantly. Our VRAM calculator helps you determine memory requirements for Llama, Mistral, and GPT models across all precision formats. Avoid out-of-memory errors and find the right GPU for your AI workload with precise mathematical formulas used by ML engineers worldwide.

Understanding VRAM Requirements for AI Models

VRAM (Video RAM) is the dedicated memory on your graphics card that stores model weights, activations, and intermediate computations during AI inference or training. Unlike system RAM, VRAM provides the high bandwidth needed for the matrix operations in neural networks, as documented in NVIDIA's optimization guidelines. Running a 7 billion parameter model in FP16 precision requires approximately 14 GB of VRAM for the model weights alone, plus additional memory for the KV cache and activations produced by the attention mechanism from Attention Is All You Need. Understanding these requirements prevents expensive hardware mistakes and deployment failures. Use our Bottleneck Calculator to ensure balanced system performance.

VRAM Calculation Components:

Memory Breakdown
  • Model Weights: Parameters × (precision bits ÷ 8) × 1.2 overhead factor (see HuggingFace docs)
  • KV Cache: Stores attention keys and values for context (research paper)
  • Activations: Intermediate layer outputs during forward pass
  • Optimizer States: Adam optimizer uses 8 bytes per parameter for training
  • Gradients: Same size as model weights for backpropagation
Quick Rules of Thumb
  • Inference FP16: ~2GB per billion parameters (PyTorch AMP)
  • Inference INT8: ~1GB per billion parameters (INT8 quantization)
  • Inference INT4: ~0.5GB per billion parameters (GPTQ paper)
  • Training FP16: ~8GB per billion parameters
  • LoRA Fine-tuning: ~3GB per billion parameters (use our PSU Calculator); see the Python sketch just below
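To make these rules of thumb concrete, here is a minimal Python sketch. The function name and defaults are illustrative assumptions rather than any library API, and it ignores the KV cache and activations that the calculator above adds on top.

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,   # FP16 = 2, INT8 = 1, INT4 = 0.5
                     mode: str = "inference") -> float:
    """Rough VRAM estimate in GB for a dense transformer (hypothetical helper)."""
    # Weights formula from above: parameters x bytes per parameter x 1.2 overhead
    weights_gb = params_billion * bytes_per_param * 1.2
    if mode == "inference":
        return weights_gb
    if mode == "lora":
        # Frozen base weights plus small adapters and their optimizer states
        return params_billion * 3.0
    if mode == "training":
        # Weights + gradients + Adam states, per the ~8 GB per billion rule
        return params_billion * 8.0
    raise ValueError(f"unknown mode: {mode}")

print(estimate_vram_gb(7))                    # ~16.8 GB: 14 GB FP16 weights + 20% overhead
print(estimate_vram_gb(7, mode="training"))   # ~56 GB for full training
```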

Four Use Case Modes:

⚡ Inference
Running models for predictions with minimal memory overhead (TensorFlow guide)
🎓 Training
Full training from scratch with optimizer states and gradients
🔧 Fine-tuning
Adapting pre-trained models for specific tasks (HuggingFace guide)
🎯 LoRA
Parameter-efficient training with low-rank adapters

Why Accurate VRAM Calculation Matters:

  • Hardware Planning: Buy the right GPU without overspending on VRAM
  • Avoid OOM Errors: Prevent out-of-memory crashes during production
  • Cost Optimization: Choose cloud instances that match your exact needs
  • Batch Size Planning: Maximize throughput within memory constraints
  • Quantization Decisions: Balance quality vs memory tradeoffs intelligently

Quantization Guide: Memory vs Quality Tradeoffs

FP32 (32-bit Floating Point)

Original training precision with maximum accuracy per the IEEE 754 standard, but with 2× the memory footprint of FP16

Memory per Param: 4 bytes
7B Model Size: ~28 GB
Use Case: Research only

FP16 / BF16 (16-bit Floating Point)

The most common inference precision, as recommended in the PyTorch AMP docs, with near-original quality and 50% memory savings versus FP32

Memory per Param: 2 bytes
7B Model Size: ~14 GB
Quality Loss: Negligible

INT8 (8-bit Integer)

8-bit post-training quantization offering a 75% memory reduction versus FP32 with minimal quality degradation

Memory per Param: 1 byte
7B Model Size: ~7 GB
Quality Loss: 5-10% perplexity

INT4 / 4-bit (GPTQ, AWQ)

Aggressive quantization using the GPTQ or AWQ algorithms, delivering 87.5% memory savings versus FP32 on consumer GPUs

Memory per Param: 0.5 bytes
7B Model Size: ~3.5 GB
Quality Loss: 10-20% perplexity

2-bit (Extreme Quantization)

Experimental ultra-low precision for resource-constrained devices

Memory per Param: 0.25 bytes
7B Model Size: ~1.75 GB
Quality Loss: Significant
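As a concrete example of the INT4 row above, here is a minimal sketch of loading a model in 4-bit with Hugging Face transformers and bitsandbytes. The model id and NF4 settings are example choices; the exact footprint varies because embeddings and norms stay unquantized.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # example model id
    quantization_config=bnb_config,
    device_map="auto",
)

# How much memory the quantized weights actually occupy
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 4-5 GB for a 7B model
```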

VRAM Requirements for Popular LLM Models

Llama 3 8B - Most Popular Open Model

Meta's flagship Llama 3 model delivering GPT-3.5 level performance with efficient memory usage. Calculate power needs with our PSU Calculator

FP16 Inference
16 GB
RTX 4090, A100 40GB
INT8 Inference
8 GB
RTX 3090, RTX 4070 Ti
INT4 Inference
4 GB
RTX 3060, consumer GPUs
LoRA Training
22 GB
RTX 4090, A100 40GB

Llama 3 70B - Production Quality Model

State-of-the-art open model from Meta AI Research competing with GPT-4; full-precision inference requires substantial VRAM spread across multi-GPU setups

FP16 Inference
140 GB
2× A100 80GB
INT8 Inference
70 GB
1× A100 80GB or H100
INT4 Inference
35 GB
A100 40GB, RTX 6000 Ada
Full Training
560 GB
8× A100 80GB cluster

Mistral 7B - Efficient Performance Leader

Outperforms Llama 2 13B with only 7B parameters using sliding window attention

FP16 Inference
14 GB
RTX 4090, V100 16GB
INT8 Inference
7 GB
RTX 3090, RTX 4080
INT4 Inference
3.5 GB
GTX 1080 Ti, RTX 3060
LoRA Training
18 GB
RTX 4090, RTX A5000

Mixtral 8x7B - Mixture of Experts Architecture

47B total parameters using MoE architecture but only 13B active per token for efficient inference

FP16 Inference
94 GB
2× A100 80GB (all experts loaded)
INT8 Inference
47 GB
A100 80GB, H100
INT4 Inference
24 GB
RTX 4090, RTX A6000
Active Memory
26 GB
Per-token (FP16)

GPT-3 175B - Industry Benchmark

The GPT-3 model from OpenAI that started the LLM revolution; it requires significant infrastructure and multi-GPU clusters

FP16 Inference
350 GB
5× A100 80GB minimum
INT8 Inference
175 GB
3× A100 80GB or 2× H100 94GB
INT4 Inference
88 GB
2× A100 80GB
Full Training
1.4 TB
Enterprise clusters only

Phi-2 2.7B - Microsoft's Efficient Gem

Microsoft Research's Phi-2 tiny model with surprising capabilities, perfect for consumer hardware and edge devices

FP16 Inference
5.4 GB
GTX 1080, RTX 3060
INT8 Inference
2.7 GB
GTX 1060, most GPUs
INT4 Inference
1.4 GB
Integrated graphics possible
Full Training
22 GB
RTX 4090, RTX A5000

GPU Recommendations by VRAM Capacity

Consumer GPUs (8-16 GB VRAM)

Entry-level options for small models and quantized inference. Compare specifications on NVIDIA GeForce and AMD Radeon

NVIDIA RTX 3060
VRAM: 12 GB GDDR6
Best For: Llama 7B INT4, Phi-2 FP16
Price: ~$300
Budget-friendly starter
AMD RX 7600 XT
VRAM: 16 GB GDDR6
Best For: Llama 7B INT8, small training
Price: ~$330
Great value option
NVIDIA RTX 4060 Ti
VRAM: 16 GB GDDR6
Best For: Mistral 7B FP16, LoRA experiments
Price: ~$500
Modern efficiency

Enthusiast GPUs (20-24 GB VRAM)

High-performance options for serious AI development and larger models

NVIDIA RTX 3090
VRAM: 24 GB GDDR6X
Best For: Llama 13B FP16, Mistral 7B training
Price: ~$800 (used)
Best value 24GB
NVIDIA RTX 4090
VRAM: 24 GB GDDR6X
Best For: Llama 13B training, 70B INT4
Price: ~$1,600
Fastest consumer GPU
NVIDIA RTX A5000
VRAM: 24 GB GDDR6
Best For: Professional workloads, stable drivers
Price: ~$2,000
Workstation grade

Professional GPUs (40-48 GB VRAM)

Professional-grade hardware for production deployments and large models

NVIDIA A40
VRAM: 48 GB GDDR6
Best For: Llama 70B INT8, Mixtral INT4
Price: ~$4,500
Data center ready
NVIDIA L40S
VRAM: 48 GB GDDR6
Best For: High throughput inference, multi-modal
Price: ~$8,000
Latest generation
NVIDIA RTX 6000 Ada
VRAM: 48 GB GDDR6
Best For: Workstation AI, visualization + inference
Price: ~$6,800
Workstation flagship

Enterprise GPUs (80+ GB VRAM)

Top-tier NVIDIA data center accelerators for the largest models and production scale deployments

NVIDIA A100 80GB
VRAM: 80 GB HBM2e
Best For: Llama 70B FP16, training large models
Price: ~$15,000
Gold standard
NVIDIA H100 80GB
VRAM: 80 GB HBM3
Best For: Fastest training, GPT-3 inference
Price: ~$30,000
Cutting edge
NVIDIA H100 94GB
VRAM: 94 GB HBM3
Best For: Maximum capacity for massive models
Price: ~$35,000
Ultimate performance

Memory Optimization Techniques

1. Flash Attention - Attention Memory Optimization

Cut attention memory overhead with fused kernels that avoid materializing the full attention score matrix, typically saving 10-15% of total VRAM

Standard Attention:
• Materializes the full seq × seq attention score matrix
• KV cache: 2 × layers × hidden × seq × batch × bytes per element
Memory intensive
Flash Attention:
• Fused kernels process attention in tiles, recomputing instead of storing
• Never materializes the full attention matrix
~10% memory savings
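For reference, PyTorch's built-in scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel when the GPU, dtype, and shapes allow it, so you get the memory savings without writing custom kernels. A minimal sketch with made-up tensor shapes:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Illustrative shapes: batch 1, 32 heads, 2048 tokens, head dim 128
q = torch.randn(1, 32, 2048, 128, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# A fused kernel is selected automatically when supported; the full
# 2048 x 2048 attention score matrix is never materialized in memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 2048, 128])
```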

2. Gradient Checkpointing - Training Memory Reduction

Save 70% activation memory by recomputing during backward pass

Without Checkpointing:
• Store all layer activations
• Memory scales with model depth
High memory usage
With Checkpointing:
• Only store checkpoint layers
• Recompute others during backprop
70% activation savings
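A minimal PyTorch sketch of the idea using torch.utils.checkpoint on a toy layer stack (layer count and sizes are made up); Hugging Face models expose the same behavior with a single call to model.gradient_checkpointing_enable().

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Toy stand-in for a stack of transformer blocks
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(24)])
x = torch.randn(8, 1024, requires_grad=True)

# Split the 24 layers into 4 segments: only segment boundaries keep their
# activations; everything else is recomputed during the backward pass.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```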

3. LoRA - Parameter Efficient Fine-Tuning

Train models with up to 90% less memory by updating only the low-rank adapters, as detailed in Microsoft's research on LoRA implementation

Full Fine-tuning:
• Update all model parameters
• Requires optimizer states for billions of params
Llama 7B: ~56 GB VRAM
Very expensive
LoRA Fine-tuning:
• Freeze base model, train adapters
• Trainable params: <1% of total
Llama 7B: ~20 GB VRAM
Highly efficient
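A minimal sketch with the Hugging Face peft library; the model id, rank, and target modules are example choices and should be adapted to your architecture.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # example model id
    torch_dtype=torch.float16,
    device_map="auto",
)

config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)      # base weights frozen, adapters trainable
model.print_trainable_parameters()         # typically well under 1% of total parameters
```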

4. Model Parallelism - Split Across Multiple GPUs

Distribute model layers across GPUs using PyTorch model parallelism when a single GPU is insufficient. Also see the HuggingFace multi-GPU guide

Tensor Parallelism: Split individual layer computations across GPUs
Pipeline Parallelism: Place different layers on different GPUs
Example: Run Llama 70B FP16 (~140 GB) across 2× A100 80GB when no single GPU has enough memory
Enables larger models on available hardware
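A minimal sketch of the Hugging Face/Accelerate route, where device_map="auto" places whole layers on whichever GPUs are visible (pipeline-style placement). The model id is an example; tensor parallelism with frameworks such as Megatron-LM or vLLM is configured separately.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B"    # example id; needs enough total VRAM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                      # shards layers across available GPUs
)
print(model.hf_device_map)                  # shows which layers landed on which GPU
```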

5. Mixed Precision Training - Speed and Memory Benefits

Use Automatic Mixed Precision (AMP) for FP16 computation while keeping FP32 master weights per NVIDIA's research

Activations & Gradients: Stored in FP16 (2 bytes/param)
Master Weights: Kept in FP32 for numerical stability
Memory Savings: 30-40% compared to full FP32 training
Speed Boost: 2-3× faster on Tensor Core GPUs
Standard practice for modern training
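A minimal training-loop sketch with torch.cuda.amp; the model, data, and loop length are placeholders.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # rescales the loss to avoid FP16 underflow

data = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Ops inside autocast run in FP16 where safe; master weights stay FP32.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = F.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```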

6. Sequence Length Optimization - Context Window Management

The KV cache scales linearly with sequence length, so size the context window for your use case

2K Context:
Llama 7B FP16
~14 GB
8K Context:
Llama 7B FP16
~18 GB
32K Context:
Llama 7B FP16
~30 GB
Use shortest context needed for your task
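A small sketch of the KV-cache formula from the Flash Attention section, with Llama-7B-style defaults (32 layers, 4096 hidden size) assumed for illustration.

```python
def kv_cache_gb(seq_len: int, batch: int = 1, layers: int = 32,
                hidden: int = 4096, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 (K and V) x layers x hidden x seq x batch x bytes per element."""
    return 2 * layers * hidden * seq_len * batch * bytes_per_elem / 1e9

for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.1f} GB of KV cache")
# ~1.1 GB at 2K, ~4.3 GB at 8K, ~17.2 GB at 32K -- on top of ~14 GB of FP16 weights
```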

Common VRAM Calculation Mistakes to Avoid

❌ Forgetting KV Cache Memory

Many people calculate only the model weights, but the KV cache can add 20-50% more memory depending on sequence length and batch size. Always include it in your estimates, or use our calculator above.

❌ Ignoring CUDA Overhead

PyTorch and the CUDA runtime need 0.5-1 GB for themselves. Apply a ~20% overhead multiplier so you don't run out of VRAM unexpectedly.

❌ Confusing Training vs Inference

Training requires roughly 4× more memory than inference due to optimizer states and gradients. Don't buy a 24 GB GPU expecting to train a model it can only serve for inference.

❌ Not Accounting for Batch Size

Increasing batch size from 1 to 8 can multiply VRAM usage by 4-6×. Always calculate for your target throughput, not just batch size 1.

❌ Overestimating Quantization Quality

INT4 quantization saves substantial memory but can degrade quality noticeably. Test your use case before committing to aggressive quantization levels.

❌ Mixing Up MoE Architecture Memory

Mixtral 8x7B has 47B total parameters per MoE research but still requires memory for ALL experts loaded. Only compute is reduced, not VRAM needs.
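A quick back-of-the-envelope check with approximate Mixtral parameter counts shows why VRAM follows the total, not the active, parameter count.

```python
total_params = 46.7e9    # all 8 experts plus shared layers must sit in VRAM
active_params = 12.9e9   # parameters actually used per token (2 experts routed)
bytes_fp16 = 2

print(f"Weights held in VRAM:   {total_params * bytes_fp16 / 1e9:.0f} GB")   # ~93 GB
print(f"Weights read per token: {active_params * bytes_fp16 / 1e9:.0f} GB")  # ~26 GB
```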