Quantization - SkrewAI

LLM

LLM Quantization Explained: INT8, INT4, GPTQ & AWQ

A technical breakdown of how LLM quantization works, comparing INT8, INT4, GPTQ, and AWQ methods that shrink large models for faster, cheaper inference without destroying accuracy.

Quantization

Recover-LoRA: Restoring Accuracy in 2-Bit LLMs

A new technique called Recover-LoRA uses low-rank adaptation and knowledge distillation on synthetic data to reclaim accuracy lost during aggressive 2-bit quantization of language models, enabling far more efficient deployment.

Together AI

Together AI Open-Sources OSCAR for 2-Bit KV Cache

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantization system that slashes memory costs for long-context LLM serving while preserving accuracy across reasoning and retrieval benchmarks.

edge AI

Edge LLMs Are Memory Bound: LiteRT Hits 30 Tok/s

Edge LLM inference is bottlenecked by memory bandwidth, not compute. Learn how LiteRT spends extra compute to reduce memory traffic, reaching 30 tokens per second on resource-constrained devices through quantization and optimized memory access patterns.

LLM

Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF & RAG

A technical walkthrough of deploying PrismML's Bonsai 1-bit LLM on CUDA from a GGUF model file, with benchmarking, structured JSON output, chat, and retrieval-augmented generation pipelines.

model-compression

OneComp: One-Line Model Compression for Generative AI

A new framework called OneComp promises to compress generative AI models with a single line of code, potentially making diffusion and video generation models far more deployable at the edge.

LLM

LLM Quantization Explained: FP32, FP16, BF16, and INT8 Formats

Understanding numeric precision formats is crucial for deploying AI models efficiently. Learn how the FP32, FP16, BF16, and INT8 formats affect model performance, memory usage, and inference speed.

LLM Optimization

Persistent Q4 KV Cache Enables Multi-Agent LLM on Edge

New research introduces a persistent Q4-quantized KV cache for running multi-agent LLM systems on resource-constrained edge hardware, enabling local AI agents without cloud dependency.

LLM

AutoQRA: Joint Quantization and LoRA for Efficient LLM Training

New research introduces AutoQRA, a framework that jointly optimizes mixed-precision quantization and low-rank adapters, enabling more efficient fine-tuning of large language models on limited hardware.

edge AI

HQP: Hybrid Quantization-Pruning for Edge AI Inference

New research combines sensitivity-aware quantization and pruning to enable ultra-low-latency AI inference on edge devices, potentially transforming how generative models are deployed on mobile hardware.

LLM efficiency

Dynamic Mix Precision Routing Optimizes Multi-Step LLM Efficiency

New research proposes dynamic precision routing to optimize computational resources across multi-step LLM interactions, balancing quality and efficiency through adaptive quantization strategies.

LLM Optimization

How Quantization and Batching Cut LLM Energy Costs

New research explores how quantization, batching strategies, and serving optimizations dramatically reduce LLM energy consumption while maintaining performance.