edge AI - SkrewAI

deepfake detection

Scam.ai Unveils Halo Deepfake Detector, Qualcomm Deal

Scam.ai launched its Halo deepfake detection model and announced a Qualcomm partnership at Computex 2026, aiming to bring real-time synthetic media detection to edge devices and on-device hardware.

Perplexity AI

Perplexity Launches Hybrid Local-Cloud AI Inference Router

Perplexity AI unveiled a hybrid inference orchestrator for personal computers that automatically routes AI tasks between on-device models and cloud servers, balancing latency, privacy, and compute cost.
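To make the routing trade-off concrete, here is a rule-based sketch of how such a router might decide. It is not Perplexity's implementation. The `Task` fields, the `route` function, and the 4,096-token local limit are all hypothetical, and the rules simply encode the three factors named above: privacy, latency, and cost.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task descriptor; a real router would infer these signals.
    prompt_tokens: int
    contains_private_data: bool
    needs_frontier_model: bool


def route(task: Task, local_max_tokens: int = 4096, online: bool = True) -> str:
    """Return 'local' or 'cloud' using simple, hypothetical rules."""
    if task.contains_private_data:
        return "local"  # privacy: keep sensitive inputs on the device
    if not online:
        return "local"  # no network, so the on-device model is the only option
    if task.needs_frontier_model or task.prompt_tokens > local_max_tokens:
        return "cloud"  # capability or context length exceeds the local model
    return "local"  # default: skip network latency and per-call cloud cost
```

A production router would more likely score candidates with learned or measured costs than use fixed rules; the ordering here just makes privacy a hard constraint and cost a tie-breaker.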

edge AI

Edge LLMs Are Memory Bound: LiteRT Hits 30 Tok/s

Edge LLM inference is bottlenecked by memory bandwidth rather than compute. Learn how LiteRT trades extra compute for lower memory traffic, using quantization and optimized memory-access patterns to reach 30 tokens per second on resource-constrained devices.
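A back-of-the-envelope sketch of why decoding is memory bound: if every weight must be read from memory once per generated token, bandwidth divided by model size caps the token rate no matter how fast the arithmetic is. This assumes KV-cache reads, activations, and compute time are negligible, and the 1B-parameter, 50 GB/s figures are illustrative, not LiteRT measurements.

```python
def max_decode_tokens_per_sec(num_params: float, bits_per_weight: float,
                              bandwidth_gb_per_s: float) -> float:
    """Bandwidth-limited ceiling on decode speed.

    Assumes every weight is streamed from memory once per generated token
    and ignores KV-cache reads, activations, and compute time.
    """
    bytes_per_token = num_params * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token


# Illustrative numbers only: a 1B-parameter model on a device with 50 GB/s.
fp16 = max_decode_tokens_per_sec(1e9, 16, 50)  # 25 tokens/s
int4 = max_decode_tokens_per_sec(1e9, 4, 50)   # 100 tokens/s
```

Cutting the bytes per weight by 4x raises the ceiling by 4x, which is why quantization pays off even when the compute units sit partly idle.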

Liquid AI

Liquid AI's LFM2.5-350M: Big Performance, Tiny Model

Liquid AI releases a 350M-parameter model trained on 28 trillion tokens with scaled reinforcement learning, challenging assumptions about what compact models can achieve.
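A quick ratio shows why this training budget is unusual. Dividing the stated token count by the parameter count gives:

$$
\frac{28 \times 10^{12}\ \text{tokens}}{350 \times 10^{6}\ \text{parameters}} = 8 \times 10^{4}\ \text{tokens per parameter}
$$

That is roughly 4,000 times the ~20 tokens per parameter suggested by the Chinchilla compute-optimal scaling results, i.e. the model is trained far past the point those results would recommend for its size.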

Nvidia

NVIDIA Nemotron 3 Nano 4B: Hybrid Architecture for Edge AI

NVIDIA releases a compact 4B-parameter model that combines Mamba and Transformer architectures for efficient local AI inference, with 8K context support.

LLM Optimization

AirLLM: Running 70B Parameter Models on Consumer Laptops

A new library called AirLLM enables running massive 70B-parameter models on old laptops with limited RAM by loading and processing layers one at a time rather than holding the entire model in memory.
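The core idea can be sketched in a few lines, assuming each layer's weights are saved to a separate file and that a layer is just a matrix followed by ReLU. This is not AirLLM's API or code; `save_layers`, `run_layer_by_layer`, and the pickle file layout are hypothetical. It only illustrates that peak memory holds one layer instead of the whole model, at the cost of re-reading every layer from disk on each forward pass.

```python
import os
import pickle
import tempfile


def save_layers(layers, directory):
    """Write each layer's weight matrix to its own file; return the paths."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def matvec(weights, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def run_layer_by_layer(paths, x):
    """Forward pass that keeps only one layer's weights in memory at a time."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # load this layer from disk
        x = [max(0.0, v) for v in matvec(weights, x)]  # layer + ReLU
        del weights  # release it before loading the next layer
    return x


def demo():
    layers = [
        [[1.0, 0.0], [0.0, 2.0]],
        [[1.0, 1.0], [-1.0, 0.0]],
    ]
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(layers, d)
        return run_layer_by_layer(paths, [3.0, 4.0])
```

Calling `demo()` returns `[11.0, 0.0]`. Because disk reads replace RAM residency, throughput drops sharply; the technique trades speed for the ability to run at all.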

LLM Optimization

Persistent Q4 KV Cache Enables Multi-Agent LLM on Edge

New research introduces quantized KV cache persistence for running multi-agent LLM systems on resource-constrained edge hardware, enabling local AI agents without cloud dependency.
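A minimal sketch of how a 4-bit KV-cache entry could be serialized and reloaded, assuming simple symmetric quantization with one scale per vector and a hypothetical byte layout (float32 scale, uint32 count, then two 4-bit codes per byte). The research summarized above may use a different scheme, such as per-group scales; this only shows why a Q4 cache takes roughly a quarter of the space of an FP16 one and can be written out and restored between agent turns.

```python
import struct


def quantize_q4(values):
    """Symmetric 4-bit quantization with one scale for the whole vector."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return scale, [max(-8, min(7, round(v / scale))) for v in values]


def pack_q4(scale, q):
    """Serialize as float32 scale, uint32 count, then two 4-bit codes per byte."""
    nibbles = [v & 0xF for v in q]  # 4-bit two's-complement codes
    if len(nibbles) % 2:
        nibbles.append(0)  # pad to a whole byte
    body = bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
    return struct.pack("<fI", scale, len(q)) + body


def unpack_q4(blob):
    """Inverse of pack_q4: returns the dequantized float values."""
    scale, n = struct.unpack_from("<fI", blob)
    codes = []
    for b in blob[8:]:
        codes.extend((b >> 4, b & 0xF))
    return [(c - 16 if c >= 8 else c) * scale for c in codes[:n]]
```

Each restored value is within about half a quantization step of the original; the 8-byte header is fixed, so the ~4x saving over FP16 only shows up for realistically sized cache tensors.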

AI Hardware

Taalas Hardwired AI Chips Hit 17K Tokens Per Second

Startup Taalas is challenging GPU dominance with hardwired AI chips designed specifically for inference, claiming throughput of 17,000 tokens per second and aiming for ubiquitous AI deployment.

edge AI

HQP: Hybrid Quantization-Pruning for Edge AI Inference

New research combines sensitivity-aware quantization and pruning to enable ultra-low-latency AI inference on edge devices, potentially changing how generative models are deployed on mobile hardware.
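To illustrate what "sensitivity-aware" can mean in practice, here is a toy per-layer policy: layers whose sensitivity score is high keep 8-bit weights and no pruning, while less sensitive layers are magnitude-pruned to 50% sparsity and quantized to 4 bits. This is not the HQP algorithm; the threshold, sparsity level, bit widths, and the `compress_layer` function are hypothetical, and sensitivity scores are assumed to be computed elsewhere.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


def fake_quantize(weights, bits):
    """Round onto a symmetric uniform grid with 2**(bits-1) - 1 positive levels."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def compress_layer(weights, sensitivity, threshold=0.5):
    """Hypothetical policy: protect sensitive layers, compress the rest hard."""
    if sensitivity >= threshold:
        return fake_quantize(weights, bits=8)
    return fake_quantize(magnitude_prune(weights, 0.5), bits=4)
```

The design point is that a single global setting wastes accuracy on fragile layers or leaves savings on the table in robust ones; allocating precision and sparsity per layer avoids both.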