KubeCon Europe 2026 made AI inference its central focus with major CNCF donations including llm-d, Nvidia's GPU DRA driver ...
Red Hat is pushing Kubernetes inference into the mainstream by contributing llm-d to the CNCF, as enterprises race to run AI ...
The technique aims to ease GPU memory constraints that limit how enterprises scale AI inference and long-context applications ...
NVIDIA Boosts LLM Inference Performance With New TensorRT-LLM Software Library. As companies like d-Matrix squeeze into the lucrative artificial intelligence market with ...
Google has introduced TurboQuant, a compression algorithm that reduces large language model (LLM) memory usage by at least 6x ...
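The memory savings come from quantization: storing each weight in a few bits plus a shared scale factor instead of a full float. The sketch below is a generic symmetric 4-bit weight-quantization example to illustrate where a 6x-or-better reduction comes from; it is not TurboQuant's actual algorithm, whose details are not given here.

```python
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Symmetric per-row 4-bit quantization: a generic illustration of
    weight compression, NOT Google's TurboQuant algorithm."""
    # One scale per output row keeps quantization error local to that row.
    scales = np.max(np.abs(weights), axis=1, keepdims=True) / 7.0  # int4 range [-7, 7]
    scales[scales == 0] = 1.0          # avoid division by zero for all-zero rows
    q = np.clip(np.round(weights / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4096)).astype(np.float32)  # a toy weight matrix
q, s = quantize_int4(w)
w_hat = dequantize(q, s)

# fp32 weights cost 4 bytes/param; packed 4-bit weights cost 0.5 bytes/param
# plus one fp32 scale per row, so the ratio approaches 8x for wide rows.
ratio = w.nbytes / (q.size * 0.5 + s.nbytes)
err = np.abs(w - w_hat).max()  # rounding error is bounded by half a scale step
print(round(ratio, 2))
```

In practice the compression ratio depends on the bit width, the granularity of the scales (per-tensor, per-row, per-group), and whether activations and the KV cache are quantized too.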
“Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs are usually ...
The AI chip giant says the open-source software library, TensorRT-LLM, will double the H100’s performance for running inference on leading large language models when it comes out next month. Nvidia ...
When it comes to deploying local LLMs, many people assume that spending more money will deliver more performance, but that's far from reality. That's ...
Forged in collaboration with founding contributors CoreWeave, Google Cloud, IBM Research and NVIDIA and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI and university ...