
Justin Yi
Articles
-
Jul 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Abu Qader | Justin Yi
You can now swap LoRAs during LLM inference with TensorRT-LLM on Baseten. This means you can serve thousands of fine-tuned variants of an LLM from a single GPU while maintaining a low time to first token (TTFT) and a high tokens per second (TPS). Let’s say you’re building a product that requires many fine-tuned models, such as a support system with a specialized LLM per customer or a chatbot that’s adjusted to each user’s preferences.
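To make the per-request adapter selection concrete, here is a minimal client sketch. The endpoint URL, the `lora_id` field, and the payload shape are all placeholders for illustration, not Baseten's actual API; the key idea is that every request names the adapter it wants, and the server applies those LoRA weights to the shared base model for that request only.

```python
import requests

# Hypothetical endpoint and payload for a LoRA-swapping deployment.
MODEL_URL = "https://model-abc123.api.baseten.co/production/predict"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

def generate(prompt: str, lora_id: str) -> str:
    resp = requests.post(
        MODEL_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"prompt": prompt, "lora_id": lora_id, "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["output"]

# Two customers share one GPU; only the adapter differs per request.
print(generate("Summarize my open tickets.", lora_id="customer-acme"))
print(generate("Summarize my open tickets.", lora_id="customer-globex"))
```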
-
Mar 14, 2024 | baseten.co | Abu Qader | Pankaj Gupta | Justin Yi | Philip Kiely
Baseten has achieved best-in-class performance for key latency and throughput metrics as measured by independent researchers at Artificial Analysis. The Artificial Analysis benchmark measures essential metrics for model performance. Time to first token (TTFT): the time from when a request is sent to the model to when the first token (or chunk) of output is received. Tokens per second (TPS): the average number of tokens per second received during the entire response.
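Both metrics are straightforward to measure against any streaming endpoint. Below is a minimal sketch, assuming a generator that yields one token per chunk; the function names and the fake stream are illustrative, not part of the Artificial Analysis harness.

```python
import time
from typing import Iterable, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (TTFT in seconds, average TPS) for one streamed response.

    Assumes each yielded chunk is one token; a real harness would count
    tokens with the model's tokenizer.
    """
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first chunk received: fixes TTFT
        n_tokens += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    end = time.perf_counter()
    ttft = first - start
    tps = n_tokens / (end - start)  # averaged over the entire response
    return ttft, tps

def fake_stream():
    time.sleep(0.2)           # simulated time to first token
    for _ in range(50):
        time.sleep(0.01)      # simulated inter-token latency
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.1f}")
```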
-
Mar 14, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Abu Qader | Justin Yi
Quantization is the process of mapping model parameters from one data format (most commonly FP16 for LLMs) to a smaller data format, like INT8 or FP8. Quantizing a model offers faster, less expensive inference.
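A toy example of the mapping, using textbook symmetric INT8 quantization with NumPy; real engines use more careful calibration, so this is only a sketch of the idea: pick a scale so the largest weight magnitude maps to 127, round, and dequantize by multiplying the scale back.

```python
import numpy as np

# Toy symmetric INT8 quantization of FP16 weights.
weights = np.random.randn(4, 4).astype(np.float16)

scale = np.abs(weights).max() / 127.0                       # largest magnitude -> 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequant = q.astype(np.float16) * scale                      # approximate reconstruction
max_err = np.abs(weights.astype(np.float32) - dequant.astype(np.float32)).max()
print(f"scale={scale:.5f}, max reconstruction error={max_err:.5f}")
```

The INT8 tensor holds the same information at a quarter of FP32's footprint (half of FP16's), which is where the speed and cost savings come from.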
-
Feb 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Justin Yi | Timur Abishev
Using NVIDIA TensorRT, we were able to improve SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 Tensor Core GPUs. We achieved these performance gains by individually optimizing each component in the SDXL image generation pipeline. Performance gains are greater for higher step counts and more powerful GPUs, and the techniques used can be applied to similar image generation pipelines, including SDXL Turbo.
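The article compiles each pipeline component into a TensorRT engine. As a stand-in that shows the same per-component idea with a simpler toolchain, this sketch uses PyTorch's torch.compile on a diffusers SDXL pipeline: the UNet dominates runtime at high step counts, while the VAE decoder runs once per image, so each is compiled separately. This is not the article's TensorRT workflow, just an illustration of the approach.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Optimize components individually: the UNet runs once per denoising step,
# the VAE decoder once per image.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead")

image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
image.save("astronaut.png")
```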