Last month, the DeepSpeed Team announced ZeRO-Infinity, a step forward in training models with tens of trillions of parameters. In addition to creating optimizations for scale, our team strives to introduce features that also improve speed, cost, and usability. As the DeepSpeed optimization library evolves, we are listening to the growing DeepSpeed community to learn how users are engaging with the library and to take on new frontiers to expand the capabilities of DeepSpeed.

One important aspect of large AI models is inference: using a trained AI model to make predictions against new data. But inference, especially for large-scale models, like many aspects of deep learning, is not without its hurdles. Two of the main challenges with inference are latency and cost. Large-scale models are extremely computationally expensive and often too slow to respond in many practical scenarios. Moreover, these models with tens or hundreds of billions of parameters, trained with aggregated memory from multiple GPUs, simply become too large to fit on a single GPU's device memory for inference. For example, a single NVIDIA V100 Tensor Core GPU with 32 GB of memory can only fit up to a 10-billion-parameter model for inference, and the latency is limited by single-GPU performance. To accommodate even bigger models, and to achieve faster and cheaper inference, we have added DeepSpeed Inference, with high-performance multi-GPU inferencing capabilities.

DeepSpeed Inference at a glance: As requested by many users, DeepSpeed rolls out high-performance inference support for large Transformer-based models with billions of parameters, like those at the scale of Turing-NLG 17B and OpenAI GPT-3 175B. Our new technologies for optimizing inference cost and latency include:

- Inference-adapted parallelism allows users to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost.
- Inference-optimized CUDA kernels boost per-GPU efficiency by fully utilizing the GPU resources through deep fusion and novel kernel scheduling.
- Effective quantize-aware training allows users to easily quantize models that can efficiently execute with low precision, such as 8-bit integer (INT8) instead of 32-bit floating point (FP32), leading to both memory savings and latency reduction without hurting accuracy.

Together, DeepSpeed Inference shows 1.9–4.4x latency speedups and 3.4–6.2x throughput gain and cost reduction when compared with existing work.

Affordable, fast, and accurate training: Beyond inference, another key ask from DeepSpeed users is to reduce the training time of large-scale models without adding additional hardware. In this release, we introduce new compressed-training strategies to support fast and low-cost training while simultaneously delivering high accuracy. We also provide a new profiling tool to identify training performance bottlenecks.
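
To make the single-GPU memory limit mentioned earlier concrete, here is a back-of-the-envelope sketch (our illustration, not a calculation from the announcement): it counts model weights only, assuming 2 bytes per parameter in FP16 and 4 bytes in FP32, and ignores activations, KV caches, and framework workspace.

```python
# Rough check of why a 10-billion-parameter model is near the limit of a 32 GB V100.
# Assumptions (illustrative only): weights only, FP16 = 2 bytes/param, FP32 = 4 bytes/param.
params = 10e9                      # 10-billion-parameter model
gpu_memory_gb = 32                 # NVIDIA V100 with 32 GB of device memory

fp16_gb = params * 2 / 1024**3     # ~18.6 GB of weights: fits, leaving room for activations
fp32_gb = params * 4 / 1024**3     # ~37.3 GB of weights: already exceeds the 32 GB card

print(f"FP16 weights: {fp16_gb:.1f} GB, FP32 weights: {fp32_gb:.1f} GB "
      f"on a {gpu_memory_gb} GB GPU")
```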
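
To illustrate how the inference features above fit together in practice, here is a minimal sketch built around DeepSpeed's public `deepspeed.init_inference` entry point. The model name, generation call, and argument values are illustrative assumptions on our part; argument names such as `mp_size` and `replace_with_kernel_inject` follow DeepSpeed's published inference examples but may differ across versions.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for a multi-billion-parameter model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with the DeepSpeed-Inference engine:
#  - mp_size sets the model-parallel degree (inference-adapted parallelism)
#  - dtype=torch.half runs the model in FP16
#  - replace_with_kernel_inject swaps Transformer layers for the fused,
#    inference-optimized CUDA kernels
ds_engine = deepspeed.init_inference(model,
                                     mp_size=1,
                                     dtype=torch.half,
                                     replace_with_kernel_inject=True)
model = ds_engine.module

inputs = tokenizer("DeepSpeed Inference makes serving large models",
                   return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Increasing `mp_size` (and launching one process per GPU) is how the same script would shard a model that, as discussed above, no longer fits in a single GPU's memory.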