AI/ML
6 min read
As artificial intelligence (AI) models, particularly large language models (LLMs), continue to grow in scale and sophistication, the infrastructure required to efficiently deploy and serve these models has become a critical focus. Optimizing the software and hardware stack for AI serving is essential for ensuring that AI systems can meet the demands of real-time applications, offer scalability, and deliver reliable performance across diverse environments.
At Cisco Research, we remain dedicated to advancing the field of AI infrastructure, with a focus on building systems that enable efficient and adaptable AI deployment. Our research efforts address the unique challenges that come with serving large-scale models, such as resource allocation, latency reduction, and ensuring seamless operation in multi-tenant environments. These initiatives aim to make AI services more accessible, reliable, and scalable in real-world applications.
Recently, we hosted the AI Serving Infrastructure Summit, where distinguished experts from the field shared their latest research and innovations on AI model deployment, spanning both software and hardware considerations. The summit featured insightful presentations on cutting-edge techniques for optimizing model performance, managing resource usage, and handling the complexities of multi-tenant environments. Below, we summarize the key takeaways from this engaging and informative event.
Aditya Akella from the University of Texas at Austin presented innovations in adaptive inference systems for LLMs that address the growing challenges of speed, accuracy, and efficiency. Traditional inference systems often struggle with unpredictable workloads and rigid resource allocation, forcing trade-offs between accuracy and resource usage. To overcome this, Akella’s team developed Dual State Linear Attention (DSLA), which converts self-attention layers into more efficient linear attention layers during inference. This dynamic approach reduces memory usage and improves performance, especially for long-context tasks.
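DSLA's exact formulation was not covered in detail at the summit; the snippet below is only a minimal sketch of the general linear-attention idea it builds on: replacing quadratic softmax attention with a kernel feature map so that keys and values can be folded into a fixed-size state. The feature map and shapes here are illustrative assumptions, not DSLA's.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(n^2) time and memory in sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: keys and values are summarized into a
    fixed-size (d x d) state, so cost grows linearly with sequence length."""
    phi = lambda x: np.maximum(x, 0) + 1.0   # simple positive feature map (assumption)
    Qf, Kf = phi(Q), phi(K)
    kv_state = Kf.T @ V                      # (d, d) running summary of keys and values
    norm = Kf.sum(axis=0)                    # (d,) normalizer
    return (Qf @ kv_state) / (Qf @ norm[:, None] + eps)

# Tiny demo: both variants produce an (n, d) output, but the linear variant's
# state does not grow with context length.
n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```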
In addition, Akella’s research explores optimizing input processing in multimodal models, allowing systems to selectively process different input modalities to achieve high accuracy while reducing latency. By combining these model-level and input-level innovations, Akella's work introduces greater flexibility to LLM inference, enabling systems to adapt to varying resource constraints and scale effectively across diverse applications, from healthcare to automation.
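The talk did not prescribe a specific interface for this modality selection; as a rough illustration only, the sketch below uses hypothetical per-modality cost and accuracy-gain estimates (all numbers invented) to greedily decide which encoders to run under a latency budget.

```python
# Hypothetical sketch: choose which modality encoders to run for a request
# under a latency budget, favoring the modalities with the best estimated
# accuracy gain per millisecond. The profiles below are made up.
MODALITY_PROFILES = {
    # name     (est. latency ms, est. accuracy gain)
    "text":   (12.0, 0.50),
    "image":  (45.0, 0.30),
    "audio":  (30.0, 0.15),
}

def select_modalities(available, latency_budget_ms):
    """Greedy selection by accuracy-gain-per-ms until the budget is spent."""
    ranked = sorted(
        (m for m in available if m in MODALITY_PROFILES),
        key=lambda m: MODALITY_PROFILES[m][1] / MODALITY_PROFILES[m][0],
        reverse=True,
    )
    chosen, spent = [], 0.0
    for m in ranked:
        cost = MODALITY_PROFILES[m][0]
        if spent + cost <= latency_budget_ms:
            chosen.append(m)
            spent += cost
    return chosen, spent

print(select_modalities(["text", "image", "audio"], latency_budget_ms=50.0))
# -> (['text', 'audio'], 42.0): image is skipped because it would exceed the budget
```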
Rachee Singh from Cornell University introduced Morphlux, a system designed to address the growing challenge of interconnect bandwidth limitations in multi-accelerator servers used for machine learning (ML). She proposed replacing traditional electrical GPU-to-GPU connections with optical interconnects, enabling higher bandwidth, more efficient resource allocation, and reduced idle time between accelerators.
By leveraging programmable photonic fabrics, Morphlux allows for dynamic, contention-free communication between accelerators, unlocking up to 66% better bandwidth and up to 80% faster collective communication. Additionally, the system can replace failed accelerators in-place within seconds, ensuring continuous operations without job interruption.
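Morphlux's actual control plane is hardware-level and was not presented as code; the toy model below only illustrates the in-place replacement idea: the job keeps its logical ranks while the fabric's logical-to-physical mapping is reprogrammed to wire in a spare accelerator. All class and method names here are invented for the sketch.

```python
# Illustrative sketch (not Morphlux's real control plane): a photonic fabric is
# modeled as a mapping from logical ranks to physical accelerators plus optical
# circuits between them. Replacing a failed accelerator in place only rewires
# the mapping; the job keeps its logical ranks and never migrates.
class PhotonicFabric:
    def __init__(self, physical_gpus, spares):
        self.spares = list(spares)               # healthy standby GPUs
        # logical rank i -> physical GPU id
        self.rank_to_gpu = {rank: gpu for rank, gpu in enumerate(physical_gpus)}

    def circuits(self):
        """Current optical circuits: a ring over the mapped physical GPUs."""
        gpus = [self.rank_to_gpu[r] for r in sorted(self.rank_to_gpu)]
        return [(gpus[i], gpus[(i + 1) % len(gpus)]) for i in range(len(gpus))]

    def replace_failed(self, failed_gpu):
        """Swap a failed GPU for a spare by reprogramming the fabric mapping."""
        if not self.spares:
            raise RuntimeError("no spare accelerators available")
        spare = self.spares.pop(0)
        for rank, gpu in self.rank_to_gpu.items():
            if gpu == failed_gpu:
                self.rank_to_gpu[rank] = spare
        return spare

fabric = PhotonicFabric(physical_gpus=["gpu0", "gpu1", "gpu2", "gpu3"], spares=["gpu4"])
print(fabric.circuits())
fabric.replace_failed("gpu2")      # gpu2 fails mid-training
print(fabric.circuits())           # same logical ring, gpu4 wired in place of gpu2
```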
Through experimental results, she demonstrated the significant benefits of this optical-based approach. In tests involving the Llama 3.2 model, Morphlux achieved a 70% speedup in training epochs by optimizing bandwidth utilization. The system also showcased its ability to quickly recover from GPU failures, maintaining throughput and avoiding the need for time-consuming job migrations. Singh emphasized that, while the hardware is still under development, Morphlux represents a promising solution for improving efficiency and fault tolerance in large-scale machine learning deployments.
Alexey Tumanov from the Georgia Institute of Technology focused on the trade-off between high throughput and low latency in LLM inference systems. He explained that while the prefill phase can be batched efficiently, the decode phase poses a greater challenge because of its sequential nature, leading to poor GPU utilization and high tail latency. Tumanov explored various approaches, including mixed batching, where prefill and decode operations are processed together, but this introduces latency costs.
His solution, Sarathi-Serve, uses a fine-grained batching technique that splits large prefill operations into smaller chunks, which are batched with decode operations in a Service Level Objective (SLO)-aware fashion. This approach balances memory and compute resources, achieving up to 6x higher throughput while meeting strict latency constraints.
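The following is not Sarathi-Serve's scheduler, just a minimal sketch of the chunked-prefill idea under an assumed per-batch token budget: every batch admits all pending decode steps first, then spends the leftover budget on a chunk of one waiting prefill, so decodes are never starved by a long prompt.

```python
from collections import deque
from dataclasses import dataclass

# Minimal sketch of chunked-prefill scheduling (not the actual Sarathi-Serve
# scheduler): each batch reserves room for all pending decode steps first,
# then fills the remaining token budget with a chunk of one waiting prefill.
TOKEN_BUDGET = 512   # max tokens processed per batch (assumed value)

@dataclass
class Request:
    rid: str
    prompt_tokens: int            # tokens still to prefill
    decoding: bool = False        # has the request entered the decode phase?

def build_batch(waiting: deque, decoding: list):
    batch, budget = [], TOKEN_BUDGET
    # 1) Decode steps are cheap (1 token each) and latency-sensitive: admit all.
    for req in decoding:
        batch.append((req.rid, "decode", 1))
        budget -= 1
    # 2) Spend whatever budget remains on a chunk of the oldest waiting prefill,
    #    instead of running the whole (possibly huge) prefill at once.
    if waiting and budget > 0:
        req = waiting[0]
        chunk = min(req.prompt_tokens, budget)
        batch.append((req.rid, "prefill", chunk))
        req.prompt_tokens -= chunk
        if req.prompt_tokens == 0:          # prefill finished -> start decoding
            waiting.popleft()
            req.decoding = True
            decoding.append(req)
    return batch

waiting = deque([Request("A", prompt_tokens=2000), Request("B", prompt_tokens=300)])
decoding = [Request("C", prompt_tokens=0, decoding=True)]
for step in range(3):
    print(f"step {step}: {build_batch(waiting, decoding)}")
```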
The system was evaluated on Llama 2 70B and Falcon 180B models, showing improved GPU utilization and smoother performance compared to existing systems. The solution has been adopted in real-world applications, including in vLLM and by several startups. Tumanov concluded that Sarathi-Serve effectively addresses the long-standing problem of optimizing both throughput and latency in LLM inference, with potential for wider industry adoption, particularly in collaboration with companies like Cisco.
Neeraja Yadwadkar, an assistant professor at UT Austin, began by highlighting the growing importance of LLMs and the need for efficient, accessible serving systems. As LLMs are increasingly used across various applications, from chatbots to content generation, Yadwadkar emphasized the challenges in deploying these models in real-world, multi-tenant environments, where workload variability and resource constraints require dynamic, cost-effective solutions. To address these challenges, she introduced iServe, an intent-based inference serving system designed to optimize LLM deployment without requiring specialized expertise from users.
One key innovation in iServe is fingerprint-based profiling, which captures only the unique layers of a model to estimate performance and memory usage accurately, significantly reducing the computational cost of profiling. Additionally, the system incorporates dynamic resource allocation, using a hybrid strategy to deploy models across available GPUs based on real-time workload, balancing resource packing and load distribution.
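iServe's implementation details were not spelled out at the summit; the sketch below only illustrates the fingerprint idea with invented numbers: hash each layer's configuration, profile one instance per unique configuration, and scale the measurements by how often each configuration appears in the model.

```python
# Illustrative sketch of fingerprint-style profiling (not iServe's code):
# only the distinct layer configurations are actually profiled; the full-model
# latency/memory estimate is each profiled cost scaled by that config's count.
from collections import Counter

def layer_fingerprint(layer_config):
    """A hashable signature for a layer, e.g. its type and hidden size."""
    return tuple(sorted(layer_config.items()))

def profile_layer(layer_config):
    """Stand-in for running the layer once on a GPU and measuring it.
    Here we just return a fake (latency_ms, memory_mb) estimate."""
    h = layer_config["hidden"]
    return (0.002 * h, 0.5 * h)

def estimate_model(layer_configs):
    counts = Counter(layer_fingerprint(c) for c in layer_configs)
    total_latency = total_memory = 0.0
    for fp, count in counts.items():
        latency, memory = profile_layer(dict(fp))   # profile each unique layer once
        total_latency += latency * count
        total_memory += memory * count
    return total_latency, total_memory, len(counts)

# A toy "model": 32 identical transformer blocks plus an embedding layer.
layers = [{"kind": "block", "hidden": 4096}] * 32 + [{"kind": "embed", "hidden": 4096}]
lat, mem, profiled = estimate_model(layers)
print(f"profiled {profiled} unique layers to estimate {lat:.1f} ms, {mem:.0f} MB")
```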
Yadwadkar demonstrated that iServe outperforms existing systems like Accelerate, AlpaServe and TensorRT in metrics such as latency, throughput, and resource utilization, while also meeting user-defined goals like minimizing latency or reducing costs. With an average error margin of less than 7% in resource estimation, iServe offers a scalable and modular framework, with potential for further innovation in handling heterogeneous GPU clusters. This work is part of an ongoing project, with plans for continued development and improvements.
My top 5 takeaways from the summit:
In summary, the speakers collectively highlight the importance of efficient model serving systems, the balance between performance and resource usage, and the need for more adaptable, user-centric solutions that can dynamically adjust to changing workloads in multi-tenant environments.
Subscribe to Outshift’s YouTube channel for more featured Cisco Research summits.