INSIGHTS

8 min read

Blog thumbnail
Published on 05/30/2024
Last updated on 06/18/2024

AI infrastructure: How to prepare your organization for transformation

Share

Recent advancements in artificial intelligence (AI), including the development of open-source large language models (LLMs), are prompting organizations to accelerate their adoption of AI solutions. When aligned with your strategic goals, AI can give you a competitive edge, facilitating a more efficient and productive workforce and smarter risk mitigation. AI can enhance various business processes, such as marketing and customer experience, to drive revenue and boost market share.

Defining your transformation goals is a crucial first step in a comprehensive AI readiness assessment. AI development and maintenance is a huge investment, so choosing IT infrastructure that aligns with your strategic roadmap and technology gaps is critical to success.

For some companies, AI transformation means integrating a pre-built LLM, like ChatGPT or Mistral, into your operations. These models, in combination with autonomous agents, can perform general functions and connect to your existing systems through an application programming interface (API). Other businesses may take AI transformation further by fine-tuning LLMs and training them for custom applications. 

The former approach has relatively few new demands on an organization’s IT infrastructure besides additional security considerations. But LLM development, even when based on pre-built models, requires significant IT upgrades from what is typically needed for general-purpose computing. 

Access to Graphics Processing Units (GPUs) and other computing resources is of particular concern to those fine-tuning and deploying enterprise models. To ensure success, organizations must establish AI readiness with appropriate computing power, data management, and security.

AI infrastructure for transformation

Cisco’s AI Readiness Index has discovered that, while organizations are keen to leverage AI, many lag in areas like IT infrastructure and data governance. Nearly two-thirds of business leaders feel the urgency of AI adoption, expecting to face negative business consequences if they fail to transform in the next year. Yet less than a quarter of those surveyed say they have sufficient GPUs to support current and future workloads. Over half will need infrastructure upgrades to use more complex AI systems.

Infrastructure requirements for AI development are very different from those appropriate for general-purpose computing. Most enterprises will need major upgrades to their computing resources, data management, and security tools to fine-tune AI models offered by cloud providers. Beyond the infrastructure itself, you’ll also need to consider strategies for hiring or AI workforce development to acquire the skills to build and maintain these systems.

AI computing power and GPU resourcing

Sri Aradhyula, a Technical Lead at Outshift by Cisco, stresses the importance of securing GPU resources before engaging in AI development. “The biggest thing companies need to invest in is understanding their access to GPU machines that are high performing. That’s the biggest bottleneck,” he says.

AI workloads are compute-intensive and need high-power GPUs to operate. However, these resources are expensive and limited in availability, so organizations must be strategic about access. Large companies building foundational models dominate the market, buying up available capacity. This makes it difficult for smaller players to procure GPUs—a challenge exacerbated by supply chain issues. 

One strategy is to build your compute infrastructure independently by purchasing GPUs and the resources needed to house and maintain them. Alternatively, you could partner with cloud providers to access GPU instances. Both approaches have pros and cons. 

Aradhyula recommends performing a cost-benefit analysis to find the right balance. “Cloud providers take on the responsibility of scaling and managing GPUs and the cooling requirements for those racks. You pay on-demand—the tradeoff is cost and availability,” he says. 

Partnering with a cloud provider also makes sense for enterprises that want to access the fastest GPUs rather than purchasing hardware that can quickly become obsolete. 

“Let’s say you procure a few million dollars of GPUs—in a year or two, that investment could be completely outdated,” Aradhyula says. “Whereas, if a hardware vendor comes out with a new GPU, you can quickly pivot and start training your model or running your instance with the latest version.”

Capacity planning resources

Due to their cost and limited availability, optimizing your compute resources is as crucial to AI readiness as obtaining GPUs. Because model training happens in short bursts, a robust and automated Machine Learning Operations (MLOps) framework is necessary to help maximize your investment while managing costs. MLOps encompasses various strategies and systems that use efficient queueing systems and training schedules for model development.

Organizations with in-house infrastructure can control their schedules, but those relying on cloud providers must consider capacity planning. Cloud providers may lack the necessary capacity for model training or production, so it’s crucial for your organization to plan ahead and align product roadmaps with GPU availability.

According to Aradhyula, the best approach is to forecast your computing or traffic demand and extrapolate what you’ll need before talking to cloud partners. “If you’re going into production, you need multiple months of lead time on some of these high-capacity GPUs,” he says.

Accessing GPUs and optimizing AI computing power is a tricky balance, but one that will differentiate companies in the market. “Your speed to market is based on how fast your GPU is. If you have a faster GPU, you can launch your product faster than somebody else—whether it’s an AI product or a product built using AI.”

Data storage and engineering

Data readiness is another concern when transforming infrastructure for AI. In most organizations, data isn’t immediately available for fine-tuning AI models, because it isn’t properly stored, cleansed, or processed. Cisco’s AI Readiness Index found that data is still siloed for 81% of organizations, awaiting solutions like data mesh architecture to prepare it for model training.

AI model outputs are only as reliable as the data used to train them, so you need to ensure that these datasets are high-quality and relevant to your desired use cases. This can be accomplished by centralizing, processing, and cleansing data before it’s used for model training. Data mesh concepts are useful at this stage, helping remove silos and prepare data at scale for AI applications. You can also invest in validation systems that check data for accuracy, make corrections, or eliminate duplicate data to enable high-quality outputs. 

Additionally, businesses need to consider their approach to data storage and transfer. Unlike GPU resourcing, which remains an uphill battle for AI innovators, data engineering infrastructure is readily available—it just comes down to what best suits your organization. As with your compute resourcing strategy, perform a cost-benefit analysis to compare the value of on-premise or cloud provider-based storage solutions. 

Cloud providers typically have the cheapest options with different tiers to meet your requirements. For example, cold data services like Amazon Glacier can archive datasets that aren’t required for immediate use at an affordable price. When your model is ready for training, you can simply upgrade to a tier with more flexible access. 

Security infrastructure

According to Cisco’s AI Readiness Index, 39% of business leaders say they’re only moderately capable of handling AI security risks with their current infrastructure. Worryingly, a quarter of them aren’t versed in emerging threats that target AI software.

Understanding AI-specific risks and establishing appropriate safeguards are crucial steps for a responsible and trustworthy transformation. AI technologies introduce unique security challenges that differ from traditional network cybersecurity, necessitating new strategies to address them effectively.

For example, attackers can use prompt-based techniques to access sensitive training data without needing to directly compromise an AI system. In this case, enterprises must develop new prompt validation strategies to flag malicious user queries.

Verifying your cloud providers’ security infrastructure is also key to protecting your AI models and training data. Evaluate what security practices these companies follow, including how they secure APIs or whether they comply with industry standards like Service Organizational Control Type 2 (SOC 2).

Even if your cloud providers use sufficient security protocols, layering on additional safeguards will strengthen your security posture for added peace of mind. Consider developing or acquiring security software to address LLM-specific risks, such as prompt-based attacks. 

Plan your AI infrastructure now to stay ahead long-term 

Organizations can generate significant value by adapting pre-trained models for specialized use cases. However, tailoring these models is still compute-intensive and demands robust GPU capacity, even without developing an AI model from scratch.  

To ensure a successful transformation, evaluate your AI roadmap and understand your development team’s needs. Adopt a long-term perspective when making infrastructure decisions, choosing the resourcing strategy (in-house vs. cloud-based) that promises to produce the greatest cost-benefit ratio. 

Consider approaches that give you the flexibility to use the latest computing technologies, especially when time-to-market is an important differentiator with competitors. If this means using third-party cloud services, prioritize capacity planning in your partnerships.  

As AI adoption accelerates, pressing issues like GPU access will differentiate leading innovators from followers. While infrastructure requirements will look different for every organization, the best transformation strategy is the one that starts now. 

Ready for the next step? Learn more about data security solutions for AI transformation. 

Subscribe card background
Subscribe
Subscribe to
the Shift!

Get emerging insights on emerging technology straight to your inbox.

Unlocking Multi-Cloud Security: Panoptica's Graph-Based Approach

Discover why security teams rely on Panoptica's graph-based technology to navigate and prioritize risks across multi-cloud landscapes, enhancing accuracy and resilience in safeguarding diverse ecosystems.

thumbnail
I
Subscribe
Subscribe to
the Shift
!
Get
emerging insights
on emerging technology straight to your inbox.

The Shift keeps you at the forefront of cloud native modern applications, application security, generative AI, quantum computing, and other groundbreaking innovations that are shaping the future of technology.

Outshift Background