Serge Lucio is the VP and GM of Agile Operations Division, Broadcom Inc.

AI models are rapidly increasing in complexity, demanding more powerful computing resources for effective training and inference. This trend has sparked significant interest in scaling computational capacity for AI, with teams exploring new hardware architectures and distributed computing strategies to extend the boundaries of what AI can achieve.

Beyond Moore’s Law

In 1965, Gordon Moore, who went on to cofound Intel, observed that the number of transistors on a microchip was doubling every year; in 1975, he revised the pace to about every two years. This principle, known as “Moore’s Law,” has held true for decades, with the industry continuing to produce new generations of faster, cheaper computing devices at a remarkably predictable pace.

Today, Moore’s Law continues to maintain its momentum, even as questions emerge about the physical limits of silicon-based semiconductor technology. The computing power needed for AI, however, doubles every six months, far outstripping the growth in chip capacity promised by Moore’s Law: over two years, demand grows 16-fold while transistor counts merely double. Distributed computing appears to be the only viable option for addressing the explosive growth in resources needed for training and deploying machine learning models. As a result, AI-driven demand has reignited interest in massively parallel infrastructures.
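To make the gap concrete, here is a minimal Python sketch that compounds the two doubling periods cited above (two years for transistor counts, six months for AI compute demand). The function name `growth_factor` and the chosen time horizons are illustrative assumptions, not figures from any benchmark.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return how many times a quantity multiplies after `years`
    if it doubles once every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


# Doubling periods as stated in the article text.
MOORE_DOUBLING_YEARS = 2.0  # transistors per chip
AI_DOUBLING_YEARS = 0.5     # compute needed for AI

# Illustrative horizons (an assumption made for this example).
for years in (2, 4, 6):
    supply = growth_factor(years, MOORE_DOUBLING_YEARS)
    demand = growth_factor(years, AI_DOUBLING_YEARS)
    print(
        f"{years} years: chip capacity x{supply:.0f}, "
        f"AI demand x{demand:.0f}, gap x{demand / supply:.0f}"
    )
```

Under these two assumptions alone, the shortfall widens eightfold every two years, which is the arithmetic behind the turn toward spreading work across many machines.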

Scaling AI Isn’t A Computing Problem…

Dedicated hardware, like GPUs (graphics processing units) and TPUs (tensor processing units), has become essential for training AI models. Meanwhile, distributed computing infrastructures are being developed to enable more efficient interconnection between computing nodes. As AI models become increasingly complex, the demand for scalable and effective computing resources will only continue to rise.

…It’s A Network Problem

Computational power is undeniably critical for training and executing AI models, particularly those employed for deep learning. However, to perform efficiently and effectively, AI systems are also highly reliant upon the underlying network architecture and infrastructure.

As AI continues to advance, the true bottleneck often lies in the network. When AI data and workloads are distributed across numerous nodes, they depend on robust networking to deliver the high bandwidth and low latency required. Innovations in networking technologies, such as high-speed interconnects and optimized communication protocols, can support the scale and speed necessary for modern AI applications.

The future of AI scalability hinges not just on more powerful processors but on the seamless integration of advanced networking infrastructure to enable efficient distributed computing. Here are some ways the network comes into play:

• Dataset Distribution: AI systems often require access to large volumes of data distributed across various sources. Efficient data distribution and access mechanisms, such as distributed storage systems and data caching, allow you to quickly feed data to AI models.

• Model Training: Training large AI models typically entails parallel processing across numerous computing nodes. It’s imperative to maintain efficient communication among these nodes to minimize overhead and ensure that the training process scales effectively, free from network latency or bandwidth limitations.

• Model Distribution And Inference: When AI models are deployed across widely distributed environments, such as edge devices or cloud servers, effective distribution mechanisms are key. Applications that require real-time inference, in areas like autonomous vehicles, industrial automation systems and edge devices, depend on low-latency networks for fast, AI-powered predictions and decision-making.
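As a rough illustration of why node-to-node communication matters during training, the sketch below estimates how long one gradient synchronization might take. It assumes a ring all-reduce, a common pattern for data-parallel training that this article does not itself name, and uses a simplified cost model (per-hop latency plus transfer time). The function name and every number passed to it are hypothetical.

```python
def ring_allreduce_seconds(
    payload_bytes: float,
    nodes: int,
    link_bytes_per_s: float,
    hop_latency_s: float,
) -> float:
    """Estimate one ring all-reduce under a simplified cost model.

    A ring all-reduce takes 2 * (nodes - 1) steps (reduce-scatter, then
    all-gather). In each step, every node sends one chunk of size
    payload / nodes to its neighbor.
    """
    if nodes < 2:
        return 0.0  # a single node has nothing to synchronize
    steps = 2 * (nodes - 1)
    chunk_bytes = payload_bytes / nodes
    return steps * (hop_latency_s + chunk_bytes / link_bytes_per_s)


# Hypothetical cluster: 10 GB of gradients shared across 64 nodes.
GRADIENT_BYTES = 10e9
NODES = 64
HOP_LATENCY_S = 5e-6

for gbit_per_s in (100, 400):
    link_bytes_per_s = gbit_per_s * 1e9 / 8  # convert Gbit/s to bytes/s
    seconds = ring_allreduce_seconds(
        GRADIENT_BYTES, NODES, link_bytes_per_s, HOP_LATENCY_S
    )
    print(f"{gbit_per_s} Gbit/s links: about {seconds:.2f} s per synchronization")
```

Because this synchronization repeats on every training step, time spent waiting on the network is time the processors sit idle. In this simplified model, bandwidth dominates for large payloads, while per-hop latency grows in importance as the node count rises.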

It’s Time For The NOC To Consider New Approaches

The demand for bandwidth isn’t new—network operations center (NOC) teams have long managed voice, video and other latency-sensitive applications. However, as organizations expand their AI models and applications, these teams must implement strategies to handle the new challenges that arise from the growing volumes of AI data moving from data centers to edge services and the cloud, such as:

• Rising Complexity: Teams struggle to gain visibility and correlation across multivendor, software-defined and physical networks.

• Increasing Alarm Noise: More network components generate more alarms, and with limited personnel, there aren’t enough experts available to efficiently triage and troubleshoot network issues.

• Externally Managed Networks: Traditional network management tools may not offer the end-to-end coverage required. For example, these tools often lack insights into the internet service provider (ISP) and cloud service provider (CSP) networks that the users’ experience is now reliant upon.

AI Is Both The Problem And The Fix

New methodologies and advanced tools can help address the network management challenges introduced by the widespread adoption of AI-enabled services. Ironically, AI itself is part of the answer: Gartner analysts forecast that by 2027, 90% of enterprises will use some AI functions to improve network operations. AI-enhanced network management solutions can provide simplified workflows and analytics that help the NOC team manage complex, multivendor environments, reduce alarm noise and gain end-to-end insights into network performance. To help ensure robust and reliable network operations and support the growing demand for modern AI applications and services, NOC teams will need to be AI-augmented.

When it comes to managing modern networks, AI undoubtedly holds promise—but the promise is only realized when advanced reasoning capabilities are effectively tailored to the networking domain. To be viable, solutions need to offer a combination of algorithms that leverage diverse intelligence, including faults, topology, configuration, performance, flows and network experience. With these capabilities, AI-enhanced solutions can deliver the insights that NOC teams need to enable the delivery of high-performance network connectivity across the digital infrastructure.
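To illustrate how combining two of these intelligence sources, faults and topology, can cut alarm noise, here is a minimal Python sketch. It is a simple rule-based heuristic, not a description of any vendor's product: an alarmed device counts as a likely root cause only if it sits next to a device the monitoring station can still reach, and every other alarm is treated as a downstream symptom. The device names, the `links` list and the function names are all hypothetical.

```python
from collections import deque


def build_topology(links):
    """Build an undirected adjacency map from (device, device) pairs."""
    topology = {}
    for a, b in links:
        topology.setdefault(a, set()).add(b)
        topology.setdefault(b, set()).add(a)
    return topology


def reachable(topology, monitor, alarmed):
    """Breadth-first search from the monitor, never entering alarmed devices."""
    if monitor in alarmed:
        return set()
    seen = {monitor}
    queue = deque([monitor])
    while queue:
        device = queue.popleft()
        for neighbor in topology.get(device, ()):
            if neighbor not in seen and neighbor not in alarmed:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def split_alarms(topology, monitor, alarmed):
    """Separate likely root causes from downstream symptom alarms."""
    healthy = reachable(topology, monitor, alarmed)
    root_causes = {
        device
        for device in alarmed
        if any(neighbor in healthy for neighbor in topology.get(device, ()))
    }
    return root_causes, set(alarmed) - root_causes


# Hypothetical network: one core switch, two distribution switches.
links = [
    ("noc", "core1"),
    ("core1", "dist1"), ("core1", "dist2"),
    ("dist1", "access1"), ("dist1", "access2"),
    ("dist2", "access3"),
]
topology = build_topology(links)
alarms = {"dist1", "access1", "access2"}

root_causes, symptoms = split_alarms(topology, "noc", alarms)
print("Investigate first:", sorted(root_causes))
print("Likely downstream symptoms:", sorted(symptoms))
```

In this example, three alarms collapse to one device worth investigating. Production tools would layer configuration, performance, flow and experience data on top of a rule like this, but the principle is the same: domain knowledge about the network is what turns raw alarms into an actionable signal.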

Drawing It All Together

Scaling AI isn’t merely a matter of increasing computational power—it’s fundamentally a network problem. Effective AI deployments require robust, high-speed networks that can handle the immense data traffic generated by model training and support the low latency demands of real-time applications. Because traditional network management tools are often insufficient to deal with the complexities introduced by industrializing AI, NOC teams will want to consider new solutions.

Indeed, AI-powered network observability solutions can address AI’s challenges, simplifying workflows and driving greater efficiency. However, to successfully integrate AI into their environments, NOC teams need capabilities that are tailored to the specific nature of networking technologies. This tight integration can support the growing demands of AI applications and unlock new innovation opportunities in the digital era.


Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.


2024 © Network Today. All Rights Reserved.