How is AI changing datacenter network fabrics?



EXPLAINER AI workloads are overwhelming the traditional datacenter fabrics they run on. Here’s what’s replacing them.

The network has become the nervous system of any organization running AI at scale. A single distributed training run can chew through thousands of GPUs for weeks, and one congested uplink can slash throughput by more than 30 percent. Plenty of datacenter networks aren’t keeping up now that AI sits at the center of the workload.

The traffic itself has changed. China’s AI token consumption jumped from roughly 100 billion a day at the start of 2024 to more than 30 trillion a day by mid-2025. That’s a 300-fold rise in 18 months, by official count. Machines do more of the talking now, too: bots and agents make up 51 percent of internet traffic, outnumbering humans for the first time in a decade.

These pressures are forcing companies to change the way they think about datacenter network fabrics.

What is a network fabric?

A fabric is the dense mesh of switches and links that carries traffic between servers (east-west) and in and out of the datacenter (north-south). AI training traffic moves east-west: GPUs in a training cluster swap enormous volumes of data, and a single lossy link can stall the whole job. But AI inference generates new flows north-south and over the WAN.
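To make the east-west and north-south split concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that every server in the datacenter sits inside a single hypothetical prefix (10.0.0.0/8); real fabrics carve up address space in many ways, and the function name is made up for this example.

```python
import ipaddress

# Assumption for illustration: all in-datacenter servers live in this prefix.
DATACENTER_PREFIX = ipaddress.ip_network("10.0.0.0/8")


def flow_direction(src: str, dst: str) -> str:
    """Label a flow east-west if both ends are inside the datacenter,
    north-south if either end is outside it."""
    inside = [ipaddress.ip_address(addr) in DATACENTER_PREFIX for addr in (src, dst)]
    return "east-west" if all(inside) else "north-south"


if __name__ == "__main__":
    # GPU-to-GPU traffic inside the training cluster
    print(flow_direction("10.1.1.1", "10.2.2.2"))
    # An inference request arriving from outside the datacenter
    print(flow_direction("203.0.113.5", "10.1.1.1"))
```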

Why does automating it need a model of itself?

This is where most automation falls down. You can script a change onto a fabric, but a script has no idea how the parts relate, so it can fix one switch and quietly break its neighbor.

The answer is to give the network a model of itself. HPE Networking Apstra Datacenter Director, for instance, keeps a live graph of every device and the links and policies between them. Intent-based networking runs on top: an operator declares the outcome they want, and the system writes the configuration, checks it against the graph before anything ships, then continuously verifies that the running network still matches. A deterministic model can point at the real root cause instead of burying the operator in alarms.
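A toy version of the idea, sketched in Python, shows why the model matters. This is not how Apstra or any other product works internally; the topology, the intent (every leaf keeps at least two spine uplinks) and all names are assumptions made up for the example. The point is only that a change gets checked against a model of the whole fabric before it ships.

```python
from collections import defaultdict

# Hypothetical toy fabric: three leaves, each uplinked to two spines.
LINKS = {
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
    ("leaf3", "spine1"), ("leaf3", "spine2"),
}

# Assumed intent: every leaf must keep at least this many spine uplinks.
MIN_UPLINKS = 2


def uplinks_per_leaf(links):
    """Count spine uplinks for each leaf in a set of (leaf, spine) links."""
    counts = defaultdict(int)
    for leaf, _spine in links:
        counts[leaf] += 1
    return counts


def violations_after_removal(links, removed):
    """Return the leaves that would break the intent if `removed` links went down.

    The check runs against the model only; nothing is pushed to a device.
    """
    leaves = {leaf for leaf, _spine in links}
    counts = uplinks_per_leaf(links - set(removed))
    return sorted(leaf for leaf in leaves if counts[leaf] < MIN_UPLINKS)


if __name__ == "__main__":
    # Proposed change: take spine1 out for maintenance.
    spine1_links = [link for link in LINKS if link[1] == "spine1"]
    print(violations_after_removal(LINKS, spine1_links))
```

A script that only knew about spine1 would happily shut it down. The model catches that every leaf would drop below the declared minimum.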

How does the fabric catch trouble before users do?

Telemetry is only part of the solution. AIOps layers analysis on top of it, with the aim of improving the experience of both the network operator and the application end user. Instead of asking whether a switch is up, newer systems ask whether users are getting a good experience, then trace a slowdown to the exact port or optic behind it. HPE Mist Networking Datacenter Assurance scores fabric health on that basis. Operators can query the Marvis AI Assistant, which has evolved into a reasoning agent, as a simple way to interact with the network. Predictive models go earlier still: by watching voltage, temperature, laser readings and CRC error counts, they flag a failing optic before it drops a link. The operator hears about it first, not the user.
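As a rough illustration of the predictive step, the Python sketch below flags an optic from a short history of readings. Production systems use learned models rather than fixed limits; the thresholds, field names and the 3.3 V nominal supply here are all assumptions chosen for the example, not values from any vendor.

```python
from dataclasses import dataclass


@dataclass
class OpticSample:
    temperature_c: float   # module temperature
    voltage_v: float       # supply voltage
    rx_power_dbm: float    # received laser power
    crc_errors: int        # cumulative CRC error counter on the port


def risk_reasons(samples, max_temp_c=70.0, nominal_v=3.3, max_v_dev=0.2,
                 max_rx_drop_db=2.0, max_new_crc=100):
    """Compare the oldest and newest sample and list why an optic looks at risk.

    All thresholds are illustrative assumptions. An empty list means no flag.
    """
    if len(samples) < 2:
        return []  # not enough history to see a trend
    first, last = samples[0], samples[-1]
    reasons = []
    if last.temperature_c > max_temp_c:
        reasons.append("temperature")
    if abs(last.voltage_v - nominal_v) > max_v_dev:
        reasons.append("voltage")
    if first.rx_power_dbm - last.rx_power_dbm > max_rx_drop_db:
        reasons.append("rx_power")
    if last.crc_errors - first.crc_errors > max_new_crc:
        reasons.append("crc_errors")
    return reasons


if __name__ == "__main__":
    # The link is still up, but laser power is fading and CRC errors are climbing.
    history = [
        OpticSample(45.0, 3.3, -2.0, 0),
        OpticSample(46.0, 3.3, -3.5, 120),
        OpticSample(47.0, 3.3, -5.0, 500),
    ]
    print(risk_reasons(history))
```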

What about security?

None of this holds if security is an afterthought. AI pipelines move sensitive data east-west, between servers, where a perimeter firewall never looks. So segmentation moves inside the fabric. Workloads are walled off from each other and every flow is inspected, not just the ones crossing the edge.
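In code terms, segmentation inside the fabric amounts to a default-deny check applied to every flow, not only those crossing the edge. The Python sketch below assumes a hypothetical set of workload segments and an explicit allow-list; the segment names and the policy itself are invented for illustration.

```python
# Hypothetical allow-list of (source segment, destination segment) pairs.
# Anything not listed is denied, including east-west flows.
ALLOWED_FLOWS = {
    ("training", "storage"),
    ("inference", "model-registry"),
}


def is_allowed(src_segment: str, dst_segment: str) -> bool:
    """Default deny: a flow passes only if its segment pair is on the allow-list."""
    return (src_segment, dst_segment) in ALLOWED_FLOWS


def denied_flows(flows):
    """Inspect every flow and return the ones the policy blocks."""
    return [flow for flow in flows if not is_allowed(*flow)]


if __name__ == "__main__":
    observed = [
        ("training", "storage"),         # expected data-loading traffic
        ("inference", "training"),       # lateral movement between workloads
        ("inference", "model-registry"),
    ]
    print(denied_flows(observed))
```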

What does it mean for your team?

Get the rebuild right and fewer change windows die on a typo, while root-cause analysis stops being an archaeology dig. Network engineers don’t vanish, but they do move from herding switches in the CLI to designing the fabric, handling the exceptions automation throws up, and getting time back to work on the strategic initiatives that matter to the CIO.

The datacenter is the substrate the rest of the AI stack runs on; build it to run itself and everything above it gets easier.

Sponsored by HPE.


