Transformation as Strategy: Smoothing the Surge: Powering the Next Generation of AI Data Centers

Show notes

AI data centers are redefining the limits of energy demand—and exposing new vulnerabilities in how power is generated, managed, and stabilized. As AI training workloads create highly variable, high-intensity power profiles, traditional infrastructure is struggling to keep pace.
In this episode, Spencer Gore explores why these “AI factories” behave fundamentally differently from conventional data centers, how their unique load patterns create challenges for utilities and on-site generation alike, and why energy storage is emerging as a critical enabler of growth. From rapid power ramping and checkpoint-driven fluctuations to evolving grid requirements and the rise of battery solutions, the conversation unpacks both the risks and the massive opportunities ahead.
For energy providers, battery innovators, and data center operators, the message is clear: solving power variability is no longer optional—it’s the key to unlocking scalable AI infrastructure.

Show transcript

00:00:06: Hello, everyone.

00:00:06: And welcome to

00:00:07: Transformation as Strategy, a podcast brought to you by the experts at Roland Berger Americas, where we explore fresh perspectives and practical strategies to help business owners create a lasting competitive edge.

00:00:18: I'm your host Mackenzie Puddesi.

00:00:20: The topic for this episode is energy storage at AI data centers.

00:00:24: Joining me for today's conversation is Spencer Gore, partner at Roland Berger.

00:00:28: Hello, Spencer. Welcome to the show.

00:00:30: Hey, Mackenzie. Thanks for having me.

00:00:33: Yeah,

00:00:33: no, my pleasure.

00:00:34: I'm excited to dig into this topic with you.

00:00:37: So Spencer, it seems like every day a new AI data center is announced, and I want to ask you: how are these AI factories different from traditional cloud data centers?

00:00:46: And how do these differences change the demand for power?

00:00:49: Absolutely. It's a great question.

00:00:51: So first of all, it's important to distinguish:

00:00:52: there are two different types of AI data centers.

00:00:56: We have facilities that are predominantly focused on training new models, and facilities focused on inference.

00:01:04: Inference is the service that, when you use ChatGPT, you're calling on some servers to perform, running a model that's already been trained.

00:01:12: So let's start with training.

00:01:14: The AI training workload is unlike anything else

00:01:17: that we've ever deployed at scale.

00:01:19: What's essentially happening when you train a new large language model is that you're taking hundreds of terabytes of data, breaking it apart into small chunks, and feeding it out to hundreds of thousands of GPUs, which then perform simultaneous calculations

00:01:33: to try to slightly tweak the weights in a model. What that looks like in practice is the GPUs going through a sort of breathing cycle.

00:01:42: So first they perform what's called a forward pass calculation, generating some predictions, and then they perform a backward pass, which computes some gradients on how to improve. And then they go through what's called an optimizer step, in which the gradients that were calculated are shared with all of the other GPUs.

00:01:59: They update their model weights and do it all over again.

00:02:04: That cycle can take somewhere between maybe a few seconds and a couple of minutes.

00:02:09: It really depends on the size. The facility's

00:02:13: power consumption changes dramatically during those different phases because, for example, during the backward pass,

00:02:22: this is a very computationally heavy, you know, tensor-math-filled step,

00:02:28: but during the optimizer step you're more memory bound.

00:02:32: And so if you actually watch the power consumption of the facility, you'll see it essentially follow kind of a sawtooth wave during training as it moves from one phase to the next.

00:02:41: Now, the duration of these phases, you know, changes depending on the model architecture.

00:02:46: But the short answer is that AI training creates highly synchronous, variable power loads.

00:02:53: And that's the source of the problem that batteries are being called on to solve. AI inference, on the other hand,

00:02:58: so once the model is already trained, looks quite a bit more like a traditional cloud computing

00:03:03: load. The GPUs are not all acting simultaneously. They're serving different prompts as they come in.

00:03:10: Instead, the variability is a little bit more like you would expect for traditional cloud services.

00:03:15: So there might be peaks in the middle of the day, when everybody is logged into Claude Code, and then, you know, valleys overnight.

00:03:22: And so these are very different types of loads that ultimately drive different types of power challenges at the data center, and then correspondingly have different requirements for energy storage.
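
As a rough illustration of the training "breathing cycle" described above, the following Python sketch builds a stepped, repeating facility power trace from the three phases (forward pass, backward pass, optimizer step). The phase durations and power levels are hypothetical placeholders chosen only to show the shape; none of them are figures from the episode.

```python
# Hypothetical phase durations (seconds) and facility power levels (MW).
# These numbers are illustrative assumptions, not measurements.
PHASES = [
    ("forward", 2.0, 800.0),
    ("backward", 4.0, 1000.0),   # compute-heavy phase: highest draw
    ("optimizer", 1.0, 400.0),   # memory/communication-bound phase: lowest draw
]

def power_trace(n_steps, dt=0.5):
    """Return (time_s, power_mw) samples for n_steps training iterations."""
    samples = []
    t = 0.0
    for _ in range(n_steps):
        for _name, duration, power_mw in PHASES:
            for _ in range(round(duration / dt)):
                samples.append((t, power_mw))
                t += dt
    return samples

trace = power_trace(n_steps=3)
powers = [p for _, p in trace]
swing_mw = max(powers) - min(powers)
print(f"{len(trace)} samples, swing of {swing_mw:.0f} MW per iteration")
```

Because every GPU moves through the phases at the same time, the swing shows up at the facility level rather than averaging out, which is the contrast with inference drawn in the conversation.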

00:03:35: Spencer, perfect. I think you really laid that out clearly, and I love even just visualizing those graphs in my head of how the power is being consumed.

00:03:43: Moving on, it sounds like there's inherent variability in these workloads.

00:03:47: So beyond that, are there other power challenges faced by training data centers?

00:03:53: Absolutely.

00:03:54: So, there's the source of variability that we just discussed, which is that every time there's a forward or backward pass over the model weights, the data center's power consumption changes. But there are also effects that occur on longer timescales. For example, usually at the end of every training epoch, which might happen every couple of hours,

00:04:17: traditionally the model saves its most recent weights to disk.

00:04:22: This is called a checkpoint, and during that checkpoint, when you're transferring sometimes hundreds of terabytes to disk, the GPUs might be inactive, just waiting for the next epoch to start.

00:04:36: Now, this isn't all data centers, but some of them are architected this way.

00:04:41: And if the data center's power consumption drops from, say, one gigawatt to four hundred megawatts for a period of several minutes as that checkpoint takes place, well, that's a loss of several hundred megawatts

00:04:56: that has either been passed through to the grid, which, we can get into why

00:05:01: that's a problem in just a little bit, or it's been passed through to on-site generation, which can also create other problems.

00:05:07: So there's essentially an infinite ramp rate, a negative ramp rate, that can occur during checkpoints.

00:05:15: On the flip side, the reason that you're saving these checkpoints is because when you have large numbers of GPUs, the statistical likelihood of a hardware failure is high.

00:05:27: And so whenever a single GPU crashes during a training run, you need to stop, reload the most recent checkpoint, and continue on.

00:05:37: It might take ten minutes or more.

00:05:41: And during that time, again, the data center's power consumption has fallen below its steady-state average, which can lead to problems with the ramp rate requirements you have agreed on.

00:05:52: So checkpointing, both actively saving and restoring, can create challenges.

00:05:57: On top of that is the challenge of

00:05:58: just starting a training run.

00:06:01: So if you're initializing from idle, the rate at which you can ramp up your draw from the utility is only so fast.

00:06:11: Again, this is usually prescribed in your interconnection agreement.
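
To put rough numbers on the checkpoint discussion above, here is a small Python sketch under stated assumptions. The one gigawatt to four hundred megawatt drop comes from the conversation; the five-minute checkpoint duration, the GPU count of 100,000, and the per-GPU failure probability are hypothetical values, and the failure model assumes GPUs fail independently.

```python
def checkpoint_surplus_mwh(steady_mw, checkpoint_mw, minutes):
    """Energy the facility does not consume during a checkpoint dip, in MWh."""
    return (steady_mw - checkpoint_mw) * minutes / 60.0

def prob_any_failure(n_gpus, per_gpu_failure_prob):
    """Probability that at least one of n independent GPUs fails in an interval."""
    return 1.0 - (1.0 - per_gpu_failure_prob) ** n_gpus

# 1 GW -> 400 MW is the example from the episode; 5 minutes is an assumption
# standing in for "several minutes".
surplus = checkpoint_surplus_mwh(1000.0, 400.0, minutes=5.0)

# Hypothetical: 100,000 GPUs, each with a 1-in-100,000 chance of failing
# during the interval of interest.
p_fail = prob_any_failure(100_000, per_gpu_failure_prob=1e-5)

print(f"Energy swing per checkpoint: {surplus:.1f} MWh")
print(f"Chance of at least one GPU failure: {p_fail:.1%}")
```

Even a very small per-device failure rate compounds quickly at this scale, which is why checkpoints are taken despite the power dip they cause.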

00:06:13: So Spencer, other than batteries and energy storage,

00:06:16: what countermeasures are available to data center operators today?

00:06:20: So we have a small collection of countermeasures today that are somewhat effective but come with real compromises.

00:06:26: They involve either reducing the performance of the GPU during ramp-up and ramp-down periods, essentially modulating the clock speed to smooth power consumption.

00:06:35: This slows down the training run.

00:06:37: Or you can create effectively fake work during the downtime, multiplying zero times zero a billion times, to keep the GPU's power consumption consistently high.

00:06:47: The problem with that countermeasure is that it increases the power consumption of the facility materially and also makes the GPUs run hotter, which shortens their lifetime.

00:06:55: So the best possible solution here is an energy storage device that can simply hold on to the energy when the GPU doesn't need it and then dispatch it at the right time.

00:07:06: And that really is why we're talking about batteries.
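
A minimal sketch of that hold-and-dispatch idea, assuming a very simple control rule: the upstream source (grid or on-site generation) supplies a trailing moving average of the load, and the battery covers the difference. The moving-average rule, the window length, and the load values are illustrative assumptions, not a description of any real control system.

```python
def smooth_with_battery(load_mw, window):
    """Split a load trace into a smoothed source draw and a battery power.

    battery_mw > 0 means the battery is discharging into the load;
    battery_mw < 0 means it is charging from the source.
    """
    source_mw, battery_mw = [], []
    for i, load in enumerate(load_mw):
        recent = load_mw[max(0, i - window + 1): i + 1]
        target = sum(recent) / len(recent)   # trailing moving average
        source_mw.append(target)
        battery_mw.append(load - target)
    return source_mw, battery_mw

# Hypothetical repeating load: two high samples, one low sample, in MW.
load = [1000.0, 1000.0, 400.0, 1000.0, 1000.0, 400.0] * 4
source, battery = smooth_with_battery(load, window=6)

print("load swing:  ", max(load) - min(load), "MW")
print("source swing:", max(source[5:]) - min(source[5:]), "MW after warm-up")
```

In this toy case the window spans whole cycles, so the source sees a flat draw once the average has warmed up, while the battery charges and discharges by equal amounts over each cycle. Unlike clock throttling or fake work, nothing is slowed down and no extra energy is burned, apart from battery losses that the sketch ignores.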

00:07:10: That's fascinating, and a real problem. It sounds like this variability is sort of fundamental to AI training.

00:07:16: So, you know, we've got some things now we can do,

00:07:19: but what do you expect for these load swings?

00:07:23: Are they going to get better or worse in the future?

00:07:26: As we look at it, there are forces that are driving the trends in both directions.

00:07:30: So on the one hand, we have models that are getting larger, with greater parameter counts, and this tends to increase the duration of both the compute-bound and communication-bound periods, which means essentially the time constants will stretch out. On the other hand, as GPUs continue to get faster and memory bandwidth increases,

00:07:50: we'll see these periods start to shorten. And so there's a bit of a tension here between hardware and software improvements that will ultimately determine what these power loads are going to look like.

00:07:59: At the same time, we're seeing improvements in model architecture.

00:08:04: We're predominantly training mixture-of-experts models today, right?

00:08:08: And those present somewhat different power traces back to the facility, depending on the parallelization technique that's being employed.

00:08:14: Finally, as we start to become more intelligent about how we train these models, for example performing the optimizer step during the backward pass and therefore skipping the idle time completely, or having checkpointless training, some of these problems may actually go away, which can have very significant benefits for the overall simplicity of the power architecture.

00:08:43: Spencer, I think you've done a really good job giving us a sense of:

00:08:46: here

00:08:47: are GPUs,

00:08:48: here's how training is working,

00:08:50: here's how energy consumption and storage is working today, and where it's

00:08:54: going.

00:08:55: That gives us a good sense of what's going on in these data centers and these factories.

00:09:01: I'm curious to know: how are utilities responding to these new variable workloads?

00:09:07: Yeah.

00:09:07: Well, it's a great question and we're just starting to see it happen now.

00:09:11: For many years now, data centers have been getting a free ride, exporting their power variability challenges back to the grid, which has a very significant but finite amount of inertia to handle

00:09:22: that variability. We're starting to see utilities insist on a few things.

00:09:28: First of all, they're beginning to impose ramp rate requirements.

00:09:31: So in order to achieve an interconnection, you cannot ramp more than ten megawatts, twenty megawatts, or thirty megawatts per minute, for example.

00:09:39: Contrast this with what is physically possible in a data center, in which you can ramp a gigawatt in a microsecond.

00:09:46: So that's the first driver of the need for energy storage: ramp rate requirements.

00:09:51: Second, limitations around passing back low-frequency ripple. These are the oscillations that we were discussing before, every five or ten seconds switching from maximum to minimum power at a data center.

00:10:09: It turns out that not only can these resonate with mechanical rotating machines, generators that are attached to the grid, but they can also cause interzonal oscillations, so frequency and voltage excursions happening due to resonance along long transmission lines between two zones in the grid.

00:10:28: So passing this back to the grid above a certain amount can actually destabilize entire zones, and is not being allowed under interconnection agreements.

00:10:39: We're also seeing increasing requirements for what's called fault ride-through. So when low frequency or low voltage is seen on the grid,

00:10:49: traditionally a data center would trip, run off of its battery backup, and then wait for grid power to be restored to its normal voltage and frequency, and then it would reattach itself.

00:10:58: The problem is that if you have multiple gigawatts of data center load all attached at the same time, and they trip simultaneously because of a grid fault, then that will actually cause a cascading destabilization of the grid, as it can't react to that much load loss instantaneously.

00:11:15: And so data centers are increasingly being asked to ride through these grid faults. For example, if the voltage drops ten percent, they have to maintain ninety percent of their load on the grid.

00:11:26: And this is another place in which battery storage could be helpful, because it can absorb, you know, ninety percent of the data center's load

00:11:34: if the data center has decided to trip over to backup. So there are a number of ways that utilities are starting to get smart and asking for more from data centers looking to interconnect.
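
The ramp rate requirement above lends itself to a quick back-of-the-envelope calculation. The Python sketch below uses the ten, twenty, and thirty megawatts per minute limits and the one gigawatt step mentioned in the conversation. The linear-ramp model, in which the load jumps instantly while grid draw climbs at exactly the limit, is a simplifying assumption, as is the idea that a battery covers the whole gap.

```python
def ramp_minutes(step_mw, limit_mw_per_min):
    """Minutes needed to change grid draw by step_mw at a ramp-rate limit."""
    return abs(step_mw) / limit_mw_per_min

def battery_energy_for_step_mwh(step_mw, limit_mw_per_min):
    """Energy a battery supplies while the grid ramps linearly to a load step.

    The load jumps by step_mw at once and grid draw rises at the limit, so the
    gap between them is a triangle: 0.5 * step * ramp_time.
    """
    hours = ramp_minutes(step_mw, limit_mw_per_min) / 60.0
    return 0.5 * abs(step_mw) * hours

STEP_MW = 1000.0  # the one-gigawatt swing used as an example in the episode
for limit in (10.0, 20.0, 30.0):
    minutes = ramp_minutes(STEP_MW, limit)
    energy = battery_energy_for_step_mwh(STEP_MW, limit)
    print(f"{limit:4.0f} MW/min: {minutes:6.1f} min to ramp, {energy:6.1f} MWh bridged")
```

Under these assumptions, a full one gigawatt step at ten megawatts per minute takes 100 minutes on the grid side, which is the mismatch between what the data center can physically do and what the interconnection allows.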

00:11:46: Thanks for clarifying, Spencer.

00:11:47: I feel like it's actually an interesting new challenge in the industry.

00:11:51: We hear about data centers bringing their own power.

00:11:54: Does this change if they have on-site generation to skip the utility completely?

00:12:01: When facilities choose to bring their own power, they expose themselves to other problems.

00:12:09: When these variable AI loads are passed back either to gas turbines or reciprocating engines, oftentimes they can excite resonance in the actual spinning machines themselves, which could cause mechanical failure in shafts or bearings.

00:12:23: Or, for example, on a reciprocating engine, where perhaps the response time is a little bit faster,

00:12:28: just the action of the governor trying to seek a different operating point every thirty to forty seconds can cause massive engine-hour accumulation.

00:12:37: So for those engines which are able to track that kind of variable load, there's significant fatigue.

00:12:44: And then there are certain power generation setups that simply can't track the load.

00:12:47: For example, a fuel cell, which has a time constant of ten minutes or longer even when hot.

00:12:54: And so battery storage is really essential to perform that load-smoothing function between the data center and prime power.

00:13:01: It actually serves more of a smoothing role when you integrate renewables on site as well.
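
To give a feel for why a slow prime-power source needs a buffer, the Python sketch below models the generator as a first-order lag with the ten-minute time constant mentioned for fuel cells. The first-order model and the 600 MW load step are assumptions made for illustration; real generator dynamics are more complicated. Under this model, the energy a battery must bridge after a load step is simply the step size times the time constant.

```python
import math

def lag_gap_mwh(step_mw, tau_min):
    """Closed form: integral of step * exp(-t / tau) over all t, in MWh."""
    return step_mw * tau_min / 60.0

def lag_gap_numeric_mwh(step_mw, tau_min, dt_min=0.01, horizon_taus=20):
    """The same quantity by midpoint integration, as a sanity check."""
    n = round(horizon_taus * tau_min / dt_min)
    total_mw_min = 0.0
    for k in range(n):
        t = (k + 0.5) * dt_min
        total_mw_min += step_mw * math.exp(-t / tau_min) * dt_min
    return total_mw_min / 60.0

# Hypothetical 600 MW load step against a source with a 10-minute time constant.
closed_form = lag_gap_mwh(600.0, 10.0)
numeric = lag_gap_numeric_mwh(600.0, 10.0)
print(f"closed form: {closed_form:.1f} MWh, numeric: {numeric:.1f} MWh")
```

The same reasoning applies in reverse when the load drops: the source keeps producing for a while, and the battery has to absorb the excess.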

00:13:07: So given that battery storage is definitely going to be an essential part of the future AI data center build-outs, how much storage are we talking about and where does it go?

00:13:17: Yeah.

00:13:18: It's a really important question.

00:13:20: And the truth is there's not just one answer, because there isn't only one type of data center in this ecosystem.

00:13:26: We have training data centers and inference data centers, some of them grid connected.

00:13:32: Some of them have their own on-site generation.

00:13:34: Some are incorporating renewables, some are not.

00:13:37: Some are grid connected and also have on-site prime power. And they all have different reliability requirements and different power distribution architectures, especially as we start to look at the early transition from AC distribution inside the data hall to DC distribution.

00:13:52: So the right amount of storage depends on a lot of factors: where the prime power is coming from, the reliability requirements, and then also the types of restrictions related to ramping or intermittency that you need to accept in order to interconnect with the power source.

00:14:12: But in general there's a spectrum here. There are some data centers which will be able to operate with a minimal amount of storage, think perhaps five to ten minutes' worth.

00:14:24: It's just about the same amount that they've traditionally had in a UPS system, but that storage will have to be much more rate capable, and it will have to be able to cycle many more times.

00:14:36: There are other data centers which, for reasons of integrating renewables or connecting on a flexible interconnection, may bring hours' worth of backup battery.

00:14:50: And in fact, we see now some data centers which are bringing enough battery that they don't even need to have on-site diesel generators at all, and this actually speeds up the construction of the data center.

00:15:00: They don't have to wait for scarce backup diesel gensets, and they don't have to apply for air permits.

00:15:06: So there are lots of benefits to bringing hours' worth of battery storage when the site can permit it.
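
The spectrum described above can be turned into rough energy figures by multiplying facility power by storage duration. The Python sketch below does this for a one-gigawatt facility, the example size used earlier in the conversation. The five and ten minute cases come from the discussion, while the two and four hour cases are hypothetical stand-ins for "hours' worth".

```python
def storage_mwh(facility_mw, minutes):
    """Energy needed to carry the full facility load for a given duration."""
    return facility_mw * minutes / 60.0

FACILITY_MW = 1000.0  # 1 GW example facility

# The 2 h and 4 h cases are hypothetical stand-ins for "hours' worth".
for label, minutes in [("5 min", 5), ("10 min", 10), ("2 h", 120), ("4 h", 240)]:
    print(f"{label:>6}: {storage_mwh(FACILITY_MW, minutes):8.1f} MWh")
```

This is only a sizing of energy capacity. It says nothing about the power rating or cycle life, which the conversation identifies as the harder requirements at the short-duration end of the spectrum.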

00:15:13: Fascinating.

00:15:15: So I want to ask you a question I'm sure is on many people's minds.

00:15:19: What would your advice be for battery companies looking to target the emerging market for AI data centers?

00:15:25: Yeah, so I talk to a lot of battery companies that are placing significantly more attention on this market now than they were maybe six months ago, and it's not hard to understand why, because it's growing more quickly than any other battery segment, and individual purchase orders can make a company's IPO.

00:15:42: An individual facility can have gigawatt hours of storage.

00:15:46: It's a highly competitive market. Some of the best battery players are setting their attention on this market, and so it's important to understand that.

00:15:57: But what's exciting is that there are many niches to fill, because no two data centers are exactly alike.

00:16:05: They all face slightly different power challenges.

00:16:13: Making a bespoke product for the right customer is an opportunity in this space, unlike in many others.

00:16:19: There are many different types of opportunities.

00:16:22: There's opportunity for energy-dense backup that looks more like perhaps what was in a traditional UPS.

00:16:28: There's opportunity for battery technology with extremely long cycle life that can be deployed to help perform the smoothing function.

00:16:37: There are opportunities for very low-cost battery storage to help speed up time-to-power when interconnecting with renewables.

00:16:44: And then everything in between, so there is a tremendous amount of opportunity.

00:16:48: But I would say first, know your customer, understand the requirements, and find your niche.

00:16:58: That's where we've been fortunate to do a lot of work with players in this space.

00:17:02: If you're thinking of targeting the data center market, I absolutely encourage you to reach out to us, reach out for a discussion on how you can do it.

00:17:10: Spencer, I think that is great.

00:17:12: We've dug into some very real challenges here but also discussed some interesting opportunities, and so I think that's the perfect place to wrap up this conversation today.

00:17:25: It has been really good to learn from your insights and strategies to help business owners navigate this ever-evolving landscape.

00:17:33: So thanks again

00:17:34: so much for your time.

00:17:36: Yeah, thanks.

00:17:37: It was great to be here.

00:17:39: Absolutely. And for our listeners out there,

00:17:41: if you'd like to learn more about what Spencer and the team at Roland Berger Americas are working on, visit www.rolandberger.com. And if you're a fan of this show, please share it with your colleagues.
