Where are Full-stack ML Systems Going?
For a roundtable on how AI hardware, software, and infrastructure will evolve in the years to come, we brought together two current SPC members – Bill Chang, who built Tesla’s Dojo supercomputer, and Ravi Jain, who led strategy & business for Krutrim – and Max Ryabinin, a researcher at Together AI.
Gopal: There are a bunch of interesting ways all three of your backgrounds overlap, from large-scale training to BigScience to multilingual LLMs. To start with you, Bill – Tesla's model of both training and deploying large models is quite unusual. Most people don't have massive supercomputers like the Dojo cluster and also don't get to run inference in cars. Can you give us the 10,000-foot view of how it all works? How has it changed?
Bill: At a very high level, Dojo was created when things were fairly new. At that time, the state of the art was convolutional neural networks (CNNs). LLMs didn't exist. From first principles, we set out to build the best training platform we could. There are two parts to it – one is the architecture and the other is the technology. The Dojo team is trying to build the best — and I encompass "best" in a lot of ways — computer to do training at a large scale.
Dojo fits into the larger picture as a different training architecture. It's similar to trying to port your software from, let's say, an NVIDIA platform to a TPU, or to a Groq chip, or something like that. It has a very different compute architecture. It relies on a compiler to translate something done in PyTorch down into the hardware. And it is fairly large scale — maybe not on the order of, say, Colossus from xAI — but very large in the grand scheme of things.
It excels at vision-based workloads, which is what it was created for. If you look at Tesla's pipeline for training as a whole, it's quite large – and Dojo fits well into that. That team is still pushing really hard in terms of continuing to build the best kind of training computer you possibly can, from a cost, performance, and power perspective.
Dojo was built to be general-purpose, so now, with different architectures coming out, you can reconfigure it and tune it to a different workload. That's my expectation: you'll see the building blocks remain the same, and then we'll reconfigure it to maybe support transformers or something else in between. That could mean configuring a version that might have more memory, changing that ratio, or something like that.
Gopal: Max, given your research at Together, what are the most interesting questions – over the next 3-6 months – that you think will be top-of-mind?
Max: I am still quite interested in seeing how the general understanding of LLM evaluation evolves over time — regardless of whether people are training their own models or trying some additional improvements on top of them. We've done some research in this area earlier this year on understanding the robustness of the latest models to minor variations in the prompt. That's just one example of the phenomena we can still study, even in today's most powerful models.
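As a rough illustration of the kind of robustness study Max describes, the sketch below asks a model the same question under several prompt variants and scores how often the answers agree. `query_model` is a hypothetical stub standing in for any chat-completion client; this is not Together's evaluation harness.

```python
# Sketch: robustness of a model to minor prompt variations.
from collections import Counter

def query_model(prompt: str) -> str:
    """Stub standing in for an LLM API call; returns a canned answer here."""
    return "Canberra"

def agreement_score(question: str, templates: list[str]) -> float:
    """Fraction of prompt variants whose answer matches the majority answer."""
    answers = [query_model(t.format(question=question)) for t in templates]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)

templates = [
    "Q: {question}\nA:",
    "Answer the following question. {question}",
    "{question} Respond with only the answer.",
]
print(agreement_score("What is the capital of Australia?", templates))  # 1.0 with the stub
```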
Another direction is training models across heterogeneous or "disaggregated" hardware: there are a lot of research topics here that are not yet fully explored, to my understanding, because previously there were just a couple of research groups in the world studying them. Right now there are a few more — maybe not two, but four or five — still quite a small number.
We're seeing increased interest in cross-data-center and decentralized training (though maybe in a less "collaborative" sense than our prior research), which comes with a lot of challenges that are important both in HPC settings and outside of them.
There are lots of questions around the communication efficiency of distributed training procedures, about the algorithmic efficiency of training—which is something that I believe Yaroslav [another SPC member] is also quite excited about—and lots of other questions related to the topic of whether we can actually keep scaling training within large data centers if we want to obtain the best and the largest possible models.
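To make the communication-efficiency question concrete, here is a back-of-the-envelope sketch of gradient synchronization cost. The numbers (a 70B-parameter model, fp16 gradients, a ring all-reduce, and the two bandwidth figures) are illustrative assumptions, not measurements from any system discussed here.

```python
# Rough cost of synchronizing gradients each optimizer step with a ring all-reduce.
def allreduce_seconds(params: float, workers: int, bandwidth_GBps: float,
                      bytes_per_grad: int = 2) -> float:
    """Seconds per full-gradient ring all-reduce, ignoring latency and overlap."""
    payload_GB = params * bytes_per_grad / 1e9
    # Each worker sends and receives roughly 2 * (N - 1) / N of the payload.
    traffic_GB = payload_GB * 2 * (workers - 1) / workers
    return traffic_GB / bandwidth_GBps

# Assumed 100 GB/s intra-cluster links vs. an assumed 1 GB/s link between distant sites.
print(allreduce_seconds(70e9, 64, 100))  # ~2.8 s per step inside one cluster
print(allreduce_seconds(70e9, 64, 1))    # ~4.6 minutes per step across sites
```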
Gopal: What is the garden of forking paths when it comes to the actual workloads that future data centers enable?
Maybe there's one world where things keep going the way they have been. There's another world where things diverge into many different ML data centers and clouds.
Ravi: Obviously, there will be the usual purely CPU-based workloads that exist today — web, streaming data, data compression and so forth. But in our analysis, our assumption was that there will be two or three dominant workloads that are more AI-specific. For example, multimodal models that will be used across robotics and LLMs. There will also be many HPC workloads that continue to become more sophisticated: climate and weather modeling, predictions, materials-science simulations, and so on.
In fact, the hypothesis would be that, on the model side, there will be a bimodal distribution of models. Maybe one very large, envelope-pushing model like what OpenAI and others are working on, and then a bunch of small models used by enterprises for their specific vertical workloads.
The nature of cloud infrastructure to be built is all software-defined or model-defined, in a way. That is, what kinds of workloads will exist, and, for each of these workloads, what kind of infrastructure is needed? And, is there a way to create some sort of optimal lowest common denominator across this distribution? Is there a way to layer this with more specialized compute for the niche/long tail of workloads?
Our thought process [at Krutrim], when we were researching the cloud infrastructure of the future, was tied to predicting what the future workloads would be. Some workloads are very compute-intensive, some are constrained by interconnect speed, and some require very large scale-out, where GPU no. 1 has to talk seamlessly to GPU no. 1,000. All of those considerations point to different types of architectures on the datacenter side.
If you look at the large multimodal LLMs, you automatically gravitate towards a Blackwell-type architecture, where all the racks are interconnected and, for the software engineer, it is one system. The software sees all of those thousands of GPUs as one system seamlessly talking to each other. This is how a workload completely redefines the architecture on the cloud side.
Then there are considerations on the energy side — how to make this efficient so it’s not too power-consuming and remains economically viable, especially in the developing economy context. Multiple factors come into play, but it might boil down to: what is the workload distribution you expect the datacenter to serve and how can you do so economically?
That was the way Krutrim was thinking about it, which is different from, say, Dojo, because the predominant workload there would be the autonomous training stack, which is not as general as when you're building a public cloud. Does that make sense?
Max: Definitely does. I actually had very similar thoughts on how data centers are constructed in light of the future workloads we have to launch. From a historical perspective, there's always been this interesting interplay between how software develops and how it defines the hardware it needs, and in the other direction, how the hardware available shapes the workloads best suited to it.
One of the best examples is the entire deep learning revolution. We wouldn’t have had such an explosion of different approaches in deep learning if not for the availability of devices that could do matrix multiplications quickly. Going forward, I think we’ll see increased cooperation between these two fields.
One direction is algorithmic optimizations that consider system constraints. For example, FlashAttention does that. Another is about taking existing computational problems and thinking about hardware designs that can address those problems in the most efficient, maybe even energy-efficient, way possible.
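As a concrete illustration of that kind of system-aware optimization: naive attention materializes the full sequence-by-sequence score matrix, while FlashAttention-style fused kernels tile the computation to fit in fast on-chip memory. The sketch below uses PyTorch's `scaled_dot_product_attention`, which can dispatch to such fused kernels; it is a minimal demonstration under assumed shapes, not code from any of the systems discussed.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes a (seq x seq) score matrix: O(N^2) memory traffic.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 2048, 64)              # batch, heads, seq, head_dim
out_naive = naive_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)  # tiled, memory-aware path
print(torch.allclose(out_naive, out_fused, atol=1e-4))  # same math, less memory
```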
A good point you made, Ravi, is about device interconnect. This isn’t something you always need. For example, if you just have a small model, then maybe a tiny, cheap GPU or an accelerator without interconnect could be fine. But if you want something truly large-scale, then you need those interconnects.
Ravi: To add to this, I think about training versus inference. Inference of even very large models does not require – like you said – very high-speed interconnect. A few nodes of GPUs can handle inference for a large model. But the moment you think about creating a cloud for both training and inference, it’s different. Training will have very different interconnect requirements.
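The arithmetic behind "a few nodes can handle inference for a large model" is roughly the following. The GPU counts, memory sizes, and overhead factor below are assumptions chosen for illustration, not a description of Krutrim's or anyone else's hardware.

```python
import math

def nodes_needed(params_billions: float, bytes_per_param: float = 2,
                 gpus_per_node: int = 8, hbm_per_gpu_GB: int = 80,
                 overhead: float = 1.3) -> int:
    """Nodes needed to hold the weights (KV cache and activations folded into `overhead`)."""
    weights_GB = params_billions * bytes_per_param
    node_GB = gpus_per_node * hbm_per_gpu_GB
    return math.ceil(weights_GB * overhead / node_GB)

print(nodes_needed(70))     # 1: a 70B model in fp16 fits on one 8-GPU node
print(nodes_needed(1000))   # 5: even ~1T parameters is only a handful of nodes
```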
If you look at Groq, for example, they’re laser-focused on inference. They’re a very good example of a software-defined inference architecture, and they can manage very good latency on somewhat older nodes compared to some of the most advanced ones. These are examples where you define your workload first and then go back to fundamentals to design the best architecture for it.
Bill: Yeah, to jump in: I think about things from both a software and a hardware level. There are many possible paths, but, fundamentally, AI has finally changed the workload and how things run on the software side.
You can see two things happening: 1) the state of the art continues to grow, expanding to larger and larger workloads, and 2) there is a continued effort to build more and more efficient models — to get the performance of state-of-the-art models in smaller, more efficient forms. How successful these two efforts are will dictate where data centers ultimately go. But in either case, the fundamental shift is in the workload versus the hardware.
If you look at how clouds work today, you can build a data center that virtualizes the workload because workloads can be partitioned and moved around easily, making them highly resilient and flexible. But as soon as the software and workload don’t fit onto a single node, virtualization becomes very difficult. You end up needing very large coherent systems, making it challenging to build a cloud on which you don’t care where things run.
For example, let’s say state-of-the-art inference still requires multiple nodes. Building large coherent systems, both at inference and training levels, is challenging. To get to a point where you have a multi-user, virtualized cloud that’s cheap, scalable, and reliable, you have a couple of options: 1) scale down, and try to bring as much as possible into a smaller piece, so the model fits into the size of a single node or 2) build technology that makes a node much more capable so multiple things can fit in it efficiently.
These approaches intersect with the push for more efficient models. Meanwhile, the state of the art is just going to continue growing as big as power and resources allow. This means data centers have to change.
We've already seen it: average rack power for CPUs used to be 10-15 kW, and now GPU racks are 50 kW or more. Air-cooled data centers become harder to build: you move from 10 MW data centers to 100 MW, and now you hear talk about gigawatt data centers. Everything's fundamentally changing.
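For a sense of scale, the quick arithmetic below converts those power figures into rack counts. It ignores cooling and power-distribution overhead, and the 120 kW figure for the densest GPU racks is an assumption added for illustration.

```python
def racks_supported(facility_MW: float, kW_per_rack: float) -> int:
    """How many racks a facility's power envelope supports, overhead ignored."""
    return int(facility_MW * 1000 / kW_per_rack)

print(racks_supported(10, 12.5))    # ~800 CPU racks in a 10 MW facility
print(racks_supported(100, 50))     # 2,000 GPU racks in a 100 MW facility
print(racks_supported(1000, 120))   # ~8,300 dense GPU racks at gigawatt scale
```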
When you factor in multiple tenants with different requirements, you need reconfigurable network architectures. So it all looks very different from the ground up. But it’s possible we might see a split: if efficient models really take off, clouds will build for that and not necessarily chase the absolute state-of-the-art. That will look fundamentally different from a data center built to host massive, state-of-the-art models.
Gopal: What do you mean by “efficient models” in this context?
Bill: Let's say, for example, someone can get a 7B-parameter model to reach the capabilities of a trillion-parameter model, with agents and reasoning added on top. They pay for inference time and compute rather than for compute hardware, so to speak. And that's another way to add multi-tenancy and keep your reliability up (or "blast radius" down), right? If one node goes down, you don't want to take a thousand users down. You want to maybe only affect one user, or be able to move users to another node or something like that. Building a data center like that looks very different from building a state-of-the-art data center to run and train a trillion-parameter model.
Gopal: In a world where, say, five years from now, we see a dominant paradigm of efficient models with lots of inference-time compute – is that a meaningful change in how people think about computing?
Max: I'm happy to be proven wrong here, but to the best of my understanding, many inference-time compute techniques we know of, like chain-of-thought prompting, aren’t materially different from sampling longer predictions from the model with a particular prompt. This is obviously an oversimplification, but at the core, the problem feels very similar.
There are definitely interesting system optimizations to consider because you can include different steps in the inference pipeline. You could ask a model to operate as a sort of an agent, which might require it to do other things during the response, and there are more advanced techniques than just chain-of-thought that people are investigating. But at least in terms of the general pipeline of what gets done, I don’t see anything drastically different.
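Max's point, in miniature: from the serving system's perspective, chain-of-thought is mostly the same completion call with a different prompt and a larger output budget. `generate` below is a stub standing in for any completion API; nothing here is a specific provider's interface.

```python
def generate(prompt: str, max_tokens: int) -> str:
    """Stub for an LLM completion call; a real client would go here."""
    return f"<up to {max_tokens} tokens of model output for: {prompt[:40]}...>"

question = "A train leaves at 3pm going 60 km/h. How far has it gone by 5:30pm?"

direct = generate(f"{question}\nAnswer:", max_tokens=16)
cot = generate(f"{question}\nLet's think step by step.", max_tokens=512)
# Same model, same pipeline: the extra "inference-time compute" is mostly just
# more decoded tokens (and perhaps several samples plus a vote) per request.
print(direct)
print(cot)
```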
Bill: No, just to add to that: from a system standpoint, if we trade off model size with additional compute time, that’s actually a much better trade-off. If you increase model parameters, the system—the coherent system—has to grow. That means your “blast radius” grows. If you’re building a cloud and the overall integrated time is the same, it’s better to choose the path where you have the same amount of CPU or GPU hours deployed, but when one GPU goes down, you only take down that one node.
In the previous scenario, if you had a huge model spanning multiple nodes, maybe 10 nodes go down for one user. You’d rather have 10 users each on one node rather than one user spanning 10 nodes. That’s a better overall cloud deployment choice. In terms of total GPU hours, it’s the same. Let’s assume you’re making a choice between racing to the finish with a huge model or just using longer compute time with a smaller one. For a large-scale deployment, I’d rather go for the smaller, more flexible approach. At least that’s my view. So if we continue to push in that direction, it’s an “easier” problem when it comes to building scalable, more reliable systems. I’m not sure if you can make that tradeoff in training, but at least with inference, you can.
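Bill's blast-radius tradeoff in numbers, under the simplifying assumption that a deployment is unavailable whenever any node it spans is down:

```python
def capacity_lost_on_one_failure(nodes_per_replica: int, total_nodes: int = 10) -> float:
    """Fraction of serving capacity lost when a single node fails."""
    replicas = total_nodes // nodes_per_replica
    return 1 / replicas   # exactly one replica becomes unavailable

print(capacity_lost_on_one_failure(1))   # 0.1: ten single-node users, one affected
print(capacity_lost_on_one_failure(10))  # 1.0: one ten-node deployment, all of it down
```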
Ravi: Yeah, I agree. On the use-case side, there are workloads that don’t require real-time latency. For example, very large summarization tasks or insurance case disposals using generative AI applications that you might run in large batches overnight. Those don’t need ultra-low latency. In contrast, a sales call powered by AI is real-time—you're speaking as an AI to a consumer. So one way people think about it is real-time vs. batch. If those tasks go to different types of clouds within the overall cloud, that can be more efficient. Some applications might lend themselves to making that trade-off, while things like chat and sales need real-time responses and can’t make that latency trade-off.
Bill: Also, building the smaller, longer compute-time models always wins out on cost, because you’re not paying for that huge interconnect bandwidth and so forth.
Ravi: Yes, exactly. Another aspect is on the application side, making the applications themselves more efficient. Especially when you think about multiple agents helping to solve the problem. The purpose of very large models has generally been that they can do all the steps, right? But if you use the same large model even for simpler tasks, it may be overkill. In a chat application, for example, you want a model to check for profanity, another to fact-check, and then a third to produce the full answer. If all of these requirements invoke very large models, that’s inefficient.
Ideally, you’d have multiple models (and some very distilled versions of large models) working in conjunction with the actual large model. An orchestration layer, which is perhaps another model, can help optimize and use this combination of models to solve a complex problem. Max might be an expert on this, since your team at Together may be working on this. It’s one approach I’ve found effective when you try to tinker and create more efficient applications. Eventually you converge on these kinds of optimizations.
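A minimal sketch of that orchestration pattern, with every model call stubbed out (this is not Together's or Krutrim's code): small, cheap models handle the checks, and the large model is invoked only for the hard part.

```python
def small_moderation_model(text: str) -> bool:
    """Stub: a distilled classifier deciding whether the request is safe."""
    return "badword" not in text.lower()

def small_fact_check_model(answer: str, sources: list[str]) -> bool:
    """Stub: a small model checking the draft answer against retrieved sources."""
    return True

def large_model(prompt: str) -> str:
    """Stub: the expensive, general-purpose model."""
    return f"[detailed answer to: {prompt}]"

def orchestrate(user_message: str, sources: list[str]) -> str:
    if not small_moderation_model(user_message):
        return "Sorry, I can't help with that."   # the large model is never called
    answer = large_model(user_message)
    if not small_fact_check_model(answer, sources):
        answer = large_model(f"Revise with citations: {user_message}")
    return answer

print(orchestrate("How do heat pumps work?", sources=[]))
```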
Max: Yeah, people are definitely building systems with some degrees of hardware orchestration and fault tolerance already. Large GPU deployments are very hard to manage. The question is whether software systems and our methods can evolve to the point where we can use these systems in a more reliable way. Instead of, as Bill mentioned, taking out the entire GPU fleet or stopping service to tens of thousands of users because one GPU failed, maybe we can do something different. But what needs to change to make that really feasible?
I also wanted to add to Bill's mention of models evolving in two directions: larger models vs. smaller, more efficient models. I don't think these directions necessarily conflict. For more than five years now, we've seen the benefits of model distillation. You can train a large, fully capable model and then use its knowledge to train smaller, specialized models for particular domains. If that approach remains relevant, then we might see big models used to create smaller and more efficient ones. Some smaller models are already built this way—NVIDIA, Microsoft, and others have shown that you can achieve good results with these techniques.
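The classic form of the distillation Max refers to (Hinton-style knowledge distillation) trains the student to match the teacher's softened output distribution. The sketch below shows just the loss term; the batch size, vocabulary size, and temperature are illustrative, not taken from any of the models mentioned.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

student_logits = torch.randn(4, 32000, requires_grad=True)   # batch x vocab
teacher_logits = torch.randn(4, 32000)                        # from the frozen teacher
distillation_loss(student_logits, teacher_logits).backward()
```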
Gopal: Ravi and Bill – you’ve both worked with some of the most unique founders alive in Bhavish with Ola, Ola Electric, and Krutrim and, obviously, Elon with Tesla, xAI, and SpaceX. What stories will you remember from building ambitious projects with such iconoclastic people?
Ravi: Bhavish had been thinking about doing something on computing for the last few years. He had the vision in pieces. And then, there are moments that trigger action. Someone from one of the larger semiconductor companies remarked that they refer to India as a “1% market.” If 100 servers are shipped globally, India accounts for one — and yet we’re 20% of the population.
We started thinking about why cloud computing is so under-penetrated in India on a per-capita level. That first-principles thought process — why are we only 1% for 20% of the population? — led to the inspiration of creating the company. Solving for cost-efficient, performant compute can not only increase penetration in the short term but can lead to compounding benefits for creators who build on cloud in the long term. That approach, linked to one comment, will stick with me forever.
Bill: I have a bunch of stories, but there’s one that really sticks in my head. One thing I highly respect about Elon is that he’s basically the same person whether you see him in an interview or in a meeting. I respect that consistency.
I remember a time when he told the Autopilot team — "What we're doing is super important. We have to bring this technology out to make things safer. But what this looks like is: if you save nine people's lives, they never know it. The one you didn't save will sue you. Don't worry about what others say – just do the right thing." He said, "Make your best call. Because whatever you do, people will say you're wrong." He said this in an interview a few months later. It stuck with me. It's true — people judge you afterward, but all you can do at the time is what you think is right. Surround yourself with smart people, get different perspectives, but do the right thing.
Max: Since you mentioned BigScience earlier, I recall a time when we were working out in the open on large language model training. Stas Bekman, who was an engineer at Hugging Face, experimented a lot to make large models converge given the infrastructure we had at the time. We saw how some decisions in experiment setup caused divergences in the loss — instabilities that became big issues. Then, at some point, everything clicked and worked, maybe because we got hardware that supported BF16 precision or because we fixed some other detail. Stas documented all those findings in an open document, which I encourage everyone to read. That kind of experience is priceless.
Gopal: What other questions would you have asked each other if I weren’t here?
Bill: I would love to pick everyone's brain more about the future. We have an immense opportunity where there's a new application – AI – with enormous demand. This is a once-in-a-lifetime chance to ask: how can we use technology to make things better, faster, cheaper? How do we build better systems for what's coming?
There’s a chicken-and-egg issue — software guys can’t program what doesn’t exist. But this is unique: if something existed, we might build it. That’s super exciting. It’s fundamentally why I joined SPC, to see what’s out there and what we can do better.
Ravi: One thing, like Bill said, is the idea of world models — where you keep observing the world and the model figures out the physics behind it, with applications in robotics, etc. It seems obvious it’s going to happen, so maybe we should build datasets and infrastructure in anticipation. That’s exciting: building the building blocks now for something we know is coming.
Max: I would broaden that. We’re seeing lots of exciting new developments: world models, agentic interactions, inference-time compute. We need to consider if our systems and ways of thinking about ML are up to the challenge. There’s a lot unknown about how to implement these new paradigms efficiently and practically.
Ravi: Yes, and it requires a lot of courage to invest and build in anticipation of something that might come, rather than just improving what already exists. Some of the best leaders — Elon Musk, Jensen Huang — have done exactly that. They anticipated future needs and built toward them.
Gopal: On that thread of looking forward, at SPC, we don’t have requests for startups; we have “axes of curiosity.” What are the threads you think are worth spending a decade on?
Ravi: I’m not an expert in this domain, but one axis of curiosity is: what is the next leap in computing? Some say quantum, others say different paradigms. Relatedly, the human brain is very energy-efficient. It’s not just bigger compute = better. Maybe we need a different methodology.
Bill: There are so many directions. One I think about a lot is how we solve reasoning. CNNs model the optic nerve and occipital lobe; we solved vision. Transformers, maybe, are similar to the hippocampus. The largest part of the human brain by volume is the frontal cortex, which handles reasoning. We haven't tackled that. By analogy, we have a long way to go. It's a fascinating time.
Gopal: I completely agree, and I think that's a great place to wrap. I appreciate all of your time, thoughtfulness, and curiosity. This is the most precious asset we have — time with people who are curious and excited to learn. Thanks for spending it together.