Data Platforms for the Next Decade

Data platforms are entering a new era. We sit down with Jeff, Neha, Tido, and Kushal to learn about scaling products to millions of customers (across both consumers and enterprises) and to explore how tomorrow’s engineers are thinking about building performant, complex systems.

Neha Narkhede: I'm the co-founder and board director at my first company, Confluent. I'm also founder and CEO at my second company, Oscilar, which makes fraud detection software. 

Kushal Khandelwal: I recently joined SPC to explore ideas in data infrastructure. Previously, I was leading the data and infra teams at JioCinema and Disney+ Hotstar, which are the largest sporting OTT applications in India. 

Jeff Rothschild: I've been working in storage and operating systems, computer networks, and a host of other areas as an engineer for about 50 years, and probably best known for being co-founder of Veritas Software. I was also the founding VP of Engineering at Facebook. Since then, I've been doing high-impact frontier investing. Basically, I do the irresponsible investments that the reasonable venture firms with good reason can't touch. 

Tido Carreiro: I was at Facebook, overlapping with Jeff a bit from 2008 to 2012. I worked with Aditya at Dropbox and then was VP of engineering and later Chief Product Officer at Segment, which was a customer data platform that we scaled to a couple hundred million in ARR and sold to Twilio. And now I'm the co-founder of a company called Koala. 

Gopal: Jeff, taking us all the way back to when you first joined Mark at Facebook, you were responsible for scaling one of the largest data ecosystems the world has ever seen. When you look back at the very early days, what were some of the core surprises that actually anchored your worldview? 

Jeff: The first learning was that perfection really is the enemy of getting it done. We needed to move fast. Facebook wasn't a sure thing. There were a lot of people moving in the marketplace. MySpace was a far more successful product than Facebook at the time. It had more than tenfold the user base. If it wasn't perfect, it was okay. That was the first lesson. 

I'd say the biggest learning though, over three years, was that it actually gets easier with scale. It's hard when you're dealing with a small number of resources and a highly variable demand and load. A lot of the challenges become easier when you have a bigger resource pool on which to build your service.

The law of large numbers is your friend. It gives you diversity both in location and in the type of equipment. You may even have time zone diversity where you're able to have isolation between portions of your fleet, which you don’t have when you're just starting out. That's probably a little bit less of a relevant learning today because these days, you would never start out building your own infrastructure from day one. You're going to use cloud providers, where you're able to exploit much of their scale and diversified deployment. 

I thought things would get more difficult as they got larger. Yet, time and time again, I just had to smile and say, “Wow, that was easy. Last year, that was a real challenge.”

Gopal: Everyone here has seen the early days scale into something quite large, from consumer to enterprise and infrastructure products. Why is it that certain problems become easier with scale? 

Tido: Even with Segment, which was maybe the next generation of cloud-based infrastructure, variable workloads were the source of basically all of the problems.

HBO was a customer, and when Game of Thrones would air, they wanted to send every viewer through our system like a heartbeat. Even with cloud and auto load-balancing, those bursts were really challenging. We spent way more time figuring out how to respond to them than on anything else. In some sense, the promise we made to customers was that they could throw anything our way and we'd count the API calls at the end of the contract and charge for them. That dynamic nature made it really hard, whereas when you get to scale, the scale evens you out. Spiky individual customers were where we spent almost all of our scaling engineering effort.

Kushal: I think the majority of my time at Hotstar went into dealing with these spiky traffic patterns. The biggest challenge was that your business-as-usual (BAU) day and your event day look very different. On a BAU day, your traffic is steady and predictable. The moment you enter a big event day, something like an IPL match or a key cricket match, the traffic becomes very bursty and very spiky.

Within minutes, the traffic would grow by millions of users, and that would bring in a flood of data that is very difficult to scale for in the moment. While our ingestion systems might scale, the downstream systems can't scale up that quickly. With more resources, it definitely becomes easier to handle those types of spikes.

Jeff: The TV show that caused us a bit of heartburn was Grey's Anatomy.

In 2005, we had a very homogeneous audience – largely East Coast and largely college students. When the first commercial would come on, everybody would be on the service but then immediately drop off again.

The homogeneous audience solved itself over time as the service was used by a broader swath of the population and in more geographies. We also found that the peak-to-trough ratio – i.e., maximum usage point to minimum number of users on the site – was about a thousand. The minimum usage point of the day was 1,000 times less than the maximum. That creates very unusual problems, both in how you manage the physical infrastructure and in how you build your reporting and observability systems, because almost anything that drops by a factor of a thousand would normally be judged an error.
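
To make the observability point concrete, here is a minimal sketch of the kind of seasonality-aware check this implies: compare the current minute against the same minute in prior weeks rather than against an absolute floor or the previous minute. The function name, tolerance, and numbers are purely illustrative.

```python
from statistics import mean

def is_anomalous(current: float, same_minute_prior_weeks: list[float],
                 tolerance: float = 0.5) -> bool:
    """Flag a drop only relative to the same time of day in prior weeks.

    With a 1000x daily peak-to-trough swing, alerting on absolute levels or
    minute-over-minute change would page constantly; a seasonal baseline does not.
    """
    if not same_minute_prior_weeks:
        return False  # no history yet, so don't alert
    baseline = mean(same_minute_prior_weeks)
    return current < (1.0 - tolerance) * baseline

# 4 a.m. traffic of 900 req/s is fine if prior weeks saw ~1,000 req/s at 4 a.m.,
# even though the daily peak might be closer to 1,000,000 req/s.
print(is_anomalous(900.0, [1000.0, 950.0, 1100.0]))  # False
print(is_anomalous(300.0, [1000.0, 950.0, 1100.0]))  # True
```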

Gopal: In some ways the build becomes somebody else's buy down the road, which was essential to the story of Confluent as well. Neha, at LinkedIn, before Apache Kafka and Confluent, what was your Grey's Anatomy sort of equivalent? What would become Confluent? Were there specific moments that caused that project to get spun up? 

Neha: For us, I think the big questions were: how do we commercialize Kafka without alienating the open source ecosystem? How are we going to build defensible IP while staying thought leaders in the Apache Kafka ecosystem? Should we start with a cloud native Kafka offering first? It wasn't very obvious when Confluent was founded in 2014.

It became clear to us after launching the fully managed Confluent Cloud offering that this question solved itself. The IP question goes away because there's no expectation of open sourcing a fully managed service. So, that became much easier.

We went, famously, with the open core model, where certain components were open source under an Apache-style license and certain code was closed source. All of it was fully managed in Confluent Cloud.

In retrospect, that decision turned out to be the right one because it allowed us to go into the enterprises on day one and land the Ciscos of the world. At the same time, we could launch the fully managed service at the right time because it takes about three to four years for a fully managed service to hit scale, especially if it's self-service.

Gopal: What were the other actually realistic paths that you could have gone down, aside from the one that you actually did? What were the open paths that were up for debate? 

Neha: Jay and I debated how to handle the tenancy model for the Confluent Cloud and a fully managed service.

On one hand, building a multi-tenant, fully managed SaaS platform was much easier from an operational standpoint. Of course, there's work to do, but it's easier to centralize updates, optimize resource usage, and scale out globally. But we also knew that security-conscious, big enterprises — the financial institutions that became our customers — wouldn't accept anything short of a private VPC deployment, which is a whole new layer of complexity: specialized tooling, operational processes, network architecture.

The main questions we wrestled with were: which tenancy model should we build first, and do we ever go down this private VPC route as well? When you look at fully managed data platforms, some still don't provide this option at all. Ultimately, we recognized that if we wanted to serve highly regulated industries with strict compliance requirements, at some point we had to support this kind of private VPC deployment.

That trade-off became one of the foundational decisions that shaped our product roadmap, engineering investments, go-to-market strategy, and so on.

Gopal: On that thread of multi-tenancy, I think Segment in particular is a pretty fascinating story on how to make a fundamentally multi-tenant product operate at scale. Tido, from what Neha was mentioning, are there certain threads that come to mind when you think about Segment over the last eight years and the way data platforms are changing now? What principles carry through?

Tido: While making our decision, which in retrospect I don't necessarily think was the right one, we were thinking about three different layers of the stack. There was a temptation to go further down into the true infrastructure layer, which I think would've been a good call.

We wanted to play at the middleware layer, which was this integration layer where we weren't actually owning the end user applications. We were making this bet on the importance of having a dozen or more best of breed tools and that owning the routing and plumbing layer was going to be powerful.

I think part of why Segment didn't get huge (it had a fine exit for sure) is that we were in this awkward place. We had a roadmap in the private VPC realm, but we weren't core enough infrastructure — the way Confluent is — for it to quite make sense. We got a lot of people interested on paper as we wrote the PRDs and co-designed with them. But when push came to shove and they needed to allocate two SREs to help manage the deployment, the interest would fade quickly.

We realized they were buying us more because we were this sort of SaaS thing that you don't need to manage. What we learned later, and what really saved Segment from stalling out at maybe $50-60 million of ARR, was actually going up to the application layer. We ended up solving a more vertical problem for marketing leaders at B2C companies, which enabled us to sell a $20 million contract to Procter and Gamble.

Gopal: I'm curious, Jeff, about your perspective from almost the opposite side of the table. Ideally for a lot of the problems that you want to solve, buying a product might be a lot simpler than having to build everything from scratch at Facebook. At Facebook, how did you guys navigate this build versus buy decision?

Jeff: It's an interesting question because the answer has a lot to do with core versus non-core infrastructure. If your business depends on elements of infrastructure then you better understand it implicitly and be able to support it better than anyone else in the world. It doesn't matter how good a purchased product is. You're dependent upon someone else caring as much about the continuity of your business as you do, and that's unrealistic. So for core infrastructure, we always built it. 

Obviously we started with MySQL, memcache, and PHP, but we very quickly developed expertise in each of those areas, to the point where we had some of the best MySQL engineers on the planet working at Facebook.

We rewrote memcache top to bottom. It was a great tool when we started using it, but we made it far more precise. We were working down to microsecond traces to get absolute efficiency out of that stack. We studied it as if we had to get a PhD in the subject and ended up really intellectually owning our infrastructure. The company has never let go of that. But if something isn't core, if the continuity of the service is not dependent on a technology stack, our preference was to use technology available from the outside.

Open source is really orthogonal to that issue. Much of what we built Facebook on was open source. We were able to embrace and extend it, which was the right answer because we could then take all of our improvements and pass them back to the community. 

Tido: It’s funny because Segment at its best was core infrastructure for our biggest and most strategic customers. The limitation was that we were selling to marketing teams that could never hire the talent they needed to build the thing, which is another build versus buy consideration.

Gopal: Where do we think data platforms are going? Even the nature of open source and of the commercial companies that help teams make full use of their data assets is dramatically changing.

Kushal: Over the last 10 years, there has been a big conversation around data democratization. From a business insights or analytics perspective, the interface has never been clean. If you're coming from a non-tech background and don't understand SQL, you end up talking to a lot of data analysts. I think this interface will evolve, with GenAI acting as that bridge layer, because it enables natural language interaction that makes data truly democratized. I think this interface evolution will happen in the coming decade, and it will change the way we perceive data platforms.
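
As a rough illustration of that bridge layer, here is a minimal sketch in which the model call is stubbed out: `generate_sql` is a placeholder for whatever LLM you would actually prompt, and the schema, data, and question are made up.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, created_at TEXT);"

def generate_sql(question: str, schema: str) -> str:
    """Placeholder for an LLM call that turns a natural-language question into SQL
    given the table schema. Hard-coded here so the sketch runs end to end."""
    # In practice: prompt a model with `schema` and `question`, then validate the
    # returned statement (read-only, known tables only) before executing it.
    return "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY 2 DESC"

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "NA", 120.0, "2024-01-02"), (2, "EU", 90.0, "2024-01-03"), (3, "NA", 40.0, "2024-01-04")],
)

question = "Which region generated the most revenue?"
for row in conn.execute(generate_sql(question, SCHEMA)):
    print(row)  # ('NA', 160.0) then ('EU', 90.0)
```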

Tido: One of the major things we were coaching at Segment was standardizing and cleaning up data. The first thing people would do, once the price of storing data plummeted because of the cloud data warehouse, was just throw everything at the wall. And then they would come back a year later and have a bunch of stuff that may or may not be trustworthy. With AI, you can take all of this unstructured data and make it useful.

Jeff: I don't think there's going to be a data platform a decade from now whose primary mode of interaction isn't some form of chat or conversational interface. This has to be a focus for everyone today, because it's going to change the way we interact with data. As Kushal was saying, it's about democratization. You don't have to take your question to an analyst who then figures out how to extract that data from some arcane data lake by going through seven versions of schemas.

Who is going to win in that space and how long will this all take? I don't know. 

Neha: I think it's also going to alter the entire data journey. GenAI and LLMs really shift that data journey from a purely structured SQL-centric pipeline to a more dynamic, unstructured vector driven process.

For instance, at ingestion, we will see a lot more text and image streams requiring on-the-fly transformation into vector embeddings. On the storage side, embedding indexes will emerge to handle these massive amounts of unstructured data, enabling the high-speed semantic search and context retrieval that LLMs require.

On the processing side, I think it'll involve very sophisticated orchestration pipelines that combine traditional ETL processes with model tuning and inference workflows, often requiring GPU-accelerated compute. Real-time and near-real-time inference will also introduce new SLAs and architectural constraints around latency and throughput.
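
A toy sketch of what embedding at ingestion can look like: each event is turned into a vector and written to an index as it arrives, rather than in a later batch job. The `embed` function and in-memory index below are stand-ins for a real embedding model and vector store.

```python
import hashlib
import math
from typing import Iterable

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for an embedding model: a deterministic pseudo-vector from a hash.
    In a real pipeline this would call a text or image embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(events: Iterable[dict], index: list) -> None:
    """Embed each event as it arrives and write it to the index, so semantic
    search is available immediately rather than after a nightly batch job."""
    for event in events:
        index.append((event, embed(event["text"])))

vector_index: list = []  # stand-in for a vector store / embedding index
ingest(
    [{"id": 1, "text": "card declined at checkout"},
     {"id": 2, "text": "password reset requested"}],
    vector_index,
)
print(len(vector_index), "events embedded at ingestion")
```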

Gopal: With Oscilar, how are you thinking about building for that future in your own product? What is different about how you're setting up your architecture with a new business and setting up your process with Confluent?

Neha: To put things in context, Oscilar is an advanced AI-powered fraud detection platform. The AI processes and infrastructure are built from the ground up and are ingrained across the entire data journey. From ingestion to storage to processing, everything is changing.

What we are really changing in this risk world is how quickly you can spot and respond to a new trend. The results of our research show that we can actually use LLMs to do anomaly detection. You need to think about model training and the imbalance that happens there. Building that kind of infrastructure and feedback loop — training and retraining using anomaly detection on the latest data patterns — is just one example of how deeply ingrained it is. On the consumption and inference side, it's about building a natural language interface for root cause analysis and reasoning on the data — something that simply doesn't exist today.

With LLMs, the power to reason is now available. You can ask questions like: what should I do with this? What are the trends? Why is this happening? We build things from the ground up — thinking about this world of LLMs and predictive machine learning models.

Kushal: The reasoning part is particularly interesting. The promise of deeper reasoning capability would definitely unlock new value in data. As compute and storage costs go down, the amount of data collected increases, but organizations tend to use a very limited set of data in their day-to-day business. Deeper analysis and reasoning on previously unexplored data would be very powerful.

Gopal: A constant thread regarding AI is the clear difference between analysis and insight. Another topic that I want to tackle is just the world of synthetic data and where people think that's going. In some ways, Koala is a net new pool of data that a lot of people might not have thought about using. Tido, what are the other kinds of data, whether it's unstructured or structured, that will grow in importance?

Tido: So, we're in sales tech. It's interesting if you think about just the role of a CRO at a B2B SaaS company. The most important thing they're figuring out is how to hit a certain number this quarter and whether they have the pipeline and sales process to meet that number.

They're taking reports, reading the opinions of everyone who enters data into the CRM, and rolling that up, often through four or five layers of management at larger companies.

Whether a prospect is a self-service user or running a POC, you can see how engaged they are. The CRM paradigm itself is broken — though Salesforce's integration lock-in means it won't go away overnight. Still, the shift is clearly toward synthesizing all this data. For Koala, that means empowering the new SDR just out of school with access to the same data that, for the past decade, only the best SDRs have mastered.

Gopal:  The end consumer of data has always been models. On Netflix, it’s not a person recommending Love Island — it’s the algorithm. What’s new is the awareness that much of the data we create isn’t for humans at all. Call notes in Salesforce, for instance, may never be read by a manager but are consumed by algorithms. Kushal, how do you see generative models becoming first-class consumers of data, not just producers?

Kushal: I’ve been thinking about this a lot. The way we interact with services is changing — earlier it was clicks. Now it’s natural language through chatbots, voice assistants, and agents. That means interfaces and interactions will evolve. The web was built for humans, but agents don’t need a UI to operate. Applications may soon be designed for agent-to-service interactions. GenAI will likely become the default interface, changing how we handle data and the layers around it. Models may not care about clicks or page views but about entirely different kinds of signals. This space is still evolving, and I think we’ll see many more iterations in how we navigate the world with GenAI.

Neha: Generative models are increasingly becoming first-class consumers of data. They ingest partially processed inputs to adapt and fine-tune, while also producing new data — summaries, synthetic training sets, and more — that other systems consume. This closes the loop, with a single engine continuously refining its dataset. Practically, that means paying close attention to data lineage — tracking what’s model-generated versus external — to avoid hallucination issues, upgrading infrastructure to handle high-throughput unstructured data, and evolving governance to prevent models from degrading by retraining on their own outputs. Ultimately, models as both consumers and producers push us toward a more dynamic, iterative data architecture where real-time adaptation and continuous learning become the norm.
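
One hedged way to picture the lineage point: tag every record with its provenance before it reaches the training set, so model-generated output can be excluded or only admitted deliberately. The field names below are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    source: str  # e.g. "user_event", "vendor_feed", "model_generated"

def training_candidates(records: list, allow_synthetic: bool = False) -> list:
    """Keep externally sourced data by default; admit model-generated records
    only when explicitly allowed, so a model doesn't retrain on its own output."""
    return [r for r in records if r.source != "model_generated" or allow_synthetic]

corpus = [
    Record("support ticket body", "user_event"),
    Record("auto-generated summary", "model_generated"),
]
print(len(training_candidates(corpus)))  # 1: the synthetic record is held back
```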

Gopal: Where do you see open source going now? What role do you think it'll play?

Neha: Of course, I should be a big supporter and hopeful about this, but I might take a contrary view on open source. It depends on who the users are. For core data infrastructure, developers expect it to be open source. There's no other way to build trust. But when it comes to training methods, fine-tuning, and the value we've developed on top of models, users don't expect that to be open. They just want the value delivered through an interface that solves real problems. So I don't think open source makes sense there. On the infrastructure side though, like products solving the closed-loop data pipeline problem, developers may expect those to be open source.

Jeff: I agree with you, Neha. If you’re building infrastructure, you need to support it. Enterprise customers can live with less than 100% availability because their users are internal, but public-facing infrastructure demands self support — which usually pushes you toward open source. For systems that are offline and focused on optimization or revenue enhancement, open source matters less since the SLA is lower. You still have yesterday’s dataset and can work with your vendor to recover. One thing I’d add: in analytics, any company not integrating LLMs and generative AI — both in pipeline generation and in presentation — won’t survive the next five years. And beyond analytics, system management is another major opportunity. Observability, monitoring, and availability engineering will all be completely retooled. Today’s systems won’t look the same half a dozen years from now.

Gopal: Betting on the future, what emerging ideas interest you most? Neha mentioned continuous learning around closed-loop pipelines, and the growing importance of synthetic data. What other areas are you curious to explore in the next few years within this ecosystem?

Neha: A few trends could redefine the data platform landscape. Vector-native data stores optimized for embedding similarity will become essential for real-time AI. Specialized hardware will lower inference latency and training costs, driving broader adoption of ML in both centralized and edge environments. I’m also curious whether edge computing will mature, with more on-device intelligence reducing latency and bandwidth needs. And finally, new orchestration frameworks blending serverless and container-native tech could enable hyper-scaling and seamless data integration. The future data stack will be composable, intelligent, and capable of continuous real-time analytics by default.

Tido: To take this in a different direction, I’m fascinated by studies comparing doctor performance with AI assistance versus AI making the call end-to-end. I’m curious about how humanity creates more leverage and where that leads. This is several layers above the data platforms, but the change in daily work is already happening — we see it in the developer community with tools like Cursor, and I think it’ll become mainstream soon. What I keep thinking about is how we consume data: what happens if we can process 10x or 100x more, and how that shifts our sense of purpose and the kinds of jobs we do.

Jeff: In the short term, the impact may just be efficiency. A cynical example is healthcare organizations that use LLMs to generate protest letters to insurers, who in turn use LLMs to respond. Sometimes with no humans involved. Hopefully we will do better than that. But as a software engineer, I can see LLMs reasoning over complex systems far beyond what individuals can grasp today. Imagine monitoring systems that not only detect anomalies but also understand your entire codebase, trace how it evolved over time, and integrate that with real-time data. That's not something site engineering teams can do now, but it's reasonable to expect in the next few years. It could transform how we build and manage complex systems across development and operations.

Gopal: In papers like DeepSeek and other reasoning work, there’s this notion of language mixing that I find fascinating. Some see it as a deficiency — pure RL reasoning where midway through a chain the system switches from English to Hindi to French and back again. But in some ways it mirrors what great people do: the ability to switch languages inside an organization. What are the things we don’t yet know to ask for? 

Jeff: We have the opportunity to rethink how this work is done. And on DeepSeek — it wasn’t a deficiency that it mixed languages, it was our deficiency. We don’t understand those languages. In fact, researchers found that constraining the model to a single language actually hurt performance. So clearly we’re the problem here — I’m kidding, of course — but it’s an interesting observation.

Aditya: What’s most interesting is that with any new technology, the first instinct is to redo the old thing but better. Today’s leading companies are modeled on search and retrieval. Our mental model of LLMs is mostly asking them to fetch known knowledge. Very few applications are doing true reasoning, multi-hop inference. The real fascination is what they’ll be used for in ways we can’t yet conceive. And as an investor, that’s what makes it exciting to stick around the hoop and see what gets built.

Gopal: I’ve noticed in Bangalore there are already pockets of the future that people elsewhere don’t recognize as novel. Voice as a modality, for example, has been mainstream in India for over a decade, while in the U.S. it still feels new. Kushal, outside of your professional work at Hotstar, I’m curious — having lived across ecosystems, what other lessons or unevenly distributed pieces of the future do you see? What feels novel in the U.S. that’s long been normal in India?

Kushal: One big learning from working at Hotstar and JioCinema was building at "India scale" and reaching the unreached population. Unlike in developed countries, the internet in India is patchy, especially in tier two and three cities, and device capabilities drop sharply. Interestingly, we built a streaming app for the JioPhone, a $10 device with just 32MB of RAM and a tiny processor. That's where innovation in India is coming from: understanding how the masses are accessing services and designing for real accessibility constraints. With weak keyboards and limited hardware, voice becomes powerful. If you can offload compute to the backend, it empowers people to do much more. Working within those infrastructure and demographic constraints creates a very different kind of experience.

Gopal: This connects to a broader point about why SPC and talent density matter. People building the future have a sharper sense of constraints because they hit them every day. Investors, as valuable as we are, don’t always see that reality map as clearly. So on that thread, what are the big constraints you keep running into as you build your businesses, which will inevitably be AI-native in some way?

Tido: One thing shifting rapidly is GenAI itself. We started experimenting with features six to nine months ago and weren’t impressed, so we focused back on the data platform. Revisiting two months ago, the outputs finally passed the “useful” test. And since then, the progress has been dizzying. The pace of improvement in speed, cost, and usefulness makes it hard to plan beyond three months. At the same time, you’re trying to project where things go at the application level — what workflows still matter if agents are 100x more capable at a tenth the price in just a couple years. It’s exciting, but the price-performance curve is rebooting every month.

Neha: I agree, and we’ve found the same. But architecturally, some things require a real overhaul — especially storage. Managing and indexing massive volumes of unstructured, high-dimensional data is essential for AI workflows, particularly in real-time fraud detection. Traditional stores like Kafka or ClickHouse aren’t always optimized, so we’re looking at vector databases for low-latency similarity search. On the processing side, batch engines like Spark and Flink remain useful for heavy data prep, but real-time inference needs event-driven orchestration and GPU-aware scheduling to hit strict latency targets. Real-time streams are no longer just for logs or lightweight analytics — they’re critical data feeds for refreshing models, solving data drift, and delivering inference in under 100ms. What we’re moving toward is end-to-end streaming architectures where ingestion, feature transformation, and model scoring all happen continuously rather than in batch windows.
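
As a toy sketch of that continuous path, the loop below transforms each event into features and scores it as it arrives, checking a per-event latency budget along the way. The consumer, feature logic, and model here are all placeholders, not Oscilar's actual pipeline.

```python
import time

LATENCY_BUDGET_MS = 100  # illustrative per-event inference SLA

def to_features(event: dict) -> list[float]:
    """Stand-in for continuous feature transformation on the stream."""
    return [event["amount"] / 100.0, float(event["is_new_device"])]

def score(features: list[float]) -> float:
    """Stand-in for model inference, e.g. a fraud-risk score between 0 and 1."""
    return min(1.0, 0.2 * features[0] + 0.5 * features[1])

def handle(event: dict) -> float:
    start = time.perf_counter()
    risk = score(to_features(event))
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > LATENCY_BUDGET_MS:
        print(f"event {event['id']} missed the latency budget ({elapsed_ms:.1f} ms)")
    return risk

# Events would normally arrive from a stream consumer; a list stands in here.
for event in [{"id": 1, "amount": 420.0, "is_new_device": 1},
              {"id": 2, "amount": 35.0, "is_new_device": 0}]:
    print(event["id"], round(handle(event), 2))
```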

Gopal: This question of continuous updating and batch processing is one that you've been spending a lot of time exploring, Kushal. I’m curious what your thoughts are.

Kushal: What I’ve been exploring is the time it takes to go from data to value — whether that value is an insight or something else. How quickly you can derive value from data becomes far more relevant and the architecture around it is evolving. One major bottleneck is data quality. LLMs can process a lot, but if the source data is poor, it’s garbage in, garbage out. So the real question is how to solve for quality. A lot of people talk about “shifting left” — improving quality at the point of generation. At some stage it’s not about volume; ROI comes from high-quality data. More data isn’t always better — better data is.
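
A simple way to picture shifting left: validate events against a schema at the point of generation, before they ever land downstream. The required fields below are made up for illustration.

```python
REQUIRED_FIELDS = {"user_id": str, "event": str, "amount": (int, float)}

def validate(event: dict) -> list[str]:
    """Return a list of quality problems; an empty list means the event can be emitted."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return problems

print(validate({"user_id": "u1", "event": "purchase", "amount": 12.5}))    # []
print(validate({"user_id": "u2", "event": "purchase", "amount": "12.5"}))  # type problem
```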

Tido: Slightly related — a big theme for us is ensuring SDRs have relevant, trustworthy context when reaching out to businesses. Historically, that data has been scattered. The more raw you can get, the better. You’d rather have a call transcript than a rep’s recollection, because the transcript is the truth. So the challenge is accessing and joining all that primary data into the right context object to power workflows. In go-to-market tech especially, joining data and accessing raw transcripts is still a major problem. We’ve been focused on data collection and unification, and it feels increasingly urgent now that powerful agents are emerging to actually make use of it.