Successfully tackling artificial intelligence (AI) workloads is not just a matter of throwing compute and storage resources at the problem. Sure, you need enough processing power and storage that can supply data at the right rate, but before any of that can succeed, it’s critical to ensure the quality of the data used in AI training.
That’s the core message from Par Botes, vice-president of AI infrastructure at Pure Storage, whom we caught up with last week at the company’s Accelerate event in Las Vegas.
Botes emphasised the need for enterprises tackling AI to capture, organise, prepare and align data. That’s because data can often be incomplete or inappropriate to the questions AI tries to answer.
We talked to Botes about data engineering, data management, the use of data lakehouses and making sure datasets fit the need being addressed by AI.
What does Pure Storage view as the key upcoming or emerging storage challenges in AI?
I think it’s hard to create systems that solve problems using AI without having a really good way of organising data, capturing data, then preparing it and aligning it to the processing elements, the GPUs [graphics processing units], so they can access data fast enough.
What in particular makes those challenges difficult?
I’ll start with the most obvious one: how do I get GPUs to consume the data? The GPUs are incredibly powerful, and they drive a tremendous amount of bandwidth.
It’s hard to feed GPUs with data at the pace they consume it. That is increasingly being solved, particularly at the high end. But for a regular enterprise type of company, these are new types of systems and new skills they have to implement.
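To make that concrete, here is a minimal, hypothetical sketch of what keeping GPUs fed can look like in practice, using PyTorch’s DataLoader in Python; the dataset, batch size and worker settings are illustrative assumptions, not anything specific to Pure Storage or its customers.

```python
# Minimal sketch: keeping a GPU fed by overlapping storage reads with compute.
# The dataset, batch size and worker counts below are illustrative only.
import torch
from torch.utils.data import DataLoader, Dataset

class SampleDataset(Dataset):
    """Hypothetical dataset that would read pre-processed samples from shared storage."""
    def __init__(self, num_items: int = 10_000):
        self.num_items = num_items

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        # In practice this would read a sample from a file or object store;
        # here we fabricate a tensor so the sketch is runnable.
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    SampleDataset(),
    batch_size=256,
    num_workers=8,          # parallel readers hide storage latency
    pin_memory=True,        # page-locked buffers speed up host-to-GPU copies
    prefetch_factor=4,      # each worker keeps batches queued ahead of the GPU
    persistent_workers=True,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # overlap the copy with compute
    # ... forward/backward pass would go here ...
    break
```

The point is simply that parallel readers and prefetching overlap storage reads with GPU compute, which is the operational muscle Botes is describing.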
It’s not a hard problem on the science side, it’s a hard problem in operations, because these are not muscles that have existed in enterprise for a long time.
The next part of that problem is: How do I prepare my data? How do I gather it? How do I know where I have the correct data? How do I assess it? How do I track it? How do I apply lineage to it to see that this model is trained with this set of data? How do I know that it has a complete dataset? That’s a very hard problem.
Is that a problem that varies between customer and workload? Because I can imagine that in one organisation, the expertise that resides there means you know you have all the data you need, while in another it might be unclear whether you do or not.
It’s pretty hard to know, without reasoning about it, whether you have all the data you need. I’ll give you an example.
I spent many years building a self-driving car – perception networks, driving systems – but frequently, we found the car didn’t perform as well in some conditions.
Say the road turned left and slightly uphill, with other cars around it. We then realised we didn’t have enough training data for that situation. So, having a principled way of reasoning about the data – reasoning about completeness, reasoning about the range [of data] – having all the data for that, and analysing it mathematically, is not a discipline that’s super common outside of high-end training companies.
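As one illustration of that kind of principled reasoning, the sketch below buckets training samples by scenario attributes and flags under-represented combinations; the attribute names, values and threshold are hypothetical, not drawn from any real dataset.

```python
# Minimal sketch of "reasoning about completeness": bucket training samples by
# scenario attributes and flag combinations with too few examples.
from collections import Counter
from itertools import product

# Each record describes the conditions a training sample was captured in.
samples = [
    {"turn": "left", "grade": "uphill", "traffic": "dense"},
    {"turn": "left", "grade": "flat", "traffic": "light"},
    {"turn": "right", "grade": "flat", "traffic": "dense"},
    # ... in reality, millions of records ...
]

coverage = Counter((s["turn"], s["grade"], s["traffic"]) for s in samples)

MIN_SAMPLES = 2  # illustrative threshold for "enough" data per condition
all_conditions = product(
    ("left", "right", "straight"),
    ("uphill", "flat", "downhill"),
    ("dense", "light"),
)

for condition in all_conditions:
    if coverage[condition] < MIN_SAMPLES:
        print(f"under-represented: {condition} ({coverage[condition]} samples)")
```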
Having looked at the difficulties that can arise with AI workloads, how would you say customers can begin to mitigate them?
The general approach I recommend is to think about your data engineering processes. So, we partner with data engineering companies that do things like lakehouses.
Think about: How do I apply a lakehouse to my incoming data? How do I use my lakehouse to clean it and prepare it? In some cases, maybe even transform it and make it ready for the training system. I would start by thinking about the data engineering discipline in my company and how I prepare that to be ready for AI.
What does data engineering consist of if you drill down into it?
Data engineering generally consists of questions like: how do I get access to datasets that exist in corporate databases, in structured systems or in other systems we have? How do I ingest that into an intermediate form in a lakehouse? And how do I then transform that and select data from sets that might be spread across different repositories to create a dataset that represents the data I want to train against?
That’s the discipline we typically call data engineering. And it’s becoming a very distinct skill and a very distinct discipline.
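To make that flow concrete, here is a minimal, hypothetical Python sketch of ingest, intermediate landing and curation; the table names, columns and paths are invented, and sqlite3 plus Parquet files stand in for whatever source systems and lakehouse a given organisation actually runs.

```python
# Minimal sketch of the ingest -> intermediate store -> curated dataset flow.
# Table names, columns and file paths are hypothetical; sqlite3 stands in for
# whatever corporate source systems actually hold the data.
import sqlite3
from pathlib import Path

import pandas as pd

Path("lake/raw").mkdir(parents=True, exist_ok=True)
Path("lake/curated").mkdir(parents=True, exist_ok=True)

# 1. Ingest: pull raw records out of a source system.
source = sqlite3.connect("corporate_app.db")
orders = pd.read_sql("SELECT order_id, customer_id, amount, created_at FROM orders", source)
customers = pd.read_sql("SELECT customer_id, region, segment FROM customers", source)

# 2. Land the raw data in an intermediate, columnar form (the lakehouse's raw zone).
orders.to_parquet("lake/raw/orders.parquet")
customers.to_parquet("lake/raw/customers.parquet")

# 3. Transform and select across repositories to build the training dataset:
#    join, drop obviously bad records and keep only the fields the model needs.
raw = orders.merge(customers, on="customer_id", how="inner")
curated = (
    raw[raw["amount"] > 0]                   # drop clearly invalid rows
    .dropna(subset=["region", "segment"])    # require the features we train on
    [["amount", "region", "segment"]]
)
curated.to_parquet("lake/curated/training_set.parquet")
```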
When it comes to storage, how do customers support data lakehouses with storage? In what forms?
Today, what’s common is that you have the cloud companies, which provide the data lakehouses, and for on-prem, we have the system houses.
We work with several of them. We provide complete solutions that include data lakehouse vendors. And we partner with those.
And then, of course, there’s the underlying storage that makes it perform fast and work well. So the key components, I’d say, are the popular data lakehouse databases and the infrastructure beneath them, and then connecting those into other storage systems for the training side.
Looking at data engineering, is it really a one-time, one-off challenge, or is it something that’s ongoing as organisations tackle AI?
Data engineering is kind of hard to disentangle from storage. They’re not exactly the same thing, but they’re closely related.
Once you start using AI, you want to record all new data. You want to transform it and make it part of your AI system, whether you’re using that with RAG [retrieval augmented generation] or fine-tuning or, if you are advanced, building your own model.
You’re constantly going to increase it and make it better. As your data improves, as your insights change, your data has to change with it. Thus, your model has to evolve with it.
This becomes a continuous process.
You have to think about a few things, such as lineage. What’s the history of this data? What originated from where? What’s consumed where? You also want to think about when people use your model, or when you use it internally: what’s the question being asked, and what answer comes back with it?
And you want to store and use that for quality assurance, also for further training in the future. This becomes what we call an AI flywheel of data. The data is constantly ingested, consumed, computed, ingested, consumed, computed.
And that circle doesn’t stop.
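For the lineage piece of that, a minimal sketch using only the Python standard library might look like the following; the file paths and field names are illustrative choices rather than any particular product’s schema.

```python
# Minimal sketch of training-data lineage, using only the standard library.
# The file paths and field names are illustrative, not any product's schema.
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(paths: list[str]) -> str:
    """Hash the dataset files so a model can be traced back to its exact inputs."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        with open(path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()

def record_training_run(model_name: str, dataset_paths: list[str]) -> None:
    """Append a lineage record: which data, in which state, trained which model."""
    entry = {
        "model": model_name,
        "dataset_fingerprint": dataset_fingerprint(dataset_paths),
        "dataset_paths": dataset_paths,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with open("lineage.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# The questions asked of the model, and the answers that come back, can be
# appended to a similar log for quality assurance and future training.
```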
Is there anything else you think customers ought to be looking at?
You should also think about what this data really is, what the data represents. If the data represents something you observe or something you do, and you have gaps in it, the AI will fill in those gaps. When it fills in those gaps wrongly, we call it hallucination.
The trick is to know your data well enough that you know where there are gaps. And if you have gaps, can you find ways to fill out those gaps? When you get to that level of sophistication, you’re starting to have a really impressive system to use.
Even if you start with the very basics of using a cloud service, start by recording what you send and what you get back, because that forms the basis for your data management discipline. And when I use the term data engineering: in between data engineering and storage is this discipline called data management.
This is the organisation of data, which you want to start as early as you can. Because by the time you get ready to do something beyond just using the service, you now have the first body of data for your data engineers and for your storage.
That’s a tremendous insight that I wish everyone would consider doing really quickly.
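Taking that closing advice literally, a minimal sketch of recording what you send to and get back from a hosted model could look like the following; call_cloud_model is a hypothetical stand-in for whichever cloud service is actually used, and the log format is an assumption.

```python
# Minimal sketch: log every request and response when using a hosted model.
# `call_cloud_model` is a stand-in for a real API; the log path is illustrative.
import json
from datetime import datetime, timezone

def call_cloud_model(prompt: str) -> str:
    """Placeholder for a real API call to a hosted model."""
    return "example response"

def logged_call(prompt: str, log_path: str = "ai_calls.jsonl") -> str:
    response = call_cloud_model(prompt)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")  # this log seeds later data engineering
    return response

answer = logged_call("Summarise last quarter's support tickets.")
```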