Are you ready to take the plunge?

Everybody’s talking about data lakes. But what exactly are they? And why should railway operators be interested? To find out, we spoke to Victor Borges, Digital Product Manager at Thales. 

Q. What is a data lake?

It’s a way of centralising data that comes from different places. Creating a data lake is the first step in any big data initiative: you have to get all the different sources of data together in one place if you’re going to carry out analytics and learn from the data. The big difference between a data lake and a conventional database is that you can fill a data lake with raw data from any source in its original format.

Q. What sort of data are we talking about here?

Just about anything. It ranges from raw data from the Internet of Things (IoT) through to highly structured data held in databases. For rail operators, data sources include diagnostic information from trackside equipment and systems such as axle counters, track circuits and interlockings. It also includes data from maintenance activities (planned and reactive), ticketing and train control systems and, in future, customer journey patterns and train occupancy. External sources of data – such as weather data – can also be very valuable. It all depends on what questions you want to ask and what insights you’re looking for.

Q. How does all of this help my railway network?

The key point is that it’s not just about sending data to a data lake for the sake of it – it’s about putting that data to work for you. With both existing and new data sets, there are a number of data science and machine learning techniques that allow you to identify important attributes to help you make decisions. These decisions could be related to the passenger experience, engineering, marketing, maintenance or operations.

For each question that we are trying to answer using data, we need to know what data is important. We then push that data into the data lake, which is where we do the analysis. A data lake grows over time, both in the volume of data from each source and in the number of sources feeding it, and that growth allows more value to be derived.

Q. Can you give an example of how it all works?

Our predictive maintenance solution, TIRIS, is a good example. One of the capabilities of TIRIS is predicting when rail assets such as point machines will fail. So an obvious starting point for data gathering is data from the point machines themselves. But we know that weather has an impact on reliability, so we take weather data as well – this comes from a completely different source. Maintenance management systems are yet another source of data. And train schedules might also help the prediction by including future demand on the asset in the models.

Next, you create a data pipeline to feed data from these sources into the data lake. Once the data is in place, you do some data manipulation. That’s one of the most important steps: you have to pre-process the data to add some structure before you can do anything else.
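
As an illustration only, the pre-processing step described here could be sketched in Python as follows. This is not the TIRIS implementation. The file formats, the field names (asset_id, ts, throw_time_s, temperature_c) and the sample values are all hypothetical, chosen to show two raw sources in different formats being parsed and aligned by time into one structured table.

```python
import csv
import io
import json
from bisect import bisect_right
from datetime import datetime

# Hypothetical raw inputs, kept in their original formats as they might sit in a data lake.
POINT_MACHINE_JSONL = """\
{"asset_id": "PM-001", "ts": "2024-01-10T08:15:00", "throw_time_s": 3.1}
{"asset_id": "PM-001", "ts": "2024-01-10T09:40:00", "throw_time_s": 3.6}
{"asset_id": "PM-002", "ts": "2024-01-10T09:05:00", "throw_time_s": 2.9}
"""

WEATHER_CSV = """\
ts,temperature_c
2024-01-10T08:00:00,-2.0
2024-01-10T09:00:00,-4.5
"""


def load_readings(text):
    """Parse JSON-lines point machine readings and convert timestamps."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            rec["ts"] = datetime.fromisoformat(rec["ts"])
            rows.append(rec)
    return rows


def load_weather(text):
    """Parse CSV weather observations into a time-sorted list of (ts, temperature)."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append((datetime.fromisoformat(rec["ts"]), float(rec["temperature_c"])))
    rows.sort()
    return rows


def attach_weather(readings, weather):
    """Add the most recent weather observation at or before each reading."""
    times = [ts for ts, _ in weather]
    out = []
    for rec in readings:
        i = bisect_right(times, rec["ts"]) - 1
        temperature = weather[i][1] if i >= 0 else None  # None if no earlier observation
        out.append({**rec, "temperature_c": temperature})
    return out


table = attach_weather(load_readings(POINT_MACHINE_JSONL), load_weather(WEATHER_CSV))
for row in table:
    print(row["asset_id"], row["ts"].isoformat(), row["throw_time_s"], row["temperature_c"])
```

The raw inputs are left untouched; the structure is added only in the derived table, which is the form a model would then consume.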

So once you have that data in place and organised in a way that is useful, you then need to apply some machine learning techniques. There are a number of different classes of models that you can use: classification models, prediction models, linear regression, deep learning and other artificial intelligence models.

This is what we’re doing with TIRIS. The predictions utilise data from different sources. In the case of point machines, we can identify the potential time-to-failure for an individual machine, based on a number of variables from different places.
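
To make the idea of a time-to-failure estimate concrete, here is a minimal Python sketch that uses linear regression, one of the model classes mentioned above. It is an assumption-laden illustration, not the method used in TIRIS. It fits a straight line to a single hypothetical degradation indicator (the time a point machine takes to throw, throw_time_s) and extrapolates to a hypothetical failure threshold. A real model would draw on many variables from different sources.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_to_threshold(days, values, threshold):
    """Estimate the days remaining until the fitted trend reaches the threshold.

    Returns None when the trend is flat or falling, because the line
    then never reaches the threshold.
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical history for one point machine: day index and throw time in seconds.
days = [0, 10, 20, 30, 40]
throw_time_s = [3.0, 3.1, 3.2, 3.3, 3.4]
FAILURE_THRESHOLD_S = 4.0  # assumed limit, for illustration only

remaining = days_to_threshold(days, throw_time_s, FAILURE_THRESHOLD_S)
print(remaining)
```

With these made-up numbers the fitted trend reaches the threshold roughly 60 days after the last reading. Adding weather, maintenance history or scheduled demand as further inputs would turn this into a multi-variable model of the kind the answer describes.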

Q. Will I have to migrate my existing databases to the data lake?

No. The data lake is simply a way of pooling data from different sources, then running data science on it. It shouldn’t affect your existing infrastructure or architecture. The “pipeline” between your existing databases and the data lake is achieved using secure interfaces.

Q. What’s the relationship between a data lake and the Thales digital platform?

A data lake is just a huge amount of available information. The Thales digital platform is what enables people to use the data lake. We offer services on the platform that enable customers to interrogate their own data, investigate trends in the data and create their own data science capability if they wish to. TIRIS is just one example of such a service.

Q. Finally, what sort of developments do you foresee?

We’re getting increasingly innovative in the ways we look at data. Take Traffic Management Systems. What kinds of trains are moving? What is the weight of the train? What types of locomotives are being used? All of this information could be connected to operational support tools, such as predictive maintenance. That’s why the data lake is so important: it allows us to make new connections and look at new ways that data can help us.