5. Does the vendor check that the data covers your use case?
Some unscrupulous vendors will promise you that machine learning can infer abnormal events from unrelated data, such as context, weather and utilisation patterns. Although this can be true in some cases, in our physical world of wear and tear and complex failure modes, we have learned that special attention needs to be paid to the data sources.
At Railnova, we failed at something as “simple” as a low-battery alert: we were missing the battery current and found out that the battery voltage alone was not sufficient to predict the battery state of charge for a certain battery type.
So we went on and spent two months featurising battery voltage, maintenance textual data, battery replacement data, weather, temperature, humidity conditions, utilisation patterns, the duration the train was parked before a battery failure, the number of engine starts before a failure, the location, the day of week and time of day, and added another 1,000 features with automatic time series featurisation. We then applied XGBoost (an advanced machine learning model) to our labelled and featurised data set, only to find out that we couldn’t predict low-battery failures accurately. We were simply missing the battery current information, which prevented us from predicting when this specific battery type was discharged.
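To make the point concrete, here is a minimal toy sketch in Python. It is not Railnova's data or pipeline. It assumes a hypothetical battery whose voltage stays flat over most of its charge range, which is one way voltage can fail to reveal the state of charge. All names and numbers (CAPACITY_AH, plateau_voltage, the 13.2 V plateau) are invented for illustration, and sensor noise is left out. The sketch compares the best guess available from voltage alone with a simple integration of the current, often called coulomb counting.

```python
import math
import random
import statistics

random.seed(0)

CAPACITY_AH = 100.0  # hypothetical battery capacity (assumption)
DT_H = 0.1           # time step in hours (assumption)
START_SOC = 0.8      # known starting state of charge (assumption)


def plateau_voltage(soc):
    """Toy voltage curve: flat between 20% and 90% state of charge."""
    if soc < 0.2:
        return 12.0 + 6.0 * soc
    if soc > 0.9:
        return 13.2 + 4.0 * (soc - 0.9)
    return 13.2


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


# Simulate a battery whose charge drifts inside the flat voltage region.
soc = START_SOC
true_soc, voltage, current = [], [], []
for _ in range(500):
    amps = random.uniform(-20.0, 20.0)  # negative means discharging
    if not 0.25 <= soc + amps * DT_H / CAPACITY_AH <= 0.85:
        amps = -amps  # reverse the current to stay inside the plateau
    soc = soc + amps * DT_H / CAPACITY_AH
    true_soc.append(soc)
    voltage.append(plateau_voltage(soc))
    current.append(amps)

# Best possible voltage-only guess: the mean state of charge observed
# for each distinct voltage reading. No model can beat this using voltage only.
by_voltage = {}
for v, s in zip(voltage, true_soc):
    by_voltage.setdefault(v, []).append(s)
voltage_only_pred = [statistics.fmean(by_voltage[v]) for v in voltage]

# Coulomb counting: integrate the current from the known starting charge.
estimate = START_SOC
coulomb_pred = []
for amps in current:
    estimate = estimate + amps * DT_H / CAPACITY_AH
    coulomb_pred.append(estimate)

voltage_only_rmse = rmse(voltage_only_pred, true_soc)
coulomb_rmse = rmse(coulomb_pred, true_soc)
print(f"voltage-only RMSE:     {voltage_only_rmse:.4f}")
print(f"coulomb-counting RMSE: {coulomb_rmse:.4f}")
```

In this toy setup every voltage reading is identical, so no amount of feature engineering on voltage can separate a nearly full battery from a nearly empty one, while the current signal recovers the state of charge directly. A real current sensor would add drift and an uncertain starting point, so the coulomb-counting error would not be zero in practice.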