Disclaimer: The F1 FORMULA 1 logo, F1 logo, FORMULA 1, F1, FIA FORMULA ONE WORLD CHAMPIONSHIP, GRAND PRIX and related marks are trademarks of Formula One Licensing BV, a Formula 1 company. All rights reserved. I’m just a dude in a basement with too much time on my hands.
This post is the third part of my series on using ML.NET to predict the strategies that teams will adopt in Formula 1. I highly suggest reading the first part, where I explain the approach I took.
The Spanish Grand Prix is probably the most predictable one
Usually at the beginning of each season teams spend a few days testing at the Catalunya racetrack. This year was an exception: testing took place in Bahrain due to the Covid-19 restrictions. Just to give you an example: over the whole of his career, Kimi Räikkönen did 5983 laps in testing and only 843 while racing around this track. Nearly 88% of the laps he’s done at this track were in testing. Teams have a lot of data on this track and know it very well. This makes racing here predictable and, to be honest, quite boring. But boring is good for predictions.
Changes in the dataset
The overall data structure stayed the same. As before, I added data on all the new 2021 races since the previous post (Portimão and Imola), plus data from the Spanish Grand Prix for the 2017-2020 period. This gives us 250 new rows and brings the total up to around 750.
Changes in the code
In the previous part I used the default regression model for training, and it didn’t perform very well. So I decided to give AutoML a try. AutoML does what the name suggests: it automatically looks for the best model. You declare how much time you want to spend on an experiment and let ML.NET try out multiple algorithms with different parameters:
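using Microsoft.ML.AutoML; // ExperimentResult, RunDetail and the Auto() extension
using Microsoft.ML.Data;   // RegressionMetrics

uint experimentTime = 600; // experiment budget in seconds (10 minutes)
ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
    .CreateRegressionExperiment(experimentTime)
    .Execute(trainingData, progressHandler: null, labelColumnName: "Laps");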
Then you can pick the best model from that experiment and run predictions with it:
RunDetail<RegressionMetrics> best = experimentResult.BestRun;
ITransformer trainedModel = best.Model;
var predictionEngine = mlContext.Model.CreatePredictionEngine<TyreStint, TyreStintPrediction>(trainedModel);
var lh = new TyreStint() { Track = "Bahrain International Circuit", TrackLength = 5412f, Team = "Mercedes", Car = "W12", Driver = "Lewis Hamilton", Compound = "C3", AirTemperature = 20.5f, TrackTemperature = 28.3f, Reason = "Pit Stop" };
var lhPred = predictionEngine.Predict(lh);
The downside is, I haven’t figured out yet how to actually feed all those feature transformations we did in previous posts into the AutoML flow. So we cannot manually mark which data columns are categories or add new features like we did with distance. And predicting the number of laps on tracks with different lap lengths makes me feel the results may not be so reliable. But let’s roll with it, and see where that leads us.
And the metrics look quite promising. The best algorithm after the 10-minute experiment ended up being FastTreeRegression. And for the first time we have a non-negative R-squared (0.31 on the test data). That’s the best-performing model on test data we’ve had so far, even though we took a step back by predicting laps instead of distance.
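One avenue that might help here is the `ColumnInformation` class from `Microsoft.ML.AutoML`, which can be passed to an `Execute` overload to say which columns are categorical and which are numeric. The sketch below is a minimal example under assumptions, not the code from the repository: the column names are taken from the `TyreStint` initializer shown earlier, the split into categorical and numeric columns is an assumption, and the `AutoMlColumnHints` class with its two helper methods is hypothetical.

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public static class AutoMlColumnHints
{
    // Hypothetical helper: describes the columns to AutoML instead of letting it infer them.
    public static ColumnInformation BuildColumnInformation()
    {
        var columnInfo = new ColumnInformation { LabelColumnName = "Laps" };

        // Assumption: these string columns should be treated as categories.
        foreach (var name in new[] { "Track", "Team", "Car", "Driver", "Compound", "Reason" })
            columnInfo.CategoricalColumnNames.Add(name);

        // Assumption: these float columns are plain numeric features.
        foreach (var name in new[] { "TrackLength", "AirTemperature", "TrackTemperature" })
            columnInfo.NumericColumnNames.Add(name);

        return columnInfo;
    }

    // Hypothetical helper: the same regression experiment, but with explicit column hints.
    public static ExperimentResult<RegressionMetrics> RunExperiment(
        MLContext mlContext, IDataView trainingData, uint experimentTime = 600)
    {
        return mlContext.Auto()
            .CreateRegressionExperiment(experimentTime)
            .Execute(trainingData, BuildColumnInformation());
    }
}
```

That `Execute` overload also accepts an optional `preFeaturizer` estimator, which could be a place to plug in custom transformations. Whether either option improves the results on this dataset is untested here.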
=============== Training the model ===============
Running AutoML regression experiment for 600 seconds...
Top models ranked by R-Squared --
| Trainer RSquared Absolute-loss Squared-loss RMS-loss Duration |
|1 FastTreeRegression 0,3971 5,49 52,55 7,21 2,2 |
|2 FastTreeRegression 0,3880 5,53 53,86 7,28 3,6 |
|3 FastTreeRegression 0,3731 5,50 54,11 7,32 18,0 |
===== Evaluating model's accuracy with test data =====
*************************************************
* Metrics for FastTreeRegression regression model
*------------------------------------------------
* LossFn: 51,19
* R2 Score: 0,31
* Absolute loss: 5,29
* Squared loss: 51,19
* RMS loss: 7,15
*************************************************
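For anyone unfamiliar with the numbers in this report, here is a rough sketch of how these regression metrics are conventionally defined. It is a plain C# illustration with a hypothetical `RegressionReport` helper, not ML.NET’s own implementation. Note that the RMS loss is simply the square root of the squared loss, which matches the report above (the square root of 51.19 is about 7.15).

```csharp
using System;
using System.Linq;

public static class RegressionReport
{
    // Returns (R-squared, mean absolute error, mean squared error, root mean squared error).
    public static (double RSquared, double AbsoluteLoss, double SquaredLoss, double RmsLoss)
        Evaluate(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");

        double mean = actual.Average();
        double absoluteLoss = actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();
        double squaredLoss = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();
        double totalVariance = actual.Select(a => (a - mean) * (a - mean)).Average();

        // R-squared compares the model with always predicting the mean of the actual values:
        // 1 is a perfect fit, 0 is no better than the mean, negative is worse than the mean.
        // It is undefined when every actual value is identical (totalVariance is 0).
        double rSquared = 1.0 - squaredLoss / totalVariance;

        return (rSquared, absoluteLoss, squaredLoss, Math.Sqrt(squaredLoss));
    }
}
```

This is why a negative R-squared, like the ones from the previous posts, is such a bad sign: it means that always guessing the average stint length would have been more accurate than the model.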
Check out the latest version of the code in the repository.
Predictions for Spanish Grand Prix
Without further ado, let’s do predictions for the Spanish Grand Prix. All of the first 10 drivers on the grid will start on the soft tyres, which are the C3 compound in Spain.
In the table below I put my predictions and the actual first stints (since it’s already after the GP). What’s very suspicious is that in four of the five teams both teammates got exactly the same score (McLaren is the only exception), which makes me think that the model gave a lot of weight to the team, and not much to the driver. I was also surprised by how high the predictions were for soft tyres. But C3 is actually a pretty hard compound for a “soft tyre” (read the first post for more details on tyre compounds), and the predictions ended up being quite reasonable. As in the previous post, I bolded the predictions that ended up within 10% of the actual value. Four out of ten were pretty close, which is consistent with the metrics suggesting this is our best model yet.
Driver | Compound | Predicted laps | Actual laps |
---|---|---|---|
Max Verstappen | C3 | 26,7 | 24 |
Valtteri Bottas | C3 | 26,9 | 23 |
Lewis Hamilton | C3 | **26,9** | 28 |
Carlos Sainz | C3 | 26,7 | 22 |
Sergio Pérez | C3 | **26,7** | 27 |
Lando Norris | C3 | 26,5 | 23 |
Charles Leclerc | C3 | **26,7** | 28 |
Daniel Ricciardo | C3 | **27,2** | 25 |
Esteban Ocon | C3 | 26,8 | 23 |
Fernando Alonso | C3 | 26,8 | 21 |
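The 10% rule applied to the table above can be sketched as a small check. This is an illustration only: the `StintCheck` class and its methods are hypothetical names, and the numbers are copied from the table.

```csharp
using System;
using System.Linq;

public static class StintCheck
{
    // True when the prediction is within the given fraction of the actual value.
    public static bool IsWithinTolerance(double predicted, double actual, double tolerance = 0.10)
        => Math.Abs(predicted - actual) <= tolerance * Math.Abs(actual);

    // Returns the drivers whose predicted first stint was within 10% of the actual one.
    public static string[] DriversWithinTolerance()
    {
        // (driver, predicted laps, actual laps), copied from the table above.
        var stints = new (string Driver, double Predicted, double Actual)[]
        {
            ("Max Verstappen", 26.7, 24), ("Valtteri Bottas", 26.9, 23),
            ("Lewis Hamilton", 26.9, 28), ("Carlos Sainz", 26.7, 22),
            ("Sergio Pérez", 26.7, 27), ("Lando Norris", 26.5, 23),
            ("Charles Leclerc", 26.7, 28), ("Daniel Ricciardo", 27.2, 25),
            ("Esteban Ocon", 26.8, 23), ("Fernando Alonso", 26.8, 21),
        };

        return stints
            .Where(s => IsWithinTolerance(s.Predicted, s.Actual))
            .Select(s => s.Driver)
            .ToArray();
    }
}
```

With the values from the table, `StintCheck.DriversWithinTolerance()` returns Hamilton, Pérez, Leclerc and Ricciardo. Verstappen misses the cut narrowly: his prediction is off by 2.7 laps, while 10% of his 24-lap stint is 2.4 laps.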
What’s next
For the next post, which will coincide with the Monaco Grand Prix, I’ll try to feed the feature information we engineered in the first two posts into AutoML and see if that helps the model get even better. But generally I feel we’re reaching the limits of what’s possible with this approach. We have a lot of categorical data and not much numeric data, which makes it a bit hard for regression. I have a few ideas for new models, and we’ll explore them in future parts.