Disclaimer: The F1 FORMULA 1 logo, F1 logo, FORMULA 1, F1, FIA FORMULA ONE WORLD CHAMPIONSHIP, GRAND PRIX and related marks are trademarks of Formula One Licensing BV, a Formula 1 company. All rights reserved. I’m just a dude in a basement with too much time on my hands.
This post is the fourth part of my series on using ML.NET to predict the strategies that Formula 1 teams will use. I highly suggest reading the first part, where I explain the approach I took.
The Monaco Grand Prix is all about tradition and prestige.
The first race on the streets of Monaco took place in 1929, although not as part of Formula 1. The track has changed a bit over time, but in general it has stayed the same, as the street layout hasn’t changed. It is narrow, with tight corners and nearly no run-off areas, so there is no room for error. It is the most prestigious race, and even though there are no extra points for winning it, it holds a special place in many drivers’ hearts. The racing itself is not fascinating, as overtaking is hard and the order usually doesn’t change much from qualifying, unless accidents happen (which is quite likely on such a tight circuit). The real charm of Monaco is not the racing but the spectacular views, the accompanying events and the overall glamour of the race weekend.
In terms of tyres, the streets are resurfaced every year and the surface is not very abrasive compared to other tracks. Traditionally the three softest compounds are used for this grand prix, so the hardest tyre here (C3) is the same as the softest one in Spain. And it’s hardly ever used, as it’s simply not grippy enough.
Changes in dataset
For the reasons mentioned below, I decided to physically add a “Distance” column (calculated as track length times number of laps) to the dataset. I also added data for all the 2021 races run since the previous post (the Spanish Grand Prix) and data from the 2017-2019 Monaco Grands Prix (the 2020 edition was initially postponed and eventually cancelled). This gives us around 200 new rows and brings the total up to around 950.
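For reference, here is roughly what the dataset row class looks like with the new label column. This is a sketch: the property names match the nameof() calls in the code further down, but the column indices, types and any remaining columns in the real file are assumptions on my part.

using Microsoft.ML.Data;

// Sketch of the dataset row with the precalculated Distance label.
// Only the columns referenced later in the post are shown; indices are illustrative.
public class DistanceTyreStint
{
    [LoadColumn(0)] public string Team;
    [LoadColumn(1)] public string Car;
    [LoadColumn(2)] public string Driver;
    [LoadColumn(3)] public string Compound;
    [LoadColumn(4)] public float AirTemperature;
    [LoadColumn(5)] public float TrackTemperature;
    [LoadColumn(6)] public string Reason;

    // Added directly to the dataset: TrackLength * Laps, in metres.
    // For example, a full Monaco race is roughly 3,337 m * 78 laps ≈ 260 km.
    [LoadColumn(7)] public float Distance;
}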
Changes in code
I continued building on the AutoML experiment from the previous post. The problem I had was that I couldn’t pass a pipeline to AutoML to make use of the additional information I have about the dataset and make training easier. But I found a way to tell it which columns are categorical and which are numerical using the ColumnInformation object:
var experimentResult = mlContext.Auto()
    .CreateRegressionExperiment(experimentTime)
    .Execute(
        trainingData,
        testingData,
        columnInformation: new ColumnInformation()
        {
            CategoricalColumnNames =
            {
                nameof(DistanceTyreStint.Team),
                nameof(DistanceTyreStint.Car),
                nameof(DistanceTyreStint.Driver),
                nameof(DistanceTyreStint.Compound),
                nameof(DistanceTyreStint.Reason)
            },
            NumericColumnNames =
            {
                nameof(DistanceTyreStint.AirTemperature),
                nameof(DistanceTyreStint.TrackTemperature)
            },
            LabelColumnName = nameof(DistanceTyreStint.Distance)
        },
        preFeaturizer: customTransformer
    );
I also experimented with using a preFeaturizer to convert the Laps feature into a covered Distance feature (check out part 2 for details on that). This didn’t work out so well. It seems to work OK for features that are not used as labels, but in this case it looks like the model gets confused, because it’s fitting a feature that isn’t in the input dataset. For more information on using custom transformers with AutoML, check out this repository.
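To give you an idea of what I mean by that customTransformer passed to Execute above, here is a minimal sketch of a Laps-to-Distance CustomMapping. The LapsInput and DistanceOutput classes and the TrackLength input column are illustrative, not the exact code from part 2.

// Illustrative input/output classes for a simple Laps -> Distance mapping.
class LapsInput
{
    public float Laps { get; set; }
    public float TrackLength { get; set; }
}

class DistanceOutput
{
    public float Distance { get; set; }
}

// A CustomMapping estimator that could be passed as the preFeaturizer.
var customTransformer = mlContext.Transforms.CustomMapping<LapsInput, DistanceOutput>(
    (input, output) => output.Distance = input.Laps * input.TrackLength,
    contractName: "LapsToDistance");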
So instead I decided to add that calculated feature (Distance = TrackLength * Laps) directly into the dataset and use it as the training label. I dropped the preFeaturizer but kept the ColumnInformation, as it looks like it improved the performance of the model.
Let’s look at the metrics after these changes:
=============== Training the model ===============
Running AutoML regression experiment for 600 seconds...
Top models ranked by R-Squared --
| Trainer RSquared Absolute-loss Squared-loss RMS-loss Duration |
|1 FastTreeTweedieRegression 0,4342 26005,21 1059975156,71 32557,26 0,9 |
|2 FastTreeTweedieRegression 0,4337 26270,96 1061084915,03 32574,30 0,6 |
|3 FastTreeTweedieRegression 0,4245 26281,03 1078213169,63 32836,16 0,8 |
===== Evaluating model's accuracy with test data =====
*************************************************
* Metrics for FastTreeTweedieRegression regression model
*------------------------------------------------
* LossFn: 1059975153,93
* R2 Score: 0,43
* Absolute loss: 26005,21
* Squared loss: 1059975156,71
* RMS loss: 32557,26
*************************************************
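In case you’re wondering where the numbers above come from, here’s a minimal sketch of evaluating the best model on the test set, assuming the variable names from the experiment code earlier. The exact formatting of my console helper may differ.

// Take the best model found by AutoML and score the held-out test data.
RunDetail<RegressionMetrics> bestRun = experimentResult.BestRun;
ITransformer trainedModel = bestRun.Model;

IDataView predictions = trainedModel.Transform(testingData);
RegressionMetrics metrics = mlContext.Regression.Evaluate(
    predictions,
    labelColumnName: nameof(DistanceTyreStint.Distance));

Console.WriteLine($"* R2 Score:      {metrics.RSquared:0.##}");
Console.WriteLine($"* Absolute loss: {metrics.MeanAbsoluteError:#.##}");
Console.WriteLine($"* Squared loss:  {metrics.MeanSquaredError:#.##}");
Console.WriteLine($"* RMS loss:      {metrics.RootMeanSquaredError:#.##}");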
An R-squared of 0,43 is another step up from 0,31 last week, and from the negative values in the first two weeks. I’m happy with that progress!
The version of the code used in this post can be found here.
Predictions for Monaco Grand Prix
So here are the predictions for the first stint on the soft tyres for the top 10 on the starting grid. I’ll add the actual values after the race.
| Driver | Compound | Prediction (laps) | Actual (laps) |
|---|---|---|---|
| Charles Leclerc | C5 | 21,3 | 0 (DNS) |
| Max Verstappen | C5 | 22,0 | 35 |
| Valtteri Bottas | C5 | 22,1 | 29 |
| Carlos Sainz | C5 | 21,3 | 33 |
| Lando Norris | C5 | 19,4 | 31 |
| Pierre Gasly | C5 | 22,8 | 32 |
| Lewis Hamilton | C5 | 22,2 | 31 |
| Sebastian Vettel | C5 | 14,9 | 33 |
| Sergio Pérez | C5 | 22,0 | 36 |
| Antonio Giovinazzi | C5 | 23,8 | 35 |
| Esteban Ocon | C5 | 27,3 | 37 |
They seem pretty consistent with the predictions from the official Formula 1 website (which predicts 19-26 laps for the first stint). Let’s see how they match reality.
Update: I recalculated the predictions using the actual temperatures at the beginning of the race instead of the forecast, and added the actual values.
The results are quite a bit off. Monaco is a tricky and unpredictable track, and it looks like this year’s resurfacing is much gentler on the tyres. Let’s count it as a failure :)
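For anyone curious how the numbers in the table above are generated, here is a minimal sketch of producing a single prediction with the trained model from earlier. The driver inputs are illustrative, and converting the predicted distance back to laps using Monaco’s roughly 3,337 m lap length is my assumption about how the lap counts were derived.

// AutoML regression models emit the prediction in the "Score" column.
public class DistancePrediction
{
    [ColumnName("Score")]
    public float Distance { get; set; }
}

// Create a prediction engine from the trained model (names mirror the earlier snippets).
var predictionEngine = mlContext.Model
    .CreatePredictionEngine<DistanceTyreStint, DistancePrediction>(trainedModel);

// Example input for one driver (values illustrative).
var stint = new DistanceTyreStint
{
    Team = "Red Bull Racing",
    Car = "RB16B",
    Driver = "Max Verstappen",
    Compound = "C5",
    AirTemperature = 24f,
    TrackTemperature = 44f,
    Reason = "Pit Stop"
};

// The model predicts distance in metres; dividing by Monaco's ~3,337 m lap
// converts it to the lap counts shown in the table.
float predictedDistance = predictionEngine.Predict(stint).Distance;
float predictedLaps = predictedDistance / 3337f;
Console.WriteLine($"{stint.Driver}: {predictedLaps:0.0} laps");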
What’s next
Although we’ve made some progress, I think I’m reaching the limit of the current model. In the next part I’ll reorganize the code a bit: split training from inference and add some visualization. All of that is to support multiple models in the future and make it easier to see how they differ in performance. My goal is to be able to produce automated graphics similar to what you can find in posts like this and this.