Disclaimer: The F1 FORMULA 1 logo, F1 logo, FORMULA 1, F1, FIA FORMULA ONE WORLD CHAMPIONSHIP, GRAND PRIX and related marks are trademarks of Formula One Licensing BV, a Formula 1 company. All rights reserved. I’m just a dude in a basement with too much time on my hands.
This post is the second part of my approach to use ML.NET to make some useful predictions about strategies that teams will approach in Formula1. I highly suggest reading the first part, where I explain the approach I took.
And yes, this episode is also too late to make predictions for Imola and Portimão, but hopefully by Spanish Grand Prix will be all caught up.
We’re not gonna get great results here
If you read the previous part you’re aware that I didn’t get spectacular results in the first run. Unfortunately, we won’t improve much here, because we don’t have much data for those tracks. In the period I focus on (2017-2020) there was only one race on each of those tracks. Imola, although a legendary track and hallowed ground for Formula1, is quite outdated and fell out of grace in recent years. Portimão is a very new track, and Portugal GP promoters were unable to attract Formula 1. Last time we raced in Portugal it was in 1996 on a different track – Estoril. But 2020 was a weird year because of COVID-19, and due to travel restrictions F1 needed to find more tracks in Europe. And so we went to Imola and Portimão.
But there are opportunities!
So even though we don’t expect dramatic improvement adding new tracks gives some new view into the data. Previously, since we only used data from one track, I used the number of laps as our Label – a number we were predicting. Since now, we have one than more track with different lengths we have to find a new feature/label to use. That’s how Distance was added to the model – I calculate the TrackLength times Laps covered.
Changes in dataset
I haven’t added any new columns to the dataset and the feature describe above is calculated in code. But I added a lot of rows. My approach is that for each race I’ll be adding previous 2021 races, and all the races since 2017 on the current track. So by applying this rule, I’ve added Bahrain 2021 (previous race) and Imola and Portimão 2020 (previous races on the current track).
Changes in code
There are two major changes in the code since last iteration. With the 4 new races added, we nearly doubled our data row count (from 250ish to nearly 500). Now I can be a bit more selective which rows I take into calculation. In the first approach, we want to predict strategy in situations where everything goes well, since we cannot predict unpredictable (rain, accidents, mechanical failures).
Removing unnaturally shortened stints
That’s why I decided to filter out all the rows with stints that end with something else than “Pit Stop” (which is regular unforced change of tyres) or “Race Finish”.
This is the piece of code that does it:
// Load data
var data = mlContext.Data.LoadFromTextFile<TyreStint>(DatasetsLocation, ';', true);
// Filtering data
var filtered = mlContext.Data.FilterByCustomPredicate(data, (TyreStint row) => !(row.Reason.Equals("Pit Stop") || row.Reason.Equals("Race Finish")) );
var debugCount = mlContext.Data.CreateEnumerable<TyreStint>(filtered, reuseRowObject: false).Count(); //396
// Divide dataset into training and testing data
var split = mlContext.Data.TrainTestSplit(filtered, testFraction: 0.1);
var trainingData = split.TrainSet;
var testingData = split.TestSet;
FilterByCustomPredicate method from the
mlContext.Data. It takes a function which returns true for the rows we want to remove from the data.
The function itself is this small iniline piece of code:
(TyreStint row) => !(row.Reason.Equals("Pit Stop") || row.Reason.Equals("Race Finish"))
I inserted this in between loading the data from files, and splitting the data, because I want both test and training data to be consistent. This leaves us with 396 rows of data. Nice, that makes me happy.
Using distance instead of laps as a feature
The second change was connected to using Distance instead of Laps as our new Label. I had to create a new feature Distance, by multiplying two other features – Laps and Track Length. ML.NET has a long list of Transforms to support different operations, but what I had in mind was not supported. Fortunately you can also use Custom Mapping and provide it a function that will perform the transformation. I added at the beginning of our calculation pipeline.
var pipeline = mlContext.Transforms.CustomMapping((TyreStint input, CustomDistanceMapping output) => output.Distance = input.Laps * input.TrackLength, contractName: null)
.Append(mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: nameof(TransformedTyreStint.Distance)))
I also needed to recalculate distance back to laps at the end. I do it by again dividing the prediction by the distance of the current track.
prediction.Distance / race.Track.Distance
While looking at the code you’ll also notice that I refactored it a bit. I moved some classes into separate files. I also isolated some data and magic strings. Just general clean up, so step by step we move from one-file script mess into serious working application.
Just for the sake of showing progress of lack of, let’s look at some predictions. I’m not going to predict all tyres for all the drivers. Let’s just take the first 10 qualifiers from the both races and predict how long can they go on the tyres they’ve chosen to start on and compare it to what happened in reality.
The predictions for Imola ended up being pointless. On the morning of the race it rained and all the assigned tyres are cancelled in a situation like that. Teams have to start on either Intermediate or Wet compounds. In the track conditions on that day Wet tyres were a mistake. The track was dump, but there was no standing water. If you look at the table, you will notice that all drivers on Intermediates change tyres around lap 27/28. This is because teams wait for a perfect moment when the track is ready for dry tyres and usually first person blinking and changing to dries will trigger all the other teams to react.
The Portuguese Grand Prix didn’t provide any weather surprises. We had a dry race and could finally test our model. On some rows predictions were quite good. I marked in bold those, where we landed within 10% difference. Sergio’s result is an obvious outlier – he was used by the team to slow down Mercedes drivers to give a better chance to his teammate Max Verstappen. But he was the right man for the job. Sergio’s tyre management is excellent, definitely one of the best in the paddock.
|Sergio Pérez||C2||32,40||51||Sergio is a know “tyre whisperer” ;)|
Next up on the calendar is the Spanish Grand Prix. It’s a well known track to Formula 1 teams. They spend a week there near every year for pre-season testing. Teams like this track for its predictable surface and weather conditions. They know every inch of it and have it perfectly modelled in simulators. All that data makes racing here pathetically boring. But boring is good for predictions, right?
We’ll also take a stab at our model and for the first time try to improve it on top of just adding more data.
Until next time!