Model evaluation

This example illustrates how to evaluate a model's performance on historical soccer data.

# Author: Georgios Douzas <gdouzas@icloud.com>
# Licence: MIT

import numpy as np
from sportsbet.datasets import SoccerDataLoader
from sklearn.neighbors import KNeighborsClassifier

Extracting the training data

We extract the training data for the Spanish league. We also remove any missing values and select the market maximum odds.

dataloader = SoccerDataLoader(param_grid={'league': ['Spain']})
X_train, Y_train, Odds_train = dataloader.extract_train_data(
    drop_na_thres=1.0, odds_type='market_maximum'
)

Out:

Football-Data.co.uk: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

The input data:

X_train
home_team away_team league division year home_team_soccer_power_index away_team_soccer_power_index home_team_probability_win away_team_probability_win probability_draw home_team_projected_score away_team_projected_score match_quality
date
2016-08-19 La Coruna Eibar Spain 1 2017 66.52 62.29 0.5003 0.2260 0.2738 1.47 0.79 64.335545
2016-08-19 Malaga Osasuna Spain 1 2017 72.57 56.93 0.5475 0.1897 0.2628 1.56 0.70 63.805561
2016-08-19 La Coruna Eibar Spain 1 2017 66.52 62.29 0.5003 0.2260 0.2738 1.47 0.79 64.335545
2016-08-19 Malaga Osasuna Spain 1 2017 72.57 56.93 0.5475 0.1897 0.2628 1.56 0.70 63.805561
2016-08-20 Barcelona Betis Spain 1 2017 96.35 69.95 0.9591 0.0071 0.0337 3.40 0.42 81.054510
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-10-28 Granada Getafe Spain 1 2022 58.45 65.57 0.3631 0.3127 0.3242 1.05 0.95 61.805620
2021-10-28 Celta Sociedad Spain 1 2022 71.68 78.29 0.3206 0.3957 0.2837 1.17 1.34 74.839331
2021-10-28 Levante Ath Madrid Spain 1 2022 63.23 84.94 0.1873 0.5664 0.2463 0.91 1.77 72.494516
2021-10-28 Granada Getafe Spain 1 2022 58.45 65.57 0.3631 0.3127 0.3242 1.05 0.95 61.805620
2021-12-20 Levante Valencia Spain 1 2022 60.85 72.40 0.3228 0.4059 0.2713 1.26 1.45 66.124428

18815 rows × 13 columns



The targets:

Y_train
       away_win__full_time_goals  draw__full_time_goals  home_win__full_time_goals  over_2.5__full_time_goals  under_2.5__full_time_goals
0                          False                  False                       True                       True                       False
1                          False                   True                      False                      False                        True
2                          False                  False                       True                       True                       False
3                          False                   True                      False                      False                        True
4                          False                  False                       True                       True                       False
...                          ...                    ...                        ...                        ...                         ...
18810                      False                   True                      False                      False                        True
18811                      False                  False                       True                      False                        True
18812                      False                  False                       True                       True                       False
18813                      False                  False                       True                       True                       False
18814                       True                  False                      False                       True                       False

18815 rows × 5 columns



Splitting the data

We split the data chronologically. Since the observations are already sorted by date, we keep the first 80% as training data and the remaining 20% as testing data.

ind = int(len(X_train) * 0.80)
X_test, Y_test, Odds_test = X_train[ind:], Y_train[ind:], Odds_train[ind:]
X_train, Y_train = X_train[:ind], Y_train[:ind]
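The same chronological split can be sketched on toy data. The `chronological_split` helper and the sample dates below are hypothetical and not part of sportsbet; the point is only that nothing is shuffled, so every test row is more recent than every training row.

```python
from datetime import date


def chronological_split(rows, train_fraction=0.8):
    """Split rows that are already sorted by date, without shuffling."""
    ind = int(len(rows) * train_fraction)
    return rows[:ind], rows[ind:]


# Hypothetical match dates, sorted in ascending order.
match_dates = [date(2021, 1, day) for day in range(1, 11)]
train_dates, test_dates = chronological_split(match_dates)

# Sizes of the two parts, and a check that training precedes testing.
print(len(train_dates), len(test_dates))
print(max(train_dates) < min(test_dates))
```

Splitting by position rather than at random avoids evaluating the model on matches that were played before some of the matches it was trained on.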

Training a multi-output classifier

We train a KNeighborsClassifier on the numerical features of the input data only, using the extracted targets as a multi-output target.

num_features = [
    col
    for col in X_train.columns
    if X_train[col].dtype in (np.dtype(int), np.dtype(float))
]
clf = KNeighborsClassifier()
clf.fit(X_train[num_features], Y_train)

Out:

KNeighborsClassifier()
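One caveat about the dtype check above: `np.dtype(int)` is NumPy's default integer type, whose width can depend on the platform and NumPy version, so a column stored with a different integer width would be skipped. A possible alternative, sketched here with made-up arrays in a plain dict standing in for the DataFrame columns, is `np.issubdtype`. The `is_derby` column is hypothetical and only shows that booleans are excluded.

```python
import numpy as np

# Toy columns standing in for X_train (values are made up for illustration).
columns = {
    "home_team": np.array(["Malaga", "Barcelona"], dtype=object),
    "year": np.array([2017, 2017], dtype=np.int32),
    "match_quality": np.array([63.8, 81.1], dtype=np.float64),
    "is_derby": np.array([False, True]),  # hypothetical boolean column
}

# np.issubdtype matches any integer or floating width and excludes
# object and boolean columns.
num_features = [
    name
    for name, values in columns.items()
    if np.issubdtype(values.dtype, np.number)
]
print(num_features)
```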

Estimating the value bets

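We can estimate the value bets by using the fitted classifier. A bet is considered a value bet when the predicted probability multiplied by the odds is greater than 1.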

Y_pred_prob = np.concatenate(
    [prob[:, 1].reshape(-1, 1) for prob in clf.predict_proba(X_test[num_features])],
    axis=1,
)
value_bets = Y_pred_prob * Odds_test > 1
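For a multi-output problem, `predict_proba` returns a list with one `(n_samples, n_classes)` array per target, which is why the code above keeps column 1 of each array. The sketch below reproduces that step with made-up probabilities and odds, assuming each target has both classes present so that the columns are ordered as `[False, True]`.

```python
import numpy as np

# Made-up stand-in for clf.predict_proba(...) with two binary targets:
# one (n_samples, 2) array per target, columns assumed ordered [False, True].
proba_per_target = [
    np.array([[0.4, 0.6], [0.8, 0.2]]),  # e.g. home win
    np.array([[0.7, 0.3], [0.5, 0.5]]),  # e.g. draw
]

# Keep the positive-class column of each target: shape (n_samples, n_targets).
y_pred_prob = np.column_stack([prob[:, 1] for prob in proba_per_target])

# Made-up decimal odds with the same shape.
odds = np.array([[1.5, 4.0], [6.0, 1.9]])

# The expected return of a unit stake is probability * odds;
# a value above 1 marks a value bet.
value_bets = y_pred_prob * odds > 1
print(value_bets)
```

In this toy case only the draw in the first match and the home win in the second match qualify, since their expected returns are about 1.2 while the other two are below 1.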

We assume a stake of one unit on every value bet. The mean profit is then computed as follows, where outcomes that are not value bets contribute zero to the mean:

profit = np.nanmean((Y_test.values * Odds_test.values - 1) * value_bets.values)
profit

Out:

0.09769545575338826
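Note that the expression above averages over every cell of the test matrix, not only over the bets that were placed. The following sketch, with made-up outcomes, odds and value-bet mask, contrasts that quantity with the mean over placed bets only. The variable names are illustrative.

```python
import numpy as np

# Made-up outcomes, decimal odds and value-bet mask for 2 matches x 2 targets.
y_true = np.array([[True, False], [False, True]])
odds = np.array([[2.0, 3.0], [4.0, 2.5]])
value_bets = np.array([[True, True], [False, True]])

# Return of a unit stake: odds - 1 if the outcome occurred, otherwise -1.
returns = y_true * odds - 1

# Average over all cells; cells without a bet contribute 0.
mean_over_all = np.nanmean(returns * value_bets)

# Average over the placed bets only.
mean_per_bet = np.nanmean(returns[value_bets])

print(mean_over_all, mean_per_bet)
```

Here the three placed bets return 1.0, -1.0 and 1.5, so the mean over placed bets is 0.5, while the mean over all four cells is 0.375.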
