Goal Prediction Using AI

In this article, you’ll learn the basics of machine learning and get a sense of what it can do. I’ll walk through several machine learning algorithms applied to a simple binary classification problem, so the article is aimed at beginners who are new to machine learning. The task is to predict whether or not a shot by Cristiano Ronaldo results in a goal. Ronaldo is the famous Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. He is often considered the best player in the world and is widely regarded as one of the greatest players of all time.

Introduction

While watching any football (soccer) match, we want the team that we support to score a goal. We wait patiently until a player from our team gets close enough to the goal and makes a shot ... which seems to be going into the goal but misses, sadly. Can AI (Artificial Intelligence) predict and explain why that player missed the shot from that particular location using data analytics and machine learning?

In this article, we’ll experiment with different ML (machine learning) algorithms and teach an AI to predict if a player will score a goal or not.

Machine learning plays a key role in many different applications such as computer vision, data mining, natural language processing, speech recognition, and others. ML provides potential solutions in all the above-mentioned domains and more. It’s surely going to be a driving force in our future digital civilization.

Here, we’ll see how we can use different machine learning algorithms to build a simple binary classifier that predicts whether a goal is scored or not, based on the given input data. This project was done as part of a hackathon I participated in, which also provided the data set.

In the following section, we’ll go over some basic theory of different machine learning algorithms which we’ll be trying to code and apply in the later part of this article.

Some Theory

There are many machine learning algorithms to choose from. In this project, we’ll be using classification algorithms since we need to predict whether a goal is scored or not. Because the output has only two possible values, this is called binary classification.

If you have basic knowledge of different machine learning algorithms and types of classification in machine learning, feel free to skip the theory section.

Machine learning methods are commonly divided into two types: supervised and unsupervised. In supervised learning, the data set comes with the output (label) that each data point should produce. The algorithm learns the patterns that lead to each output and tries to generalize them to new data.

In other words, a supervised data set contains the output labels to be predicted, and the model learns to predict those values by repeatedly comparing its predictions against the known labels and adjusting itself. Unsupervised learning trains on the data without knowing which category each data point belongs to. It groups the data points based on the similarity of their features and effectively creates its own output labels during training.
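To make the difference concrete, here’s a minimal sketch on made-up data (the numbers below are invented for illustration and aren’t part of the project). The supervised model is given the labels, while the clustering model, k-means in this case, only sees the inputs and comes up with its own group numbers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up 1D data: two well-separated groups of points
X_demo = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y_demo = np.array([0, 0, 0, 1, 1, 1])  # labels, only used in the supervised case

# Supervised: the model is given both the inputs and the labels
clf = LogisticRegression().fit(X_demo, y_demo)
supervised_pred = clf.predict([[1.1], [5.1]])

# Unsupervised: the model only sees the inputs and invents its own group ids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cluster_ids = km.labels_

print(supervised_pred)
print(cluster_ids)
```

Note that the cluster ids are arbitrary: k-means may call the first group 0 or 1, since it never saw the real labels.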

Here, we’ll be focusing only on supervised learning. The methods we’ll experiment with are linear regression, logistic regression, random forest, and neural networks. Linear regression predicts a continuous value rather than a class, but it’s included here for comparison.

To understand how these algorithms work, check the following links which offer great explanations and can help you to understand the workings of these algorithms:

Linear Regression: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86
Logistic Regression: https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
Random Forest: https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/#
Neural Networks: https://towardsdatascience.com/first-neural-network-for-beginners-explained-with-code-4cfd37e06eaf

Let’s start with understanding the inner workings and the approach for our project of predicting if goals are scored or not. Before we start with any of the code, the first thing we should do is to go through the data set thoroughly.

We must understand the data set completely and try to figure out the most important features from the data set which we’ll be using for training our model. Sampling and extracting wrong features might sometimes lead to inaccuracies in your model.
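One rough way to get a first impression of which features matter is to look at how strongly each one correlates with the output label. The sketch below uses a small made-up table (the column names follow the data set, but the values are invented), so treat it as an illustration of the idea rather than an analysis of the real data:

```python
import pandas as pd

# Made-up shots; column names follow the article, values are invented
shots = pd.DataFrame({
    "distance_of_shot": [5, 8, 12, 20, 25, 30, 35, 40],
    "power_of_shot":    [3, 1, 4, 2, 5, 1, 3, 2],
    "is_goal":          [1, 1, 1, 0, 1, 0, 0, 0],
})

# Pearson correlation of every feature with the label
corr_with_goal = shots.corr()["is_goal"].drop("is_goal")
print(corr_with_goal.sort_values())
```

A correlation close to zero doesn’t prove a feature is useless, since it only captures linear relationships, but a strong correlation is a good hint that the feature is worth keeping.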

Prerequisites

The following packages will be used in the development of this project. Also, we’ll be using Python 3.6 here, but any version above 3.6 should be fine to use.

Sklearn: Machine learning library.
Pandas: Library used for importing the csv files and parsing the columns.
Numpy: Library used for storing the training data in an array. Numpy arrays are most widely used to store the training data and the sklearn library accepts the input data in the form of numpy arrays.

If any of these libraries are not present on your system, just pip install that particular library. Note that sklearn is installed under the package name scikit-learn. The guide to using pip is at https://www.w3schools.com/python/python_pip.asp.

The following shows the code snippet with all the libraries to be imported:

import math

import numpy as np
import pandas as pd
import scipy
from scipy.stats import spearmanr
from sklearn import preprocessing, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, scale

Dataset Preprocessing

The data set used here is a csv file with information about different matches played by Real Madrid against different teams. It contains fields describing Ronaldo’s shots in different situations, such as the time of the shot, the distance from the goal, the shot power, and the angle of the shot. As mentioned before, this data set was given to me during a hackathon, so I’m not sure if it’s publicly available on Kaggle or any other website. Either way, the data set is included in the article downloads or you can get it from https://github.com/rajatkeshri/ZS-HACK-predict-ronaldo-goal.

Figure 1 shows some of the columns for the data set.

FIGURE 1. Dataset in csv format.

From the data set, we’ll be using the following columns as input data: location_x, location_y, power_of_shot, distance_of_shot, and remaining_sec. The is_goal column is the output label. You can use the other fields for training the model and experimenting with it, but in this article, I’ll just explain the use of the input columns listed above.

You’ll notice there are many fields which are empty and there’s a lot of noise in this data set. Our first step is to go through the data set and fix the noisy data and remove all the empty fields.
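Before removing anything, it helps to see how many values are actually missing. Here’s a small sketch using pandas on a made-up table (the column names follow the data set, but the values are invented):

```python
import numpy as np
import pandas as pd

# Made-up rows with a few missing values (NaN)
shots = pd.DataFrame({
    "location_x":    [10.0, np.nan, -4.0, 7.0],
    "power_of_shot": [3.0, 2.0, np.nan, 1.0],
    "is_goal":       [1.0, 0.0, np.nan, 0.0],
})

# Number of missing values in each column
missing_per_column = shots.isna().sum()

# Keep only the rows in which every column has a value
complete_rows = shots.dropna()

print(missing_per_column)
print(len(shots), "rows before,", len(complete_rows), "rows after")
```

Running the same two calls on the real data.csv shows how much data is lost by dropping incomplete rows.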

Let’s jump to the code now. First, we open the data set csv file using pandas and define the columns which act as the input data and the columns which are the output labels for training.

Here, column 10 is the “is_goal” column which acts as the output label, and columns 2, 3, 4, 5, and 9 are the input features location_x, location_y, remaining_sec, power_of_shot, and distance_of_shot, respectively. If you want to try training your model with other feature columns from data.csv, just add those column numbers to the array.

datasets = pd.read_csv('data.csv')

# Column 10 is the output label (is_goal)
cols = [10]
output = datasets[datasets.columns[cols]]

# Columns 2, 3, 4, 5, and 9 are the input features
cols = [2, 3, 4, 5, 9]
df = datasets[datasets.columns[cols]]

Once we have read the features from the csv file, we must remove all the noise from these columns. For each of the columns we use, we check whether a value is NaN. If it is, we drop the entire row, which removes the noisy data point.

Once the noisy data is removed, we store the 2D array of input features in the variable X and the output labels in the variable Y:

# Input features, in the order used for training
feature_cols = ["location_x", "location_y", "remaining_sec", "power_of_shot", "distance_of_shot"]

# Work on the input features plus the is_goal label, selected by name
df = datasets[feature_cols + ["is_goal"]]

# Remove every row in which the output label or any input feature is not defined (NaN)
df = df.dropna()

# X holds the five input features; Y holds the is_goal label
X = df[feature_cols].values
Y = df["is_goal"].values

Now we have our clean data set. The next step is to split the entire data into train and test data. This is done so that we train our model on the train data and then test it for its accuracy and score on the test data. This will help us understand where our model stands in predictions and can thus help us in tweaking the model.

To split into train and test data, we use the function train_test_split which is imported from the sklearn library. The test_size argument sets the fraction of the data held out as test data; 0.2 means 20%. The random_state argument is just a seed for the shuffling, so fixing it makes the split reproducible:

(X_train, X_test, Y_train, Y_test) = train_test_split(X, Y, test_size=0.2, random_state=0)
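As a quick check of what each argument of train_test_split controls, here’s a sketch on a tiny made-up array (the names X_demo and y_demo are just for illustration). With 10 samples and test_size=0.2, two samples end up in the test set, and repeating the call with the same random_state gives the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features each
y_demo = np.array([0, 1] * 5)

# test_size sets the fraction held out; random_state seeds the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same seed, same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te_again))
```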

Training

Okay. We’ve finished understanding the data set and cleaned it with some pre-processing. The only step left is training the models. As mentioned earlier, we’ll train linear regression, logistic regression, random forest, and neural network models. These algorithms are available directly within the sklearn library, and that is what we’ll be using.

First, we create objects of different machine learning algorithms and then pass our input features with output labels to them. These algorithms generalize upon the data we feed them.

To train these models, we call “.fit” on these objects. For more information on each of the machine learning algorithm classes, refer to https://scikit-learn.org/stable.

# Create one object per algorithm
LR = LinearRegression()
Lr = LogisticRegression(random_state=0, solver='lbfgs')
RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

# Train each model on the training data
LR.fit(X_train, Y_train)  # linear regression
Lr.fit(X_train, Y_train)  # logistic regression
RF.fit(X_train, Y_train)  # random forest
NN.fit(X_train, Y_train)  # neural network (multi-layer perceptron)

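Once the training is complete, we check how the different models have performed by calling the “.score” method on the test data. For the three classifiers (logistic regression, random forest, and the neural network), .score returns the mean accuracy. For linear regression, it returns the R² coefficient of determination instead, so that number isn’t directly comparable with the other three: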

print(LR.score(X_test, Y_test))  # linear regression (R^2, not accuracy)
print(Lr.score(X_test, Y_test))  # logistic regression
print(RF.score(X_test, Y_test))  # random forest
print(NN.score(X_test, Y_test))  # neural network
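Because .score means different things for regressors and classifiers in sklearn, it’s worth seeing the difference once. The sketch below uses randomly generated data (not the Ronaldo data set) and, for brevity, scores on the training data. It also shows one possible way to turn the linear regression output into an accuracy figure by thresholding at 0.5, which is my own assumption and not something the project code does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Randomly generated demo data with a 0/1 label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

reg = LinearRegression().fit(X_demo, y_demo)
clf = LogisticRegression().fit(X_demo, y_demo)

# Classifier: .score is the fraction of correct predictions (accuracy)
clf_score = clf.score(X_demo, y_demo)

# Regressor: .score is the R^2 coefficient of determination
reg_score = reg.score(X_demo, y_demo)

# Assumed workaround: threshold the continuous output to get 0/1 labels
reg_labels = (reg.predict(X_demo) >= 0.5).astype(int)
reg_accuracy = accuracy_score(y_demo, reg_labels)

print("logistic regression accuracy:", clf_score)
print("linear regression R^2:", reg_score)
print("linear regression accuracy after thresholding:", reg_accuracy)
```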

Also, if we want to check the predictions on our test data or give new data values and predict whether a goal is scored or not, we can do it by using the “.predict” method.

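For the three classifiers, the output of the predict method is either 1 or 0, where 1 stands for “Yes, he scored a goal!” and 0 stands for “Hard luck, he will definitely score next time.” Linear regression returns a continuous value instead, which can be thresholded (for example, at 0.5) to obtain a 1 or 0.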

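# Custom shot; the feature order must match the training columns
loc_x = 10
loc_y = 12
remaining_time = 20
distance = 32
power_of_shot = 3
custom_input = [[loc_x, loc_y, remaining_time, power_of_shot, distance]]

# Predictions on the test data
print(LR.predict(X_test))
print(Lr.predict(X_test))
print(RF.predict(X_test))
print(NN.predict(X_test))

# Prediction for the custom shot
print(Lr.predict(custom_input))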
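If a plain 1 or 0 isn’t enough, sklearn classifiers such as LogisticRegression also offer predict_proba, which returns an estimated probability for each class. Here’s a minimal sketch on made-up data with a single feature (an invented shot distance), just to show the shape of the output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: close shots are goals (1), far shots are misses (0)
X_demo = np.array([[5.0], [8.0], [10.0], [12.0], [25.0], [28.0], [30.0], [35.0]])
y_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])
model = LogisticRegression().fit(X_demo, y_demo)

shot = [[9.0]]  # a made-up close-range shot
label = model.predict(shot)[0]

# One probability per class, ordered like model.classes_
proba = model.predict_proba(shot)[0]

print(model.classes_)
print(label, proba)
```

The same call should work on the trained logistic regression, random forest, and neural network models from this article, but not on linear regression, which has no predict_proba method.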

Results

First of all, congratulations! We have successfully built a binary classification model that predicts whether Ronaldo scores a goal or not. On this basic binary classification problem, logistic regression performed the best with approximately 95% accuracy, while the other machine learning models scored approximately 60–70%. These scores might be improved by adding more features to the training of the model.

I hope you enjoyed this article. Cheers!

The entire project code is included in the article downloads and can also be found at https://github.com/rajatkeshri/ZS-HACK-predict-ronaldo-goal.