INTRODUCTION

As sports culture has continued to grow globally, so has interest in using sports data analytics to predict game results. The World Cup is the most viewed tournament in the world, and the ability to predict its outcomes is in high demand among sports statisticians, soccer fans, and gamblers. It attracts the highest volume of betting of any sporting event, with approximately $155 billion gambled on the 2018 World Cup. With such large sums of money at stake, we decided to study existing World Cup data to see if we could build models that use previous tournament data to predict future outcomes. Additionally, only eight teams have ever won the World Cup: Brazil (5 titles), Germany (4), Italy (4), Argentina (2), France (2), Uruguay (2), England (1), and Spain (1). We wanted to explore whether prediction models would always predict one of these eight teams as winners (of individual knockout rounds or of the tournament as a whole), or whether past titles are not necessarily indicative of countries with consistently strong soccer teams.

After exploring our data, our group was most interested in two questions. The first was: “Does the comeback rate of teams in the group stages affect the success of teams once they are in the knockout stage?” Each World Cup tournament is separated into two stages: the group stage and the knockout stage. The group stage consists of eight groups of four teams, and each team plays the other three in its group once, for six games per group. The top two teams from each group move on to the knockout stage, a single-elimination tournament beginning with the round of 16 and continuing through to the final, championship match. This design, a group stage followed by knockout rounds, has been in place since after 1950. We developed this first question from the set-up of the World Cup and from the theory that a high comeback rate indicates a strong team, one that may have had a poor first half but is resilient enough to turn the game around and win. We wanted to test this theory and hoped to uncover patterns that highlight the strength of a team based on its comeback rate in the group stage. Such patterns would be very useful for predicting the knockout-stage success of teams from their group-stage performance.

While we were interested in uncovering a relationship between group and knockout stage performance, we also wanted to build a predictive model that would consider multiple factors over time. Because the World Cup structure has changed since its first tournament in 1930, we believe it is important to consider group stage performance over time. Therefore, our second question was: “Given the factors we have considered (time; attendance; and home vs away team goals, half-time result, and final result), can we predict if a team will make it to the knockout-stage based on the features of group stage games?” Through this question, we explored how accurately a team’s advancement to the knockout stage can be predicted from variables deemed strong determinants. While our exploratory analysis suggested that success in the knockout stage was not a good predictor of winning the final, it is important to consider that fewer matches are played in the knockout stage than in the group stage. We believe it is more meaningful to explore the way group stage variables over time affect teams’ chances of advancing to the higher-stakes games. This model could be useful in determining whether group stage factors are an accurate predictor of success in the knockout stage.

DATA

This dataset contains information on the FIFA World Cups and was posted on Kaggle by Andre Becklas, who found the data on the FIFA World Cup Archive website. Out of 20 variables, the relevant ones are year, date, time, stage, city, stadium, home team name, home team goals, away team goals, away team name, attendance, half-time home goals, and half-time away goals. There are 836 observations in this dataset. Beginning with 1930, it covers every year the World Cup was held, although information about the 1950 final is missing. There are no observations for 1942 and 1946, as those tournaments were cancelled due to World War II. Because of the missing data and fundamental changes in the structure of the tournament that could skew our models, we filtered the data to include only the tournaments after 1950.

The dataset shows that the World Cup is primarily held in May, June, or July, with game times between 11:30am and 10pm. The stage variable describes the stage of the tournament in which a game was played, with group stages being the initial matches. The preliminary round appears only in the 1934 tournament. The group stage labels change from letter variations to number variations in the year 1974. The next stages include the round of 16 (also known as first round), quarter-finals, semi-finals, and final. The match for third place is played between the two losers of the semi-finals. The city and stadium variables give the name of the city and stadium where the match was played. The stadium variable was important in initial data analysis, as it required web scraping for further data to determine the stadium capacity. Home team name and away team name are the nations those teams represent. The home team plays against the away team, but the location of the actual game has no effect on which team is labeled home or away. The final results are recorded as home team goals and away team goals, and the half-time results as half-time home goals and half-time away goals. Attendance varies between 2,000 and 173,850 fans. It is important to note that attendance does not account for the stadium capacity or the percentage of the capacity filled. The table below is a condensed version of the original data, showing only the final matches in which the home team had scored zero goals at half-time.

Year | Datetime | Stage | Stadium | City | Home Team Name | Home Team Goals | Away Team Goals | Away Team Name | Attendance | Half-time Home Goals | Half-time Away Goals
1934 | 10 Jun 1934 - 17:30 | Final | Nazionale PNF | Rome | Italy | 2 | 1 | Czechoslovakia | 55000 | 0 | 0
1966 | 30 Jul 1966 - 15:00 | Final | Wembley Stadium | London | England | 4 | 2 | Germany FR | 96924 | 0 | 0
1978 | 25 Jun 1978 - 15:00 | Final | El Monumental - Estadio Monumental Antonio Vespuci | Buenos Aires | Argentina | 3 | 1 | Netherlands | 71483 | 0 | 0
1982 | 11 Jul 1982 - 20:00 | Final | Santiago Bernabeu | Madrid | Italy | 3 | 1 | Germany FR | 90000 | 0 | 0
1990 | 08 Jul 1990 - 20:00 | Final | Stadio Olimpico | Rome | Germany FR | 1 | 0 | Argentina | 73603 | 0 | 0
1994 | 17 Jul 1994 - 12:30 | Final | Rose Bowl | Los Angeles | Brazil | 0 | 0 | Italy | 94194 | 0 | 0
1998 | 12 Jul 1998 - 21:00 | Final | Stade de France | Saint-Denis | Brazil | 0 | 3 | France | 80000 | 0 | 2
2002 | 30 Jun 2002 - 20:00 | Final | International Stadium Yokohama | Yokohama | Germany | 0 | 2 | Brazil | 69029 | 0 | 0
2006 | 09 Jul 2006 - 20:00 | Final | Olympiastadion | Berlin | Italy | 1 | 1 | France | 69000 | 0 | 0
2010 | 11 Jul 2010 - 20:30 | Final | Soccer City Stadium | Johannesburg | Netherlands | 0 | 1 | Spain | 84490 | 0 | 0
2014 | 13 Jul 2014 - 16:00 | Final | Estadio do Maracana | Rio De Janeiro | Germany | 1 | 0 | Argentina | 74738 | 0 | 0

As we worked through these two questions, we created new variables from the ones we were given. We created Home Team Result and Away Team Result, as well as the corresponding half-time results, each labeled win, loss, or tie. We created a “Goal Differential” variable, which is the absolute value of the difference in goals between the competing teams. We also created variables to indicate individual country success: a win percentage for the finals and a win percentage for the tournament as a whole, to better understand which teams have found true success. Finally, we created a variable to represent a team’s comeback rate. To measure it, we isolated the matches in which a team was losing at half-time but went on to win the game, counted them, and divided by the number of all matches in which that team was losing at half-time. The resulting proportion is the team’s “comeback rate”.
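The derived variables described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the dictionary keys, the function names, and the three sample matches between teams "A", "B", and "C" are all hypothetical, and the choice to assign a rate of 0 to teams that never trailed at half-time is an assumption.

```python
def result(goals_for, goals_against):
    """Label a score line as 'win', 'loss', or 'tie' from one team's view."""
    if goals_for > goals_against:
        return "win"
    if goals_for < goals_against:
        return "loss"
    return "tie"


def goal_differential(home_goals, away_goals):
    """Absolute difference in goals between the two teams."""
    return abs(home_goals - away_goals)


def comeback_rates(matches):
    """Return {team: comeback rate in percent}.

    Rate = matches won after trailing at half-time
           / all matches in which the team trailed at half-time.
    Assumption: a team that never trailed at half-time gets 0.0.
    """
    trailing = {}   # team -> matches in which it was losing at half-time
    comebacks = {}  # team -> those matches it went on to win
    for m in matches:
        # View each match once from the home side and once from the away side.
        sides = [
            (m["home"], m["ht_home"], m["ht_away"], m["home_goals"], m["away_goals"]),
            (m["away"], m["ht_away"], m["ht_home"], m["away_goals"], m["home_goals"]),
        ]
        for team, ht_for, ht_against, ft_for, ft_against in sides:
            trailing.setdefault(team, 0)
            comebacks.setdefault(team, 0)
            if result(ht_for, ht_against) == "loss":
                trailing[team] += 1
                if result(ft_for, ft_against) == "win":
                    comebacks[team] += 1
    return {
        team: (100.0 * comebacks[team] / n if n else 0.0)
        for team, n in trailing.items()
    }


# Hypothetical matches, for illustration only.
sample_matches = [
    {"home": "A", "away": "B", "home_goals": 2, "away_goals": 1, "ht_home": 0, "ht_away": 1},
    {"home": "A", "away": "C", "home_goals": 0, "away_goals": 1, "ht_home": 0, "ht_away": 1},
    {"home": "B", "away": "C", "home_goals": 1, "away_goals": 1, "ht_home": 1, "ht_away": 0},
]
rates = comeback_rates(sample_matches)
print(rates)
```

In this toy data, team A trailed at half-time twice and won one of those matches, so its comeback rate is 50 percent, while B never trailed and C trailed once without winning.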

We then created datasets that split the observations into group stages and knockout stages. The knockout stages consist of the round of 16, match for third place, quarter-finals, semi-finals, and final. We analyzed the knockout stages in further depth because our initial assessment produced an interesting result: only eight teams have ever won the FIFA World Cup. We wanted to see if winning the title was the best determinant of a quality team, as some teams, specifically England and Spain, have won only once. Other teams, such as the Netherlands, have won multiple knockout stage matches but have never won the final of the World Cup. The Netherlands have won more knockout stage games than Spain and Uruguay and are tied with France in the number of knockout stage games won, and yet they have never won the World Cup. The graph below depicts the knockout round participation of teams who have made it that far (some years did not include a round of 16, which is why some countries appear to have never played in this round but have played in later knockout rounds). These findings motivated and formed the basis of both of our questions, as we aimed to determine and predict which teams would win in the future.

RESULTS

In tackling our first question, we created a comeback rate variable representing the percentage of games where a country was losing at half-time but won the match itself. We also created two win percentage variables - one for wins in all knockout stage matches, and one for winning the final match. These two win percentage variables were how we chose to measure success in the knockout rounds. Next, we created a table, as shown below, that displayed each country with their comeback rate, total win percentage, and finals win percentage. From there, we ran two linear regressions. The first regression examined the relationship between group stage comeback rate and win percentage in the knockout stage, and the second examined the relationship between a group stage comeback rate and win percentage of the final, championship match. We also examined the correlations between each set of variables and produced graphs to summarize each relationship.

Country | Total Win Percentage (%) | Finals Win Percentage (%) | Comeback Rate (%)
Argentina | 47.62 | 9.52 | 0.00
Austria | 66.67 | 0.00 | 7.69
Belgium | 22.22 | 0.00 | 7.14
Brazil | 65.71 | 11.43 | 11.54
Bulgaria | 20.00 | 0.00 | 0.00
Cameroon | 50.00 | 0.00 | 0.00
Chile | 33.33 | 0.00 | 0.00
Colombia | 33.33 | 0.00 | 0.00
Costa Rica | 0.00 | 0.00 | 28.57
Czechoslovakia | 60.00 | 0.00 | 0.00
Denmark | 25.00 | 0.00 | 0.00
England | 44.44 | 5.56 | 0.00
France | 57.89 | 5.26 | 0.00
Germany | 63.64 | 9.09 | 7.14
Hungary | 40.00 | 0.00 | 0.00
Ireland | 0.00 | 0.00 | 0.00
Italy | 63.64 | 4.55 | 0.00
Japan | 0.00 | 0.00 | 0.00
Korea Republic | 20.00 | 0.00 | 7.69
Mexico | 11.11 | 0.00 | 6.25
Morocco | 0.00 | 0.00 | 0.00
Netherlands | 44.44 | 0.00 | 15.38
Nigeria | 0.00 | 0.00 | 0.00
Paraguay | 0.00 | 0.00 | 14.29
Poland | 50.00 | 0.00 | 0.00
Portugal | 37.50 | 0.00 | 0.00
Romania | 25.00 | 0.00 | 16.67
Saudi Arabia | 0.00 | 0.00 | 0.00
Senegal | 50.00 | 0.00 | 0.00
Soviet Union | 14.29 | 0.00 | 0.00
Spain | 50.00 | 8.33 | 17.65
Sweden | 44.44 | 0.00 | 0.00
Switzerland | 0.00 | 0.00 | 8.33
Uruguay | 21.43 | 0.00 | 7.14
USA | 20.00 | 0.00 | 0.00
Yugoslavia | 25.00 | 0.00 | 0.00

Our first linear regression examined the relationship between comeback rate in the group stage and win percentage in the knockout stage. While we had hoped to see a relationship that might indicate that comeback rate in the group stages is a good indicator of future success, the results were not statistically significant. The p-value for comeback rate was notably high, at 0.567. Additionally, the correlation between the two variables was -0.099. The correlation’s proximity to 0 further supports our finding that there is essentially no relationship between comeback rate in the group stages and win percentage in the knockout round. Furthermore, the graph below visually depicts this lack of a relationship.
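The link between the reported correlation and the regression p-value can be checked with a short calculation. In a simple linear regression with one predictor, the t-statistic for the slope equals r * sqrt(n - 2) / sqrt(1 - r^2), where r is the Pearson correlation. The sketch below assumes n = 36, the number of countries listed in the table above; the function names are illustrative and not taken from the original analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope_t_statistic(r, n):
    """t-statistic for the slope of a one-predictor linear regression,
    computed from the correlation r and the sample size n (n - 2 d.o.f.)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Reported correlation for comeback rate vs. knockout win percentage;
# n = 36 is an assumption based on the number of rows in the table.
t = slope_t_statistic(-0.099, 36)
print(round(t, 2))  # -0.58
```

A t-statistic of about -0.58 on 34 degrees of freedom is far from any conventional significance threshold, which is roughly in line with the reported p-value of 0.567.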

Our second linear regression investigated comeback rate in the group stage again, but now studied its relationship with win percentage in the final round of the World Cup. We again hoped to see a relationship to potentially suggest that the “top 8” teams who hold World Cup titles played particularly well, especially when losing at half-time. However, once again the results were not statistically significant. The p-value was slightly lower, but still remarkably high, at 0.4122. The correlation was slightly higher, but still indicated essentially no relationship at 0.14, visually depicted in the graph below.

Ultimately, neither the linear regressions nor the correlations indicated a relationship between comeback rate in the group stage and a team’s success in the knockout round. Unfortunately, this means that comeback rate is not a good predictor of a team’s win percentage in the knockout stages. While we had hoped to see a relationship, there are reasonable explanations for the lack thereof. First, the low-scoring nature of soccer makes coming back from a half-time deficit both difficult and rare. Additionally, stronger teams are rarely losing at half-time in the first place. Many excellent teams in the data have a comeback rate of 0, but this is likely because they are rarely, if ever, behind at half-time, especially in the group stages, where weaker teams have yet to be eliminated. A comeback rate of 0, in general, does not necessarily indicate playing well or poorly, but it does prevent us from analyzing how teams perform in the high-pressure situation of trying to come back from a half-time deficit. In the same vein, weaker teams rarely come back from a half-time deficit, both because of their weakness and because of the low-scoring nature of the game. While comeback rate initially showed potential to be a strong indicator of knockout round performance, further exploration refuted this.

For our second question, we hoped to better understand which factors best predict whether a team reaches the knockout stage, using variables from group stage matches. We decided to run two models and determine which better predicts the teams that advance. We started with a backwards selection method, which begins with the full model and removes one variable at a time until it reaches the lowest AIC. This method retained time, attendance, and home team result as the predictors for determining whether a team makes it to the round of 16. We then created a graph to show the relationship between time and attendance when determining whether a team made the round of 16. We also created a graph to depict a correlation between home team result and time shown below.
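The backward selection procedure described above can be sketched as a greedy loop that repeatedly drops the variable whose removal lowers AIC the most and stops when no removal helps. This is a generic illustration, not the routine used for the analysis: the `aic` argument is any function that fits a model on a list of features and returns its AIC, and the toy scoring function with features named `strong_1` to `weak_2` and their "gain" values is entirely made up so that the example runs on its own.

```python
def backward_select(features, aic):
    """Greedy backward elimination driven by AIC.

    `aic(feature_list)` must return the AIC of a model using those features.
    Returns the retained features and the AIC of the final model.
    """
    current = list(features)
    best = aic(current)
    while len(current) > 1:
        # Score every model that omits exactly one of the current features.
        candidates = [
            (aic([f for f in current if f != drop]), drop) for drop in current
        ]
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best:
            break  # no single removal improves AIC, so stop
        current.remove(drop)
        best = candidate_aic
    return current, best


# Toy stand-in for model fitting (made-up numbers, illustration only):
# each feature adds a fixed amount to the log-likelihood.
GAIN = {"strong_1": 6.0, "strong_2": 5.0, "strong_3": 4.0, "weak_1": 0.5, "weak_2": 0.2}


def toy_aic(feature_list):
    log_likelihood = -100.0 + sum(GAIN[f] for f in feature_list)
    # AIC = 2k - 2 ln(L), with k the number of features in the model.
    return 2 * len(feature_list) - 2 * log_likelihood


selected, final_aic = backward_select(list(GAIN), toy_aic)
print(selected, final_aic)
```

With these invented gains, the two weak features are removed because each costs more in the 2k penalty than it adds to the likelihood, and the three strong features are kept.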

We then ran a lasso model, which shrinks the coefficients toward 0 and sets the least useful ones exactly to 0, leaving only the variables with nonzero coefficients. This model selected away team goals, home team result, and home half result as the most important variables. These results were very interesting to us, because the assignment of a team as “home” or “away” is arbitrary and not based on the team’s strength. Therefore, it is surprising that away team goals are significant while home team goals are not. We purposefully did not include away team result or away half result because they mirror the home results exactly (a home win is an away loss). This model correctly predicted that a team would not make it to the round of 16 102 times and incorrectly predicted this 27 times. It correctly predicted that a team would make it to the round of 16 317 times and incorrectly predicted this 139 times.
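The four counts reported above can be turned into the usual classification rates. The sketch below assumes the following reading of those counts: 317 true positives, 139 false positives, 102 true negatives, and 27 false negatives, with "makes the round of 16" as the positive class. It uses the textbook definitions, which condition on the true class. Rates conditioned on the prediction instead, such as the share of "will not advance" predictions that were wrong, are a different quantity, and one is included for comparison.

```python
def classification_rates(tp, fn, tn, fp):
    """Standard rates from a 2x2 confusion matrix (positive = advances)."""
    return {
        # Conditioned on the true class:
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # equals 1 - sensitivity
        # Conditioned on the prediction:
        "false_omission_rate": fn / (fn + tn),  # wrong among predicted negatives
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts as read from the lasso results in the text (an interpretation).
rates = classification_rates(tp=317, fn=27, tn=102, fp=139)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, sensitivity is 317/344 and specificity is 102/241, so the model is much better at recognizing teams that advance than teams that do not.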

After running these two models, we compared their sensitivity, specificity, false positive rate, and false negative rate using leave-one-out cross-validation. In this method, one observation is left out of the data, the model is fitted to the remaining observations, and the fitted model is then tested on the observation that was left out. This is repeated until every observation has been left out and tested once. The results are shown in the table below, with green values indicating where one model was better and red values indicating where one model was worse. We are looking for high sensitivity and specificity as well as low false positive and false negative rates. Sensitivity is better for the lasso model. The false positive rate is the same for both models. Specificity and the false negative rate are more desirable for the backwards model. The best model therefore depends on what an individual values most. If correctly identifying teams that advance (true positives) matters most, the lasso model may fit better. If correctly identifying teams that do not advance (true negatives) matters most, the backwards model may fit better. Therefore, neither model is objectively better.

Model | Sensitivity | Specificity | FPR | FNR
Lasso | 0.922 | 0.423 | 0.304 | 0.209
Backwards | 0.852 | 0.502 | 0.304 | 0.123
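The leave-one-out procedure described above can be sketched as a loop that refits the model once per observation. This is an illustration only: the nearest-centroid classifier below is a deliberately simple stand-in for the lasso and backwards models, and the six data points are invented so that the example is self-contained.

```python
def leave_one_out(X, y, fit, predict):
    """Return one out-of-sample prediction per observation.

    For each index i, the model is fitted on all observations except i
    and then asked to predict observation i.
    """
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        predictions.append(predict(model, X[i]))
    return predictions


def fit_centroids(X, y):
    """Toy model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_centroid(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


# Invented one-feature data: label 1 = advanced, label 0 = did not.
X = [0.0, 1.0, 2.0, 7.0, 8.0, 9.0]
y = [0, 0, 0, 1, 1, 1]

predictions = leave_one_out(X, y, fit_centroids, predict_centroid)
correct = sum(p == actual for p, actual in zip(predictions, y))
print(predictions, correct / len(y))
```

The out-of-sample predictions collected this way can then be compared with the true labels to count true and false positives and negatives for each model.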

CONCLUSION

Our first question explored: “Does the comeback rate of teams in the group stages affect the success of teams once they are in the knockout stage?” The search to answer this question was motivated primarily by a desire to see if the “top 8” teams, the eight countries that have ever won a World Cup title, are the best eight teams in the world or if their wins may have been due to luck or a uniquely strong team in one given year. However, our results showed that there is essentially no relationship between a team’s comeback rate in the group stage and its performance in the knockout round in general or in the championship match specifically. Ultimately, while we had hoped to identify a potential indicator of knockout round success, the comeback rate is not a reliable one. We believe this is primarily because the low-scoring nature of soccer makes it very difficult to come back from a half-time deficit.

Our second question explored: “Given the factors we have considered (time; attendance; and home vs away team goals, half-time result, and final result), can we predict if a team will make it to the knockout-stage based on the features of group stage games?” The backward selection method found the best model to include time, attendance, and home team result; however, this model did not have strong specificity. The lasso method found the best model to include away team goals, home team result, and home half result. The lasso model has the higher sensitivity, so it is stronger at identifying positive cases, while the backwards model is better at identifying negative cases. Overall, both the lasso and backward models have potential to predict whether a team appears in the round of 16, and the choice between them depends on whether sensitivity or specificity and FNR matter more.

For our first question, regarding the correlation between comeback rate and win percentage in the knockout stages and final, the results were intended to be used as a predictor of who would win the World Cup. If we had found a strong correlation between the two variables, we would have been able to investigate further whether comeback rate was an accurate predictor. Our findings remain important. First, they highlight the high level of competition in this tournament. Because the skill level of every country changes over time, it is hard to find correlations between coming back in games and winning games in later rounds, but exploring how team skill level affects comeback rate or total win rate could be interesting for future analysis. Second, comeback rates are not a solid statistic to base bets on. Since the correlation is weak, betting on a team because it tends to come back when losing is not a reliable strategy, as it does not increase the likelihood that the team will win. For our second question, our methods showed that time, attendance, home team result, away team goals, and home half result are the most significant predictors of whether a team makes it to the round of 16. This was slightly unexpected, as we did not predict that these variables would have any significant impact on determining who would advance to the round of 16. It is important to note that the home team result was found to be a significant predictor in both models. These findings are important because they suggest that if we can understand how attendance and time relate to match winners, we could better predict who would win each game, which would be helpful for those interested in gambling.
Unfortunately, our dataset lacked many real-world predictors, and we were unable to find any complete, cohesive datasets that had more, so we could not test potentially better predictors such as temperature and altitude.

Our second question in particular paves the way for researchers to continue searching for a group stage match statistic that could predict who will win knockout stage games, specifically using home team results. Since this variable was found significant in both models, it could be useful to explore if other models also find home team result to be a strong predictor. It may also be worth investigating further whether there is an underlying bias of some sort towards the home team, even though home team assignment is supposedly arbitrary. The next step could be to explore if total team goals and team goal differential throughout the group stage are significant in predicting group stage winners or if there is any correlation among them. On the modeling side, there are many directions that research could go. The large number of exogenous variables that exist in tournaments likely needs to be represented in some way. As mentioned previously, examining other external variables, such as altitude and temperature, could be a useful way to find better atmospheric predictors of match winners. Although not much such data currently exists for the World Cup over time, this suggests studying more exogenous variables within tournament play. We think that one of the most useful variables to explore could be the skill of individual team players. Through individual player skill assessment and access to data on player match ratings, goals, assists, key dribbles, shots on goal, and key passes, it may be possible to find player statistics that are significant in predicting their team’s success. Another route could be to consider team statistics by game, such as possession, fouls, shots on goal, and movement patterns of the team. These types of variables give more specific information about matches and team quality, and including them in prediction models has the potential to be significant.
The skill level of a team is constantly changing over time, and it could be useful to explore how the skill levels of teams change as their success changes. In general, including variables that represent the skill and technical levels of a team could increase a model’s ability to accurately predict a team’s outcome, and this could lead to more analysis on using a team’s outcome to predict its future success.