Friday, 24 June 2016

Prediction of EURO 2016 using just a spreadsheet and some publicly available data

The following analysis is an exercise I performed to predict the outcome of EURO 2016. My resources were just a spreadsheet and some publicly available data.

Some of the hypotheses made for this prediction are:
  1. Players who play for bigger clubs (club ratings as per UEFA) will perform better in Europe.
  2. The best players from European countries play for European clubs.
  3. The strength of the 23-member squad determines a nation's performance better than the starting 11 alone.
  4. Every position, be it goalkeeper, defense, midfield or striker, matters equally to the success of a nation.

Factors not taken into consideration:
  1. Form of the player
  2. Teamwork
  3. Confidence of an individual player
  4. Home advantage
  5. Injury
  6. Credibility of the manager

Steps followed in the analysis:
Step 1: The 23-member squad list of each country was collected: the players' names and the clubs they play for.
Step 2: A list of 400 clubs across Europe was collected, along with their standings as per UEFA.

For my final analysis I considered only the top 100 clubs from the list of 400, on the hypothesis that a player can make an impact only if he plays for one of the top 100 clubs in Europe. I divided the 100 clubs into 10 segments of 10 clubs each and rated each club by its segment, with the top segment getting 10 points and the bottom segment getting 1. Then I looked up every player in each nation's squad and rated him by his club's rating. The result is a cumulative rating for each country. Finally, going through the fixtures, I decided each result on the assumption that the nation with the higher rating progresses through the tournament and the nation with the lower rating does not.
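The rating scheme above was done in a spreadsheet, but it can also be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names, the club names and the ranks are hypothetical placeholders, players at clubs outside the top 100 are assumed to score 0, and a tie is assumed to go to the first-named team (the write-up above does not say how ties were handled).

```python
def club_points(rank, top_n=100, segment_size=10):
    """Points for a club given its 1-based UEFA rank.

    Ranks 1-10 get 10 points, 11-20 get 9, ..., 91-100 get 1.
    Assumption: clubs outside the top 100 (or unknown clubs) get 0.
    """
    if rank is None or rank < 1 or rank > top_n:
        return 0
    segments = top_n // segment_size
    return segments - (rank - 1) // segment_size


def squad_rating(squad_clubs, club_rank):
    """Cumulative rating of a squad: sum of the points of each player's club."""
    return sum(club_points(club_rank.get(club)) for club in squad_clubs)


def predict_winner(team_a, team_b, ratings):
    """The higher-rated nation progresses.

    Assumption: on a tie the first-named team goes through.
    """
    return team_a if ratings[team_a] >= ratings[team_b] else team_b


# Hypothetical placeholder data, not the real 2016 rankings or squads.
club_rank = {"Club A": 3, "Club B": 57}
ratings = {
    "Nation X": squad_rating(["Club A", "Club B", "Club Z"], club_rank),
    "Nation Y": squad_rating(["Club B", "Club B"], club_rank),
}
winner = predict_winner("Nation X", "Nation Y", ratings)
```

Applying `predict_winner` to every fixture, round by round, reproduces the "higher rating progresses" rule used for the prediction.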

Prediction:

Quarter-final teams: Ukraine, Spain, England, Belgium, Germany, Italy, France, Russia

Semi-final teams: Spain, Belgium, Germany, France

Final: Spain, Germany

Champion: Spain
Runner-up: Germany

The analysis may seem simplistic, but one of the main objectives of the exercise is to encourage readers to start doing analysis on simple use cases and to realize that, beyond the smoke screen of data science jargon, it's not that complicated. Once an analyst completes a use case like this, he has experienced a complete analytics life cycle. What will generally change for a more detailed analysis is the volume of data and the number of factors to be considered.


Learnings that we may take from this exercise:

Searching for the right data: The use case and the hypotheses tell us what data to search for, gather or ask the business for. Many times in my career I have seen analysts being handed some data and asked to find something interesting. That should not be the case. The business use case should drive the analysis.

Data preparation: You must have heard of the 80-20 rule, where 80% of the time is spent preparing the data and 20% doing the actual analysis. My data came as web links, so I had to scrape it, massage it and clean it to get it into a shape that could be used for analysis.
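As one example of what such cleaning can involve, club names scraped from a squad list and from a ranking list rarely match character for character. The sketch below is only an assumption about one possible cleaning step, not a description of what was actually done for this exercise; `normalize_club_name` is a hypothetical helper that strips accents, case and punctuation so that the two lists can be matched.

```python
import re
import unicodedata


def normalize_club_name(name):
    """Return a simplified key for matching club names across sources."""
    # Decompose accented characters, then drop the combining accent marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and collapse anything that is not a letter or digit to one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


# Two spellings of the same club map to the same key.
same = normalize_club_name("  Atlético  Madrid ") == normalize_club_name("Atletico Madrid")
```

A step like this will not resolve abbreviations or alternative names, so some manual checking in the spreadsheet is still likely to be needed.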

Feasibility of the variables to consider: The complexity and accuracy of an analysis mostly depend on the suitability of the algorithm for the use case, the number of variables considered and the size of the data analyzed. Looking at the timeline and the resources in hand, one should decide how extensively one wants to go about it.

Consideration of hypotheses: The hypotheses considered should be clearly stated as part of the analysis. The result of the analysis will prove or disprove our hypotheses.