The following analysis is an exercise I performed to predict the outcome of EURO 2016. My only resources were a spreadsheet and some publicly available data.
Some of the hypotheses behind this prediction are:
- Players at bigger clubs (by UEFA club rating) will perform better in Europe.
- The best players from European countries play for European clubs.
- The strength of the full 23-member squad determines a nation's performance better than the starting 11 alone.
- Every position, be it goalkeeper, defense, midfield or striker, matters equally to a nation's success.
Factors not taken into consideration:
- Form of the player
- Teamwork
- Confidence of an individual player
- Home advantage
- Injury
- Credibility of the manager
Steps followed in the analysis:
Step 1: The 23-member squad list of each country was collected: the players' names and the clubs they play for.
Step 2: A list of 400 clubs across Europe was collected, along with their standings as per UEFA.
For my final analysis I considered only the top 100 of those 400 clubs,
on the hypothesis that a player can make an impact only if he plays
for one of the top 100 clubs in Europe. I divided the 100 clubs into
10 segments of 10 clubs each and rated each club by its segment: 10
points for the top segment, down to 1 point for the bottom segment.
Then I looked up every player in each nation's squad and gave him his
club's rating (under the hypothesis above, a player whose club is
outside the top 100 contributes nothing). Summing these gives a
cumulative rating for each country. Finally, going through the
fixtures, I decided each result on the assumption that the nation
with the higher rating progresses through the tournament and the
nation with the lower rating does not.
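The rating scheme described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet I actually used: the function names, the club ranks and the squads below are all made-up placeholders, and the tie handling (returning None) is my assumption, since the method above does not say what happens when two nations have the same rating.

```python
# Illustrative sketch of the club-segment rating; all names and data are hypothetical.

def club_points(rank, top_n=100, segment_size=10):
    """Points for a club given its UEFA rank: 10 for ranks 1-10,
    9 for ranks 11-20, ..., 1 for ranks 91-100, and 0 otherwise."""
    if rank < 1 or rank > top_n:
        return 0
    n_segments = top_n // segment_size
    return n_segments - (rank - 1) // segment_size


def country_ratings(squads, club_rank):
    """Sum the club points of every squad member, per country.
    Clubs missing from club_rank are treated as outside the top 100."""
    ratings = {}
    for country, players in squads.items():
        ratings[country] = sum(
            club_points(club_rank.get(club, 0)) for _player, club in players
        )
    return ratings


def predict_winner(team_a, team_b, ratings):
    """The higher-rated nation progresses; a tie is left undecided (assumption)."""
    if ratings[team_a] == ratings[team_b]:
        return None
    return team_a if ratings[team_a] > ratings[team_b] else team_b


# Placeholder data: real input would be 23 players per nation and the UEFA list.
club_rank = {"Club A": 3, "Club B": 15, "Club C": 97, "Club D": 250}
squads = {
    "Nation X": [("Player 1", "Club A"), ("Player 2", "Club C"), ("Player 3", "Club D")],
    "Nation Y": [("Player 4", "Club B"), ("Player 5", "Club B"), ("Player 6", "Unlisted FC")],
}

ratings = country_ratings(squads, club_rank)
winner = predict_winner("Nation X", "Nation Y", ratings)
print(ratings, winner)
```

Applying predict_winner to each fixture in turn, round by round, gives the kind of bracket listed in the prediction.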
Prediction:
Quarter-final teams: Ukraine, Spain, England, Belgium, Germany, Italy, France, Russia
Semi-final teams: Spain, Belgium, Germany, France
Final: Spain, Germany
Champion: Spain
Runner-up: Germany
The analysis may seem simplistic, but one of the main objectives of
the exercise is to encourage readers to start doing analysis on
simple use cases and to realize that, beyond the smoke screen of data
science jargon, it is not that complicated. An analyst who completes
a use case like this experiences a complete analytics life cycle.
What generally changes in a more detailed analysis is the volume of
data and the number of factors to be considered.
Lessons we can take from this exercise:
Searching for the right data: The use case and the hypotheses tell us
what data to search for, gather or ask the business for. Many times
in my career I have seen analysts being handed some data and asked to
find something interesting. That should not be the case. The business
use case should drive the analysis.
Data preparation: You must have heard of the 80-20 rule, where 80% of
the time is spent preparing the data and 20% doing the actual
analysis. My data came as web links, so I had to scrape it, massage
it and clean it into a shape that could be used for analysis.
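As one small example of that cleaning work, club names scraped from different pages rarely match character for character, so they need a common form before squads can be joined to the club list. The helper below is a hypothetical sketch using only the Python standard library, not the exact steps I followed; it removes accents, collapses whitespace and ignores case.

```python
import unicodedata


def normalize_club_name(raw):
    """Return a comparison-friendly form of a scraped club name (hypothetical helper)."""
    # Decompose accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim the ends, and compare case-insensitively.
    return " ".join(no_accents.split()).casefold()


# Two spellings of the same club end up with the same key.
same = normalize_club_name("  Atlético   Madrid ") == normalize_club_name("Atletico Madrid")
print(same)
```

Real data usually needs more than this (abbreviations such as "FC", alternative club names), which is typically handled with a hand-made lookup table.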
Feasibility of the variables to consider: The complexity and accuracy
of an analysis depend mostly on the suitability of the algorithm for
the use case, the number of variables considered and the size of the
data analyzed. Given the timeline and the resources in hand, one
should decide how extensively to go about it.
Consideration of hypotheses: The hypotheses considered should be
clearly stated as part of the analysis. The result of the analysis
will prove or disprove them.