Saturday 18 February 2023

The art of storytelling - The batting prowess of Sachin Tendulkar

A visualization is only as good as the story it captures, giving the user context for the data and ending with an insight they may not have had before. Below is a link to the storyboard I created on Tableau Public analyzing the Test centuries scored by the "Great" Sachin Tendulkar. Though I have always been a fan of the great man and have followed his stats closely, there were certain insights I only discovered while creating this story.


Sachin scored a record 51 centuries in Tests. He scored a century against every Test-playing nation.

He scored at least one century in every Test-playing country except Zimbabwe.

He scored more centuries away than at home.

His best year in terms of Test centuries was the penultimate year of his career.

Apart from Chennai and Nagpur in India, Sydney was another venue where he scored three Test centuries.

India won 40% of the matches, and did not lose in 80% of them, when Sachin scored a century.

Data analysis of the COVID financial decoupling

Recently a good friend asked me to explain, with data, the decoupling of the real economy from the financial markets, or in other words the decoupling of Main Street and Wall Street. The context of the discussion was the few months after the onset of the pandemic, when the real economy came to a stop while the stock market kept going up. One can find many resources explaining the economics of it, such as stimulus liquidity, forward-looking valuations and so on; what I concentrated on was how we look at the data.

First, I tried to avoid understanding the broad economy by looking at an index. Indexes like the Nifty 50, to take India as an example, may not always be the best way to understand the broader market. (My explanation is based on the Indian financial market, but the same reasoning can be applied to any country's market.) The NIFTY index is reconstituted every six months, and it combines different companies with different weightages. So, in a way, it can be said to represent the best, or most optimistic, picture of the economy.

Next, I created an index of my own, trying to represent companies from different market caps and industries with equal weightages. The two indexes are shown below.
What we saw was that, though both were impacted by the pandemic in roughly equal measure, the broader market had not been doing well since 2018, whereas the main indexes had been doing better. I leave it to economists and administrators to understand what may have led to this; perhaps consolidation of the market towards large caps, more index investing, and so on. What I suggested to my friend was that he should not look only at the main indexes as indicators of the economy, but try to understand the broader market as a whole.
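
As a rough illustration of how such an equal-weight index can be built from daily close prices with pandas, here is a minimal sketch; the file names, columns and benchmark series are placeholders rather than my actual data source:

```python
import pandas as pd

# Daily close prices, one column per stock drawn from different caps and industries
# (placeholder file; any price source with a date index would do).
prices = pd.read_csv("daily_closes.csv", index_col="date", parse_dates=True)

# Rebase every stock to 100 on the first day so small caps count as much as large caps.
rebased = prices / prices.iloc[0] * 100

# Equal-weight index = simple average of the rebased series on each day.
equal_weight_index = rebased.mean(axis=1)

# The benchmark index (e.g. Nifty 50 levels), rebased the same way for comparison.
nifty = pd.read_csv("nifty50_levels.csv", index_col="date", parse_dates=True)["close"]
nifty = nifty / nifty.iloc[0] * 100

comparison = pd.DataFrame({"equal_weight": equal_weight_index, "nifty50": nifty})
comparison.plot(title="Equal-weight index vs Nifty 50 (rebased to 100)")
```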

But my takeaway was that how we look at the data, and how well we understand the domain, makes a huge difference to how well we analyze the use case.

Friday 10 February 2023

Transformer and its impact on Language Understanding in Machine Learning

The Transformer, a deep learning model proposed in the paper 'Attention Is All You Need', relies on a mechanism called attention and does away with recurrence (which many of its predecessors depended on) to reach a new state of the art in translation quality, with significantly more parallelization and much less training time.
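
As a rough sketch of the attention mechanism the paper describes (single-head, unbatched, with illustrative dimensions; a real implementation adds multiple heads, masking and learned projections):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """query, key, value: tensors of shape (sequence_length, d_model).

    Every position attends to every other position in one matrix multiply,
    which is what makes the computation easy to parallelize (no recurrence).
    """
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len) similarities
    weights = F.softmax(scores, dim=-1)                   # attention weights per position
    return weights @ value                                # weighted sum of the values

# Toy usage: self-attention over a 5-token "sentence" of random embeddings.
x = torch.randn(5, 64)
out = scaled_dot_product_attention(x, x, x)  # Q = K = V = x
print(out.shape)  # torch.Size([5, 64])
```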


Because it does not rely on recurrence, the Transformer does not require sequence data to be processed in order, which allows parallelization and hence training on much larger datasets in much less time. The Transformer has enabled the development of pre-trained models like BERT, which in turn has enabled much-needed transfer learning. Transfer learning is a machine learning method where a model trained for one task is reused as the starting point for another task. As with BERT, this approach delivers several advantages, considering the vast compute and resources required to train the base model. BERT is a large model (a 12-layer to 24-layer transformer) trained on a large corpus such as Wikipedia for a long time. It can then be fine-tuned for specific language tasks. Though pre-training is expensive, fine-tuning is inexpensive. The other aspect of BERT is that it can be adapted to many types of NLP tasks. This lets organizations leverage large transformer-based models trained on massive datasets, and apply them to their own NLP tasks, without spending significant time, computation and effort.
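
To make the fine-tuning point concrete, here is a minimal sketch using the Hugging Face transformers library (my choice of toolkit for the example, not something prescribed above): the pre-trained BERT encoder is loaded as-is, and only the small classification head on top is new and would be trained on the downstream task.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load pre-trained BERT; the classification head for a 2-class task is freshly initialised.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy, domain-flavoured example: "credit" used in two different senses.
texts = ["The bank extended a new line of credit to the firm.",
         "Credit for the discovery goes to the research team."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (2, num_labels), head not yet trained

# Fine-tuning proceeds from here: attach labels, compute a cross-entropy loss and
# update the weights for a few epochs on the target task, which is cheap compared
# with pre-training the encoder from scratch.
```
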
One of the challenges of using pre-trained transformer-based models is that they learn a representation of general English (when the language is English). Thus even a strong base model might not always perform accurately when applied to a domain-specific context. For example, 'credit' can mean different things in general English and in a financial context. Significant fine-tuning on specific tasks and customer use cases can go a long way towards solving this.
Research is being done on whether a similar transformer-based architecture can be applied to video understanding. The thought process is that a video can be interpreted as a sequence of image patches extracted from individual frames, similar to how sentences are sequences of individual words in NLP.
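
As a toy sketch of that idea (all shapes here are illustrative), a short clip can be flattened into a sequence of patch tokens, ready to feed a transformer encoder:

```python
import torch

# An 8-frame, 64x64 RGB clip split into 16x16 patches.
frames, channels, height, width, patch = 8, 3, 64, 64, 16
video = torch.randn(frames, channels, height, width)                # (T, C, H, W)

# Cut every frame into non-overlapping patch x patch squares.
patches = video.unfold(2, patch, patch).unfold(3, patch, patch)     # (T, C, H/p, W/p, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(frames, -1, channels * patch * patch)

# Flatten time and space into one long token sequence, analogous to words in a sentence.
tokens = patches.reshape(-1, channels * patch * patch)              # (T * patches_per_frame, patch_dim)
print(tokens.shape)  # torch.Size([128, 768]) for these toy dimensions
```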