Saturday, 18 February 2023

Demystifying BERT - Things you should know before trying out BERT

Introducing BERT


What is BERT?
BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

What makes BERT different?
BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).

Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
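The distinction can be made concrete with a toy sketch. It does not compute any embeddings; it only lists which neighbouring words each kind of model is allowed to use when representing “bank”. The function name `visible_context` and the mode labels are made up for this illustration.

```python
def visible_context(tokens, index, mode):
    """Return the words a model of the given kind may use to represent tokens[index]."""
    if mode == "context-free":
        # word2vec / GloVe style: one fixed vector per word, no neighbours used
        return []
    if mode == "left-to-right":
        # unidirectional contextual model: only the words that come before
        return tokens[:index]
    if mode == "bidirectional":
        # BERT style: words on both sides of the target
        return tokens[:index] + tokens[index + 1:]
    raise ValueError(f"unknown mode: {mode}")


tokens = "I accessed the bank account".split()
bank = tokens.index("bank")

for mode in ("context-free", "left-to-right", "bidirectional"):
    print(f"{mode:>13}: {visible_context(tokens, bank, mode)}")
```

Only the bidirectional case includes “account”, which is the word that tells a reader which sense of “bank” is meant.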

The Strength of Bidirectionality
If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.

To solve this problem, we use the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words.
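A minimal sketch of the masking step is given here, assuming whitespace tokenization and a 15% masking rate. It is a simplification: it always substitutes the `[MASK]` token, whereas the actual BERT recipe sometimes keeps the original word or swaps in a random one. The function name `mask_tokens` and the sample sentence are illustrative only.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return (masked_tokens, labels).

    labels maps each masked position to the original word, which is what
    the model would be trained to predict from the surrounding context.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK
    return masked, labels


tokens = "the man went to the store . he bought a gallon of milk".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

Because the target words are hidden from the input, the model can look both left and right without the predicted word “seeing itself”.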
While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.

BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence?
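The sketch below shows one way such sentence pairs could be generated from a list of sentences. It is a toy under stated assumptions: the function name `make_nsp_pairs` and the labels `IsNext` / `NotNext` are chosen for this illustration, negatives are drawn from the same short list rather than from a large corpus, and the input needs at least three sentences.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples for next-sentence prediction."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # positive example: B really is the sentence that follows A
            b, label = sentences[i + 1], "IsNext"
        else:
            # negative example: any sentence that is neither A nor its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            b, label = rng.choice(candidates), "NotNext"
        pairs.append((a, b, label))
    return pairs


sentences = [
    "the man went to the store .",
    "he bought a gallon of milk .",
    "then he walked home .",
    "penguins are flightless birds .",
]
for a, b, label in make_nsp_pairs(sentences):
    print(f"A: {a}\nB: {b}\nLabel: {label}\n")
```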
How I extended BERT for a chatbot

An already pre-trained BERT model was fine-tuned on the SQuAD dataset.
The model was pre-trained for 40 epochs over a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words).

The model was fine-tuned on the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
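To give a feel for what a SQuAD-style training example contains, here is a small sketch. The record is flattened for readability (the published JSON nests these fields under articles and paragraphs), and the passage, question and helper name `extract_span` are invented for illustration. The key point is that the answer is a span of the passage, located by a character offset.

```python
# Hypothetical passage and answer, written for this sketch only.
context = ("BERT was pre-trained on BooksCorpus and English Wikipedia "
           "before being fine-tuned on SQuAD.")
answer_text = "BooksCorpus and English Wikipedia"

example = {
    "context": context,
    "question": "What was BERT pre-trained on?",
    "answers": [
        # answer_start is the character offset of the answer inside the context
        {"text": answer_text, "answer_start": context.index(answer_text)},
    ],
}


def extract_span(example):
    """Recover the answer by slicing the context with the stored offset."""
    answer = example["answers"][0]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return example["context"][start:end]


print(extract_span(example))
```

A fine-tuned question answering model is trained to predict the start and end of that span for a new question and passage.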


Deployment

I used Flask, a web framework for Python, to wrap the machine learning code in an API.
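The shape of such a wrapper can be sketched as follows. This is not the Flask code used for the project: to stay dependency-free it is written against the plain WSGI interface from the standard library, which is the interface Flask applications are built on. The `/answer` route, the JSON field names and `answer_question` (a placeholder that just returns the first sentence of the passage) are all assumptions made for this sketch.

```python
import io
import json


def answer_question(question, context):
    """Placeholder for the fine-tuned model's prediction call (hypothetical)."""
    return context.split(".")[0].strip()


def app(environ, start_response):
    """WSGI app exposing POST /answer, which takes and returns JSON."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/answer":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "not found"}).encode("utf-8")]

    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    answer = answer_question(payload.get("question", ""), payload.get("context", ""))

    body = json.dumps({"answer": answer}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(method, path, payload):
    """Exercise the app in-process, without starting a server."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {"REQUEST_METHOD": method, "PATH_INFO": path,
               "CONTENT_LENGTH": str(len(raw)), "wsgi.input": io.BytesIO(raw)}
    seen = {}
    body = b"".join(app(environ, lambda status, headers: seen.update(status=status)))
    return seen["status"], json.loads(body)


print(call_app("POST", "/answer", {
    "question": "What is BERT?",
    "context": "BERT is a language model. It was released in 2018.",
}))
```

To serve it locally, the same `app` object can be passed to `wsgiref.simple_server.make_server`. In a Flask version, the routing and JSON parsing above would be handled by the framework.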
Training and Maintenance

  • Google BERT is a pre-trained model, so you do not need to train it from scratch.
  • You can fine-tune it, though, as I did on the SQuAD dataset.
  • If you spend some time understanding the underlying code, you can customize it to better suit your domain and requirements, as we did.
  • Once the code is deployed, it needs to be monitored and evaluated continuously to identify areas for improvement.
  • No day-to-day training is required.
Infra spec

Although the pre-trained BERT model should run on any infrastructure that is generally advised for analytics use cases, the infrastructure Google recommends for fine-tuning is on the higher end by non-Google standards.
(BERT is also effective without fine-tuning, but fine-tuning results in substantial accuracy improvements.)

As per Google –

  • Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
  • All results on the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small.
  • The fine-tuning examples which use BERT-Base should be able to run on a GPU that has at least 12GB of RAM using the hyperparameters given.
  • Most of the examples below assumes that you will be running training/evaluation on your local machine, using a GPU like a Titan X or GTX 1080.
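The Dev F1 score quoted above measures word overlap between a predicted answer and the reference answer. A simplified sketch of that metric follows; the official SQuAD evaluation also strips punctuation and articles before comparing, which is left out here, and the function name `token_f1` is chosen for this illustration.

```python
from collections import Counter


def token_f1(prediction, truth):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()

    # multiset intersection: each shared word counts as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "BooksCorpus and English Wikipedia"
print(token_f1("BooksCorpus and English Wikipedia", reference))  # exact match
print(token_f1("English Wikipedia", reference))                  # partial match
print(token_f1("the SQuAD dataset", reference))                  # no overlap
```

A partial answer gets partial credit, which is why F1 is usually reported alongside the stricter exact-match score.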

References I used for my learning and some content

Flavors of Text Analytics, NLP & Cognitive Analytics that every business should try

The art of storytelling - The batting prowess of Sachin Tendulkar

   A visualization is only as good as the story it captures: it gives the user context for the data and ends with an insight they may not have had before. Below is a link to the storyboard I created on Tableau Public, analyzing the centuries scored by the “Great” Sachin Tendulkar in Tests. Though I have always been a fan of the great man and have followed his stats closely, there were certain insights I came across only while creating this story.


Sachin scored a record 51 centuries in Tests. He scored a century against every Test-playing nation.

He scored at least one century in every Test-playing country except Zimbabwe.

He scored more centuries away than at home.

His best year in terms of Test centuries was the penultimate year of his career.

Other than Chennai and Nagpur in India, Sydney was another venue where he scored 3 Test centuries.

When Sachin scores a century, India wins 40% of those matches and avoids defeat in 80% of them.

Data analysis of the COVID financial decoupling

Recently a good friend asked me to explain, with data, the decoupling of the real economy from the financial market, or in other words the decoupling of Main Street from Wall Street. The discussion was about the first few months after the pandemic began, when the real economy came to a stop while the stock market kept going up. There are many resources explaining the economic logic behind this, such as stimulus liquidity and a forward-looking outlook; what I concentrated on was how we look at the data.

First, I avoided trying to understand the broad economy by looking at a headline index. Indexes like the Nifty 50 in India may not always be the best way to understand the broader market. (My explanation is based on the Indian financial market, but I am sure the same can be applied to any country's financial market.) The Nifty index is reconstituted every six months, and it combines different companies with different weightings. So, in a way, it can be said to give the best, or most optimistic, representation of the economy.

Next, I created an index of my own, representing companies from different market caps and industries with equal weightings. The two indexes are shown below.
So what we saw was that, although both were impacted by the pandemic in the same way, the broader market has not been doing well since 2018, whereas the main indexes have been doing better. I left it to the economists and administrators to work out what may have led to this: perhaps consolidation of markets towards large caps, more index investing, and so on. What I suggested to my friend was that he should not look only at the main indexes as indicators of the economy but try to understand the broader market as a whole.
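The difference between the two kinds of index can be sketched with a few lines of arithmetic. Everything here is hypothetical: the tickers, prices and weights are made-up numbers chosen to show the effect, not the data behind my analysis, and the weights are held fixed with no rebalancing or reconstitution.

```python
def equal_weight_index(prices, base=100.0):
    """Index where every company counts the same, whatever its size.

    prices maps a ticker to a list of prices over the same dates.
    """
    n_periods = len(next(iter(prices.values())))
    index = []
    for t in range(n_periods):
        # growth of each stock relative to its starting price
        relatives = [series[t] / series[0] for series in prices.values()]
        index.append(base * sum(relatives) / len(relatives))
    return index


def weighted_index(prices, weights, base=100.0):
    """Index where each company's growth is scaled by a fixed weight."""
    n_periods = len(next(iter(prices.values())))
    total = sum(weights.values())
    index = []
    for t in range(n_periods):
        growth = sum(weights[k] * prices[k][t] / prices[k][0] for k in prices)
        index.append(base * growth / total)
    return index


# Hypothetical data: one large cap that rises, two smaller companies that fall.
prices = {"LARGE": [100, 120], "MID": [50, 45], "SMALL": [10, 8]}
weights = {"LARGE": 0.80, "MID": 0.15, "SMALL": 0.05}

print("equal-weighted:", [round(v, 2) for v in equal_weight_index(prices)])
print("size-weighted: ", [round(v, 2) for v in weighted_index(prices, weights)])
```

With these numbers the size-weighted index rises while the equal-weighted one falls, because one large company dominates the former. That is the pattern to watch for when a headline index is used as a proxy for the whole economy.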

But my takeaway was that how we look at the data, and how well we understand the domain, make a huge difference to how well we analyze the use case.