Saturday 18 February 2023

Demystifying BERT - Things you should know before trying out BERT

 Introducing BERT


What is BERT?
BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

What makes BERT different?
BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).

Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
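To make the contrast concrete, here is a minimal sketch, assuming the Hugging Face transformers and torch packages (which are not part of the original Google release), that pulls out the vector for “bank” in two different sentences and shows that the two vectors differ:

import torch
from transformers import BertModel, BertTokenizer

# Load the pre-trained (uncased) BERT base model and its tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_account = bank_vector("I accessed the bank account.")
v_river = bank_vector("We sat on the bank of the river.")

# A context-free embedding would give identical vectors for "bank"; BERT does not.
print(torch.cosine_similarity(v_account, v_river, dim=0).item())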

The Strength of Bidirectionality

If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.

To solve this problem, we use the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words. For example:

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.
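As a quick illustration of masked-word prediction, here is a small sketch using the Hugging Face transformers fill-mask pipeline (my choice of tooling for the example, not the TensorFlow scripts in Google's original release):

from transformers import pipeline

# The fill-mask pipeline loads bert-base-uncased together with its masked-language-model head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked word from both its left and right context.
for prediction in fill_mask("the man went to the [MASK] and bought a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Words like "store" or "market" are expected near the top of the list.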

BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:

Sentence A: the man went to the store.
Sentence B: he bought a gallon of milk.
Label: IsNextSentence

Sentence A: the man went to the store.
Sentence B: penguins are flightless.
Label: NotNextSentence
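This idea can be poked at directly through BERT's next-sentence-prediction head; a minimal sketch, again assuming the Hugging Face transformers and torch packages:

import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def prob_is_next(sentence_a, sentence_b):
    """Probability that sentence B actually follows sentence A."""
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=1)[0, 0].item()   # index 0 = "B is the next sentence"

print(prob_is_next("the man went to the store .", "he bought a gallon of milk ."))  # high
print(prob_is_next("the man went to the store .", "penguins are flightless ."))     # low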
How I extended BERT for a chatbot

The already pre-trained BERT model was fine-tuned on the SQuAD dataset.
The model was pre-trained for 40 epochs over a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words).

The model is fine-tuned on the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
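To give a feel for what a SQuAD-fine-tuned BERT does, here is a minimal sketch using the Hugging Face question-answering pipeline; the publicly available bert-large-uncased-whole-word-masking-finetuned-squad checkpoint stands in for the model I actually fine-tuned:

from transformers import pipeline

# A BERT-Large checkpoint that has already been fine-tuned on SQuAD.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("SQuAD is a reading comprehension dataset consisting of questions "
           "posed by crowdworkers on a set of Wikipedia articles.")

result = qa(question="Who wrote the questions in SQuAD?", context=context)
print(result["answer"], result["score"])   # expected answer: "crowdworkers"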


Deployment

I used Flask, a web services framework in Python, to wrap the machine learning code into an API.
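A minimal sketch of that wrapping, assuming Flask and the Hugging Face transformers package; the route name and JSON fields here are illustrative, not the exact code I deployed:

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the question-answering model once at startup, not on every request.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

@app.route("/answer", methods=["POST"])
def answer():
    payload = request.get_json()
    result = qa(question=payload["question"], context=payload["context"])
    return jsonify({"answer": result["answer"], "score": result["score"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)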
Training and Maintenance

  • Google BERT is a pre-trained model, so there is no training from scratch involved.
  • You can fine-tune it, though, as I did on the SQuAD dataset.
  • If you can spend some time understanding the underlying code, you can customize it to better suit your domain and requirements, as we did.
  • Once the code is deployed, it needs to be constantly monitored and evaluated to identify scope for improvement.
  • No day-to-day training is required.
Infra spec

Though the BERT pre-trained model should be able to run on any infra spec that is generally advised for analytics use cases, the infra Google advises for fine-tuning is on the higher end by non-Google standards.
(Though BERT without fine-tuning is also effective, fine-tuning results in substantial accuracy improvements.)

As per Google –

  • Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
  • All results in the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to reproduce most of the BERT-Large results in the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small.
  • The fine-tuning examples which use BERT-Base should be able to run on a GPU that has at least 12GB of RAM using the hyperparameters given.
  • Most of the examples assume that you will be running training/evaluation on your local machine, using a GPU like a Titan X or GTX 1080.


Friday 10 February 2023

Transformer and its impact on Language Understanding in Machine Learning

The Transformer, a deep learning model, was proposed in the paper ‘Attention Is All You Need’. It relies on a mechanism called attention and drops recurrence (which many of its predecessors depended on) to reach a new state of the art in translation quality, with significantly more parallelization and much less training time.
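For the curious, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer; it is a simplified single-head version for illustration, not the full multi-head implementation from the paper:

import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """queries, keys, values: (sequence_length, d_k) matrices."""
    d_k = queries.shape[-1]
    # Every position is compared with every other position in one matrix multiply,
    # which is why attention parallelizes so much better than recurrence.
    scores = queries @ keys.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ values                                      # weighted sum of values

# Toy example: a sequence of 4 tokens, each represented by an 8-dimensional vector.
x = np.random.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)               # (4, 8)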


Because it does not rely on recurrence, the Transformer does not require sequence data to be processed in order. This allows parallelization, and thus training on much larger datasets in much less time.

The Transformer has enabled pre-trained models like BERT, which in turn enable much-needed transfer learning. Transfer learning is a machine learning method where a model trained for one task is reused as the starting point for another task. As with BERT, this approach delivers several advantages, considering the vast compute and resources required to train the base model. BERT is a large model (a 12-layer to 24-layer Transformer) that is trained on a large corpus like Wikipedia for a long time; it can then be fine-tuned for specific language tasks. Though pre-training is expensive, fine-tuning is inexpensive. The other strength of BERT is that it can be adapted to many types of NLP tasks. This lets organizations leverage large Transformer-based models trained on massive datasets, and apply them to their own NLP tasks, without spending significant time, computation, and effort.
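A minimal sketch of that transfer-learning pattern, assuming the Hugging Face transformers and torch packages; the two sentences and labels are made-up toy data, only there to show the shape of the fine-tuning loop:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Start from the pre-trained weights and add a small, randomly initialised classification head.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I love this product", "This was a terrible experience"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # small learning rate: only nudge the weights
model.train()
for epoch in range(3):                                       # a few epochs is usually enough
    optimizer.zero_grad()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())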
One of the challenges of using pre-trained, Transformer-based models is that they learn a representation of general language (general English, when the language is English). So even a strong base model might not always perform accurately when applied to a domain-specific context. For example, “credit” can mean different things in general English and in a financial context. Significant fine-tuning on specific tasks and customer use cases can go a long way towards solving this.
Research is also under way on whether a similar Transformer-based architecture can be applied to video understanding. The thought process is that a video can be interpreted as a sequence of image patches extracted from individual frames, similar to how a sentence is a sequence of individual words in NLP.