The availability of hardware that supports intensive and often parallel computing has enabled state-of-the-art AI models. The domain of Natural Language Processing (NLP) has greatly benefited from this. In this article, we consider whether large language models are going to push smaller-scale models to oblivion. Or will David fight back? But first, let us understand...
What is a Language Model in Machine Learning?
Language models have been the proverbial caffeine shot of NLP. They have made people sit up and take notice of their limitless applications, their social and ethical implications, and the new research frontiers they open. If you were afraid to ask, or if browsing the web sounds tedious, let us first tell you about NLP and language models.
NLP tasks are machine learning (ML) or artificial intelligence (AI) based solutions to different processing requirements for human language. A common NLP task is classifying the underlying sentiment of a sentence or paragraph. Now, imagine your legal team wishes to discern whether a particular contract carries a neutral tone or is negative towards you, the addressed party. Suppose they can only make that call with confidence after they have seen a hundred such contracts. Voila! You have an NLP task at hand.
Other NLP tasks include extraction of reasoning, or relationship between entities present in text. Furthermore, question-answering and the recent exciting prompt-driven generation tasks have made NLP even more popular.
In the real world, a human child is the most efficient language model. Children assimilate the language of their environment, learn patterns, and with time are able to do different tasks on command (if they wish to). By analogy, language models are AI models that use probability and statistics to learn the language domain from given sample inputs. These models then execute NLP tasks in response to prompts.
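To make "probabilistic statistics" a little more concrete, here is a minimal sketch of a bigram language model, one of the simplest possible kinds. It only counts which word tends to follow which in the sample text. The function names and the tiny corpus are our own illustrations, and real language models are far more sophisticated than this.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows another in the sample text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # the model has never seen this word
    return {word: c / total for word, c in counts[prev].items()}

# Hypothetical sample inputs, for illustration only
corpus = [
    "the contract is neutral",
    "the contract is negative",
    "the tone is neutral",
]
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "is")
print(probs)
```

Given the word "is", this toy model considers "neutral" twice as likely as "negative", simply because that is what it saw in its sample inputs.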
How to calculate the Scale of a Large Language Model (LLM)?
Surely a grande Cappuccino is perfect for that leisurely walk to work, or that small shot of Espresso to pump up for a talk. But coffee cups are easier to measure than the scale of AI models.
The scale of a language model is judged primarily by a trio of measures: the number of parameters, the size of the training dataset, and the amount of computation needed to train the model. Take the humble Word2Vec model: it learns embeddings for a vocabulary of thousands of words, with a dimensionality that could range from 100 to 1,000, and it needs extensive re-training before it can be applied to a different context. GPT-3, on the other hand, has 175 billion parameters and a hidden dimensionality of 12,288, and the compute for its training run has been estimated to cost about $5M. The result is a model that performs exceptionally well across different contexts and needs only one or a few examples (one-shot or few-shot) to pick up a new pattern.
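As a back-of-the-envelope check on the parameter figures, the sketch below uses a widely quoted approximation for decoder-only transformers: roughly 12 x layers x (hidden size squared) parameters, ignoring embeddings. The layer count of 96 comes from GPT-3's published configuration rather than from this article, and the Word2Vec vocabulary size and dimensionality are hypothetical values picked for contrast.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough non-embedding parameter count for a decoder-only transformer.

    Per layer: about 4 * d_model^2 for the attention projections plus
    about 8 * d_model^2 for a feed-forward block that is 4x wide.
    This is an approximation, not an exact count.
    """
    return 12 * n_layers * d_model ** 2

def word2vec_params(vocab_size, dim):
    """Input and output embedding matrices, each vocab_size x dim."""
    return 2 * vocab_size * dim

# GPT-3: 96 layers (published configuration), hidden size 12,288
gpt3_params = approx_transformer_params(n_layers=96, d_model=12288)

# Hypothetical Word2Vec setup: 100,000-word vocabulary, 300 dimensions
w2v_params = word2vec_params(vocab_size=100_000, dim=300)

print(f"GPT-3 (approx.):    {gpt3_params / 1e9:.0f}B parameters")
print(f"Word2Vec (example): {w2v_params / 1e6:.0f}M parameters")
print(f"Ratio: about {gpt3_params / w2v_params:,.0f}x")
```

The approximation lands at roughly 174 billion parameters, close to the 175 billion headline figure, while the example Word2Vec setup sits at 60 million.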
The Evolving AI and Scale of Models
Let us take you through the evolution from humble machine learning to foundation models. Foundation models is an umbrella term for state-of-the-art models found across fields such as NLP and computer vision. The two words that form the essence of this journey are homogenization and emergence. Let us delve deeper into these terms.
Homogenization is the term for consolidation of methods to produce ML systems which can be applied to a wide variety of cases. It started with homogenization of algorithms, for example, logistic regression, or supervised classification. Deep learning homogenized model architectures, for example, Recurrent Neural Networks. Currently, LLMs are homogenizing the model itself. Almost all state-of-the-art NLP models are now adapted from models such as BERT, BART, T5, or GPT.
The first aspect of emergence concerns how tasks are performed. An ability is called emergent if it is not present in smaller models but is present in larger models. In machine learning, the way a task is performed emerged from examples, inferred automatically rather than coded by hand. Deep learning took this a step further: the high-level features used for prediction emerged as well.
When it comes to prompting, the headline emergent ability of LLMs is few-shot learning. You give a pre-trained language model a natural language instruction, optionally with a handful of examples, and it completes the task without any further training or gradient updates to its parameters.
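Here is a minimal sketch of what a few-shot prompt can look like in practice. The helper name, the "Text:" / "Sentiment:" layout, and the example clauses are all our own assumptions; any comparable layout would do. Note that nothing here touches the model's parameters: the examples live entirely in the prompt text.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # Leave the last label blank for the model to fill in
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

# Hypothetical examples, for illustration only
examples = [
    ("Either party may renew the agreement annually.", "neutral"),
    ("The addressee shall bear all penalties without recourse.", "negative"),
]
prompt = build_few_shot_prompt(
    instruction="Classify the sentiment of each contract clause as neutral or negative.",
    examples=examples,
    query="The addressee forfeits the deposit upon any delay.",
)
print(prompt)
```

The resulting string would be sent as-is to a pre-trained model, which is expected to continue the pattern by writing the missing label.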
Are LLMs the end of Small Scale Models?
We just showed you how LLMs possess emergent abilities that give them the upper hand over small-scale models. So does this mean that if we keep scaling up the models, more sophisticated abilities will emerge? The kicker here is that there is no definite answer to why emergent abilities appear. The idea that scaling models further will reveal more of them is still a hypothesis.
“All models are wrong, but some are useful.” - George Box, 1976.
Take the emergent ability of few-shot prompting: it was never explicitly trained for during the pre-training phase. Additionally, researchers still do not know the full extent of few-shot prompted tasks and what more can be achieved. While they agree that emergent abilities were observed after a certain scale, model scaling is not the only factor at play here.
Thanks to hardware availability, parallel computing, and state-of-the-art architectures, an LLM like GPT-3, which would theoretically have taken 355 years to train on a single NVIDIA Tesla V100 GPU, was trained in 34 days. This underscores how actively the training of large language models is being researched. It may be only a matter of time before emergent abilities are unlocked in smaller-scale models through improved architectures, better training data quality, or disruptive training procedures.
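The gap between those two figures gives a feel for the amount of parallelism involved. The arithmetic below is a rough sketch under two loud assumptions that are ours, not the article's: that every device is equivalent to one V100, and that training scales perfectly linearly with the number of devices. Neither holds exactly in practice.

```python
DAYS_PER_YEAR = 365.25

single_gpu_years = 355   # theoretical single-V100 training time quoted above
wall_clock_days = 34     # training time quoted above

# Total work expressed in "single-GPU days"
single_gpu_days = single_gpu_years * DAYS_PER_YEAR

# Assumption: perfect linear scaling across V100-equivalent devices
implied_gpus = single_gpu_days / wall_clock_days

print(f"Work: {single_gpu_days:,.0f} GPU-days")
print(f"Implied parallelism: about {implied_gpus:,.0f} V100-equivalents")
```

Under these assumptions, the quoted figures imply something in the region of 3,800 V100-equivalents working in parallel.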
“Data-Centric AI is the discipline of systematically engineering the data needed to successfully build an AI system.”
Andrew Ng, a reputed name in AI, has been bringing the focus back to the data-centric AI movement. His team showed how data quality approaches can aid model improvement through a collaborative Data-Centric AI Competition. In an interview published in IEEE Spectrum, he claims that small is the new big. By focusing on high-quality data instead of improving model architectures, better results can be achieved from smaller-scale models.
Lastly, many small-scale models have been tried and tested over time. While a pre-trained BERT model might be a good starting point for your project, it could take more resources and effort to fine-tune it in the long run. In such cases, the performance and ease of improvement of smaller-scale models may outweigh the benefits of LLMs.
AI Observability and Language Models
Language models, large or small, are not immune to performance degradation. An AI model that has been trained, tested, and deployed needs AI observability to keep it aligned with the business KPIs as well as ethical compliance. Since NLP models are most commonly found in applications with live users, it is crucial to ensure model health. Here is how AI observability keeps language models running smoothly:
Monitoring of language models lets your teams ensure data quality, which matters all the more because production data is used for fine-tuning the model. Additionally, data drift, which can be gradual or sudden, is known to degrade model performance. However, tracking is effective only if the issues are remedied in time. An automated monitoring system can therefore save critical resources like your team’s time and enable faster root cause analysis. The Censius AI observability platform is an interactive, easy-to-use solution that adds the zing to a pre-emptive strike. Not only can you tackle issues in time, but you also receive reports for future documentation.
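To give a flavor of what a drift check can look like, here is a minimal sketch based on the Population Stability Index (PSI), a common way to compare a training-time distribution with a production one. This is a generic illustration and not how the Censius platform works internally; the monitored feature (document length in tokens), the bucket edges, and the sample values are all hypothetical.

```python
import math

def population_stability_index(expected, actual, edges, eps=1e-6):
    """PSI between two samples, bucketed by the given ascending edges."""
    def bucket_shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bucket index = number of edges the value meets or exceeds
            counts[sum(v >= e for e in edges)] += 1
        # eps keeps empty buckets from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e_shares = bucket_shares(expected)
    a_shares = bucket_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))

# Hypothetical document lengths (in tokens) at training time vs. in production
training_lengths = [10, 12, 15, 20, 22, 25, 30, 35, 40, 45]
production_lengths = [35, 38, 40, 41, 42, 45, 47, 50, 55, 60]
edges = [20, 40]  # buckets: under 20, 20 to 39, 40 and above

psi_same = population_stability_index(training_lengths, training_lengths, edges)
psi_drift = population_stability_index(training_lengths, production_lengths, edges)
print(f"PSI (no drift): {psi_same:.3f}")
print(f"PSI (shifted):  {psi_drift:.3f}")
```

A frequently cited rule of thumb treats a PSI above 0.25 as a significant shift, though the right threshold depends on the application. In an automated setup, a check like this would run on a schedule and raise an alert when the threshold is crossed.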
Explainability reports help you understand the decisions taken by the model and the significance of data attributes. Besides turning the black-box AI model into a more transparent system, understanding model behavior for different cohorts can help catch bias against overlooked demographics. In NLP, there are many techniques and open-source libraries that assign importance scores to model inputs, such as SHAP, the Language Interpretability Tool (LIT), and Captum, among others. Worry not! Our highly detailed blog on the top AI Interpretability tools can help you get right into the crux of it.
Language models are trained on commonly available documents, and chatbots like ChatGPT are trained on a very large share of the text present on the internet. Needless to say, such data sources are rife with human bias and toxicity. As showcased by the hacking of the Bing chatbot, backdoor prompts may also be misused to poison the learnings of the underlying model. Given the far reach and extensive use of these applications, it is pertinent to keep a check on fairness metrics as well.
In this post, we introduced you to the principles that separate a large language model from its small-scale cousins. We also contemplated whether the vast popularity of LLMs would push small-scale models to oblivion. Lastly, we elucidated the need for AI observability in this particular domain.
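The idea of an importance score can be illustrated with a simple leave-one-out (occlusion) sketch: remove one token at a time and see how much the model's score changes. To be clear, this is not the SHAP, LIT, or Captum API; those libraries use more principled methods. The scoring function below is a hypothetical stand-in for a real sentiment model, and its word weights are made up.

```python
# Hypothetical word weights standing in for a real sentiment model
NEGATIVE_WORDS = {"penalty": 1.0, "terminate": 0.8, "liable": 0.6}

def toy_negative_score(tokens):
    """Stand-in model: a higher score means a more negative tone."""
    return sum(NEGATIVE_WORDS.get(t, 0.0) for t in tokens)

def leave_one_out_importance(tokens, score_fn):
    """Importance of each token = drop in score when that token is removed."""
    base = score_fn(tokens)
    importances = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        importances.append((tok, base - score_fn(reduced)))
    return importances

tokens = "the vendor is liable for a penalty".split()
scores = leave_one_out_importance(tokens, toy_negative_score)
for tok, importance in scores:
    print(f"{tok:>8}: {importance:.2f}")
```

In this toy setup, "penalty" and "liable" receive the highest importance while filler words score zero, which is the kind of signal an explainability report surfaces for a real model.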
If we have stoked your curiosity about how the Censius AI observability platform can help your team to detect and counter issues in a timely manner, then you can get a customized demo or sign up for the 14-day free trial.