A Break-Down of Research in Computation & Language

BERT and Ernie Were Right All Along

A plain-English breakdown of how BERT transformed NLP with deep context, bidirectional learning, and task adaptability.

Natural language processing (NLP)—the artificial intelligence (AI) technology behind everything from virtual assistants to automated customer service—is rapidly advancing. But even with these breakthroughs, many of the models powering this tech had a surprising limitation: they could only read one way.

Imagine trying to understand a sentence by reading it left to right, without ever being allowed to glance ahead. That’s essentially how most language models worked. While this might sound subtle, it led to major blind spots, especially when trying to understand real-world language that relies on context: sometimes the most important part of a sentence is at the end.

This limitation created friction in many applications. A chatbot might misunderstand a complaint buried in polite phrasing. A search engine might rank results poorly because it couldn’t fully grasp the user’s intent. And in industries like healthcare or finance, where precision is critical, these gaps could mean real risk.

That’s the problem that the Google researchers behind Bidirectional Encoder Representations from Transformers (BERT) set out to solve. They asked a fundamental question: what if a model could look at all the context (both before and after a word) to understand language more like humans do?

Before BERT, most models read in a single direction, usually left to right. Some clever systems tried to patch this by training two separate models (one forward and one backward), but these were stitched together awkwardly and lacked cohesion. What was missing was a model that could natively understand the full context of a sentence from both sides at the same time.

To solve this, the researchers built a framework on top of an already powerful technology called the Transformer. Think of the Transformer as a flexible, modular architecture that helps machines pay attention to different parts of a sentence… not just word by word, but by understanding how each word relates to every other word.

BERT takes this a step further by training a deep, bidirectional version of the Transformer. That means, instead of only looking at the words before a given word (as many previous models did), BERT looks both before and after. For example, to understand the meaning of the word “bank” in “he sat on the bank and watched the river,” BERT learns to use the surrounding context (“sat,” “watched,” “river”) to correctly determine that this “bank” is not a financial institution.
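
To make that concrete, here is a toy sketch of the attention idea in Python, using NumPy and made-up random vectors rather than BERT's real, learned parameters. The point is only the shape of the computation: every word is scored against every other word, in both directions, and those scores decide how much of each neighbor's meaning gets mixed into that word's representation.

```python
import numpy as np

words = ["he", "sat", "on", "the", "bank", "and", "watched", "the", "river"]
rng = np.random.default_rng(0)
d = 8                                            # tiny vector size, just for illustration
vectors = rng.normal(size=(len(words), d))       # stand-in word vectors (not real embeddings)

# Scaled dot-product attention, using the same vectors as queries, keys, and values.
scores = vectors @ vectors.T / np.sqrt(d)        # how strongly each word relates to every other word
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # softmax: each row sums to 1
contextual = weights @ vectors                   # each word's new, context-mixed representation

bank = words.index("bank")
print(list(zip(words, weights[bank].round(2))))  # attention "bank" pays to every word, left AND right
```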

But this kind of learning requires a new kind of training. Traditional models predict the next word in a sentence, which only encourages left-to-right learning. BERT flips that idea. Instead of predicting the next word, it masks random words in a sentence and asks the model to guess what they were—using both the words before and after the missing word. This task is called a “masked language model”, or MLM.
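
Here is a minimal sketch of just that masking step, in plain Python. The real setup is more involved (the paper selects about 15% of tokens, and doesn't always swap in the literal placeholder), but the core idea is simple: hide some words, keep a record of what they were, and train the model to recover them from the context on both sides.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a random subset of tokens; the model's job is to recover them
    using the words on BOTH sides of each gap."""
    masked, answers = [], {}
    for position, token in enumerate(tokens):
        if random.random() < mask_rate:
            answers[position] = token      # the hidden word becomes the training label
            masked.append(mask_token)
        else:
            masked.append(token)
    return masked, answers

sentence = "he sat on the bank and watched the river".split()
masked, answers = mask_tokens(sentence, mask_rate=0.3)   # higher rate so the demo likely shows a mask
print(masked)     # e.g. ['he', 'sat', 'on', 'the', '[MASK]', 'and', 'watched', 'the', 'river']
print(answers)    # e.g. {4: 'bank'}
```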

BERT also learns relationships between sentences. It’s trained to determine whether one sentence logically follows another. For instance, given two sentences (“She opened the car door.” and “She sat down inside.”), the model learns that the second sentence likely follows the first. This helps BERT handle more complex language tasks that require understanding multiple sentences in context, like question answering or summarization.
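
A rough sketch of how such sentence pairs can be assembled for training, following the 50/50 recipe described in the paper (half the pairs are genuine neighbors, half pair a sentence with a random one from elsewhere):

```python
import random

def make_sentence_pair(document, corpus):
    """Build one next-sentence-prediction example: half the time the real
    next sentence ("IsNext"), half the time a random sentence ("NotNext")."""
    i = random.randrange(len(document) - 1)
    first = document[i]
    if random.random() < 0.5:
        return first, document[i + 1], "IsNext"
    return first, random.choice(corpus), "NotNext"

document = ["She opened the car door.", "She sat down inside.", "Then she drove off."]
corpus = ["The quarterly report is due Friday.", "Penguins cannot fly.", "He ordered soup."]
print(make_sentence_pair(document, corpus))
```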

By combining deep bidirectional learning with a training setup that mimics how we fill in blanks or infer relationships, BERT brought a more holistic understanding of language to machines. This allowed it to perform remarkably well on a wide range of tasks (without needing entirely different systems for each one). Instead, BERT can be fine-tuned slightly for specific jobs, like answering questions, sorting emails, or identifying entities in text.
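
As a concrete (if hedged) illustration of that reuse, here is what the fine-tuning setup looks like with the open-source Hugging Face transformers library, which is not part of the original paper: the same pre-trained encoder is loaded, a small task-specific classification head is bolted on top, and only that combination is trained further on the new task's examples.

```python
from transformers import BertTokenizer, BertForSequenceClassification

# The same pre-trained encoder every task starts from...
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# ...plus a small, randomly initialized classification head for this particular job
# (here: two labels, e.g. "complaint" vs. "not a complaint").
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The delivery arrived two weeks late.", return_tensors="pt")
logits = model(**inputs).logits   # scores are meaningless until the head is fine-tuned
print(logits.shape)               # torch.Size([1, 2]) -> one score per label
```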

Once BERT was trained with this innovative two-part approach (MLM and next sentence prediction), the next step was to put it to the test. To prove its usefulness, the research team didn’t rely on just one benchmark or dataset. Instead, they threw BERT into a wide range of language tasks to see how well it could perform in practical, real-world applications.

The experiments spanned three broad categories: text classification, question answering, and named entity recognition (NER), which labels things like names or locations in text. Each of these required different skills. For instance, in question answering, a model has to read a paragraph and then point to the exact phrase that answers a given question (a toy sketch of that step follows below). In text classification, the model might need to decide whether two sentences agree with each other or not. And in NER, it has to scan a sentence and label specific words as people, organizations, or other categories.
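
For the question-answering case specifically, here is that "point to the exact phrase" step in miniature. The scores below are made up with random numbers (a real model would compute them from the paragraph and the question), but the mechanics are the same: score every word as a possible start and end of the answer, then take the best span.

```python
import numpy as np

context = "BERT was developed by researchers at Google and released in 2018".split()
rng = np.random.default_rng(1)
start_scores = rng.normal(size=len(context))     # stand-ins for the model's real outputs
end_scores = rng.normal(size=len(context))

start = int(np.argmax(start_scores))                   # most likely first word of the answer
end = start + int(np.argmax(end_scores[start:]))       # best end position at or after the start
print(" ".join(context[start:end + 1]))                # the predicted answer phrase
```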

What made BERT’s testing particularly rigorous was the use of established benchmarks: collections of standard challenges that many researchers had used in prior years. This meant BERT wasn’t just being evaluated in isolation; it was competing directly against a history of published models. These benchmarks included large, multi-task datasets that tested a model’s ability to generalize across many types of language problems. There were also highly specialized tasks, like extracting facts from articles or inferring commonsense knowledge, where ambiguity and nuance are especially tough for machines to handle.

To evaluate how well BERT was doing, the researchers focused on standard performance metrics. In simpler terms, they looked at how often the model got the answers right. For tasks like classification (e.g., does this sentence express positive or negative sentiment?), success was measured in terms of accuracy. For more nuanced jobs like question answering or information extraction, they used metrics like precision and recall (how many of the answers BERT gave were actually correct, and how many of the correct answers it managed to find).
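
A small, self-contained sketch of those two measures, using made-up answers, may make them clearer:

```python
def precision_recall(model_answers, correct_answers):
    """Precision: what fraction of the model's answers were right.
       Recall: what fraction of the right answers the model actually found."""
    model_answers, correct_answers = set(model_answers), set(correct_answers)
    hits = len(model_answers & correct_answers)
    precision = hits / len(model_answers) if model_answers else 0.0
    recall = hits / len(correct_answers) if correct_answers else 0.0
    return precision, recall

# Toy example: the model extracted three entities; two of them are real.
found = {"Google", "BERT", "Monday"}
truth = {"Google", "BERT", "Transformer"}
print(precision_recall(found, truth))   # (0.666..., 0.666...)
```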

Beyond just measuring raw scores, the team ran what are called “ablation studies.” These are like controlled experiments where you intentionally remove or change parts of the model to see how performance is affected. For example, what happens if you take out the sentence-pair training objective? Or if you make the model smaller? By isolating these elements, the researchers were able to identify which parts of BERT’s architecture and training were truly responsible for its gains.

One key insight was the importance of bidirectionality. When the team replaced BERT’s bidirectional learning with a more traditional left-to-right approach, performance dropped sharply across the board. This confirmed that letting the model see the full context of a sentence was not just a nice-to-have—it was essential.

Another notable finding: BERT’s performance improved dramatically as the model got larger. That is, when the number of layers and parameters increased, so did its ability to understand language. This pointed to an important trade-off—better results came at the cost of more computing power, something only a few organizations could afford at the time.

Lastly, BERT showed a surprising level of flexibility. Once it had been pre-trained, it could be adapted to new tasks with relatively little additional effort. This was a big shift from the norm, where different tasks usually required entirely different models. With BERT, companies and researchers could build once and reuse many times, a major step forward in the efficiency and scalability of language-based AI.

To understand whether BERT was truly a breakthrough or just a one-hit wonder, the researchers didn’t stop at scoring high on benchmark tasks. They wanted to know why it worked, where it worked best, and when it failed. This led them to probe deeper into how BERT behaved in different scenarios… not just whether it got the right answer, but also what factors influenced its performance.

One approach was to stress-test the model’s robustness. Did it still perform well when the training data was smaller? What if you removed one of its key training objectives? Or used a simplified version with fewer layers? These experiments helped the researchers understand BERT’s design strengths and weaknesses. For instance, they discovered that the sentence relationship training (where the model learns to predict whether one sentence follows another) was especially important for tasks involving reasoning between multiple sentences. Without it, BERT’s performance on those tasks took a noticeable hit.

They also paid attention to consistency across different types of language. Was BERT equally good at understanding conversational language, formal writing, or technical text? The results varied, depending on how well those language types were represented in the pre-training data. This pointed to a broader insight: BERT was only as good as the diversity and scale of the language it had seen.

Still, the sheer range of tasks on which BERT excelled gave researchers strong evidence of its generality. It wasn’t just succeeding in isolated tests; it was also handling grammar, semantics, inference, and context (all within a single, reusable framework). That level of cross-task adaptability hadn’t been achieved before with such minimal task-specific tweaking. From an evaluation standpoint, that flexibility was just as important as the high scores themselves.

Yet BERT was far from perfect. One of its biggest limitations was baked into how it was trained. To teach BERT to guess missing words, the researchers introduced an artificial “[MASK]” token into the training data (essentially a placeholder that signaled where a word had been removed). But during fine-tuning or deployment, that token never appears. This creates a mismatch: the model is trained with a crutch that it doesn’t get to use in the real world. As a result, its predictions may be slightly skewed in ways that don’t always align with natural language use.
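
The paper does soften this mismatch somewhat: the words selected for prediction are not always replaced with the literal placeholder. A small sketch of that corruption rule, using the 80/10/10 split the authors describe:

```python
import random

def corrupt_selected_word(word, vocabulary):
    """For a word chosen for prediction: usually swap in [MASK], sometimes a
    random word, sometimes leave it alone, so the model can't assume that
    [MASK] always marks the blank."""
    r = random.random()
    if r < 0.8:
        return "[MASK]"                   # 80%: the artificial placeholder
    if r < 0.9:
        return random.choice(vocabulary)  # 10%: a random real word
    return word                           # 10%: the original word, untouched

vocabulary = ["river", "money", "watched", "door", "sat"]
print([corrupt_selected_word("bank", vocabulary) for _ in range(8)])
```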

Another practical challenge was scale. BERT’s best-performing versions were massive, with hundreds of millions of parameters. Training them required specialized hardware (like TPUs) and many days of processing power, something only large tech companies or research institutions could afford. That put BERT out of reach for smaller teams or applications with tight computational budgets.

Looking ahead, there are several directions for future work. One is interpretability—understanding how BERT makes decisions, not just whether those decisions are correct. This is crucial in areas like healthcare or law, where the “why” matters as much as the “what.” Another is domain adaptation: training models like BERT to specialize in particular industries or languages without needing to start from scratch. Finally, there is how to make BERT more efficient (through model compression, smarter training techniques, or newer architectures that could deliver similar performance with fewer resources).

Despite these challenges, BERT’s broader impact was immediate and far-reaching. It changed the way researchers approached language understanding. It laid the groundwork for a new wave of models that adopted and refined its techniques. And most importantly, it brought machine learning (ML) a step closer to handling the complexity of human communication… something that had long been a dream but, with BERT, began to feel within reach.

