Blog

Protein Language Models Explained for Bench Scientists

Apr 3, 2026

Your colleague mentions that their new prediction tool uses "ESM-2 embeddings." Your PI asks whether you should use "protein language models" for variant analysis. A preprint claims that "masked language modeling" predicts mutation effects better than Rosetta. You nod along in the lab meeting, but privately you have no idea what any of this actually means, whether it's hype, or whether you should care.


You should care. Protein language models are quietly reshaping how we predict structure, function, and mutation effects—but the field has done a remarkably poor job explaining them to the bench scientists who actually use proteins.

Key Takeaways

  • Protein language models (PLMs) learn evolutionary patterns from millions of protein sequences without any labels or supervision

  • ESM-2 is the most widely used PLM: 650 million parameters trained on ~250 million sequences, producing per-residue "embeddings" that encode evolutionary and structural information

  • PLMs complement AlphaFold, not replace it: AlphaFold predicts 3D coordinates; PLMs provide rich sequence-level representations useful for variant scoring, function prediction, and engineering

  • "Embeddings" are high-dimensional fingerprints: a 1,280-number vector per residue that captures what the model learned about that position's evolutionary context

  • PLMs are genuinely useful but not magic: they excel at variant effect prediction and transfer learning for small datasets, but can't predict binding affinity or replace experimental validation

The Core Analogy: Protein Sequences as Language

Why the Analogy Works

Natural language processing (NLP) revolutionized text understanding by treating words as elements in a structured language with grammar and meaning. The insight behind protein language models is that protein sequences behave similarly:

NLP Concept

Protein Equivalent

Words

Amino acids (20-letter alphabet)

Sentences

Protein sequences

Grammar

Physical and evolutionary constraints on sequence

Meaning

Structure, function, interactions

Dialects

Protein families with conserved patterns

Typos

Deleterious mutations

Synonyms

Conservative substitutions (V↔I, D↔E)

The key parallel: Just as a language model learns that "The cat sat on the ___" should be completed with "mat" or "chair" (not "democracy"), a protein language model learns that position 145 in a kinase should be a hydrophobic residue (not an aspartate)—without being told anything about kinases, hydrophobicity, or protein folding.

What the Model Actually Learns

When you train a language model on billions of words of English, it learns grammar, semantics, and world knowledge—all implicitly. Similarly, when you train a protein language model on hundreds of millions of protein sequences, it implicitly learns (Rives et al., 2021):

  • Conservation patterns: Which residues are tolerated at each position in a protein family

  • Co-evolution: Which positions change together (often because they're in contact)

  • Structural context: Buried positions prefer hydrophobic residues; surface positions prefer hydrophilic ones

  • Functional constraints: Active site residues are more conserved than surface residues

  • Folding requirements: Secondary structure preferences, turn propensities, disorder signals


None of this is programmed in. The model discovers it from data alone.

What Is a Protein Language Model?

Self-Supervised Learning on Sequences

The central idea is self-supervised learning: you train the model on a vast amount of data without manual labels.


The training process:

  1. Collect sequences: Gather hundreds of millions of protein sequences from databases like UniRef

  2. Mask residues: Randomly hide ~15% of amino acids in each sequence

  3. Predict the masked residues: Train the model to predict what was hidden, based on the surrounding context

  4. Repeat: Process billions of examples until the model gets very good at prediction


This is called masked language modeling (MLM)—the same approach that powers BERT in NLP.


What the model learns from this simple task:


To predict a masked residue accurately, the model must understand:

  • What amino acids are allowed at this position (conservation)

  • How neighboring residues constrain the choice (local structure)

  • How distant residues influence this position (long-range contacts, coevolution)

  • Whether this position is buried or exposed (hydrophobicity patterns)

  • Whether this position is in a helix, strand, or loop (secondary structure preferences)


All of this emerges from the single task of predicting masked residues.

Architecture: The Transformer

PLMs use the transformer architecture—the same design behind GPT and BERT (Vaswani et al., 2017). The key mechanism is attention: for each residue, the model computes how much every other residue in the sequence matters for understanding it.


Why this matters for proteins:

  • A residue in the active site might "attend to" other active site residues 200 positions away

  • Residues in contact in 3D space often attend to each other, even if distant in sequence

  • The attention patterns implicitly encode contact maps and structural relationships


You don't need to understand the math. The practical takeaway: transformers are architecturally well-suited to proteins because they can capture long-range dependencies in sequences—exactly what proteins have.

ESM-2: The Workhorse Model

What It Is

ESM-2 (Evolutionary Scale Modeling 2) is Meta AI's protein language model (Lin et al., 2023). It's the most widely used PLM in both academic and industrial protein science.

Key specifications:

Property

Value

Developer

Meta AI (FAIR)

Parameters

650 million (most common variant)

Training data

~250 million sequences from UniRef

Input

Single amino acid sequence

Output

Per-residue embeddings (1,280 dimensions)

Inference time

Seconds for a typical protein

Availability

Open source (MIT license)

Model size variants:

Variant

Parameters

Embedding Dim

Best For

ESM-2 (8M)

8 million

320

Quick prototyping

ESM-2 (35M)

35 million

480

Moderate accuracy, fast

ESM-2 (150M)

150 million

640

Good balance

ESM-2 (650M)

650 million

1,280

Best accuracy (standard choice)

ESM-2 (3B)

3 billion

2,560

Marginal improvement, much slower

ESM-2 (15B)

15 billion

5,120

Diminishing returns

The 650M parameter model is the standard. Larger models provide marginal improvements that rarely justify the computational cost.

Why ESM-2 Matters

ESM-2 is important because it provides a universal protein representation that can be used for almost any downstream task:

  • No alignment needed: Unlike methods that require multiple sequence alignments (MSAs), ESM-2 takes a single sequence as input

  • Fast: Inference takes seconds, not the minutes/hours required for MSA construction + AlphaFold

  • Transferable: The same embeddings work for function prediction, stability prediction, variant scoring, and more

  • Captures evolution implicitly: 250 million sequences' worth of evolutionary information is compressed into the model's weights

What "Embeddings" Actually Are

The Intuition

When ESM-2 processes a protein sequence, it produces a 1,280-dimensional vector for each residue. This vector is called an "embedding."


Analogy: Think of it as a fingerprint for each residue's evolutionary and structural context. Just as a fingerprint uniquely identifies a person, an embedding uniquely characterizes what the model "knows" about that position in that protein.


What's encoded in those 1,280 numbers:

  • Conservation pattern at this position

  • Secondary structure propensity

  • Buried vs exposed character

  • Proximity to functional sites

  • Co-evolutionary relationships with other positions

  • Whether the position is in a structured or disordered region


You can't look at the 1,280 numbers and read off "this residue is in a helix." But when you train a simple classifier on top of the embeddings, it can predict secondary structure with ~85% accuracy—the information is there, just encoded in a distributed way.

How Embeddings Are Used

Direct use (zero-shot):

  • Predict mutation effects by comparing the model's prediction at a position with the actual amino acid

  • If the model says "this position should be valine" and you mutate it to aspartate, that's likely deleterious


Transfer learning (fine-tuned):

  • Take the embeddings as input features

  • Train a small neural network on top for a specific task (e.g., predicting ΔΔG)

  • Works well even with small training datasets (hundreds of examples)

  • This is how most practical tools use ESM-2


Representation learning:

  • Use embeddings as features for clustering, visualization, or search

  • Similar proteins have similar embeddings

  • Enables protein "search by meaning" rather than by sequence identity

What "Masked Language Modeling" Means in Practice

The Training Task

During training, ESM-2 sees a sequence like:

For each masked position, the model outputs a probability distribution over all 20 amino acids. The training loss pushes the model to assign high probability to the correct amino acid.

The Practical Implication

After training, you can use this prediction capability directly:


Zero-shot variant scoring:

  1. Feed your wild-type sequence to ESM-2

  2. For each position, the model outputs probabilities for all 20 amino acids

  3. The wild-type amino acid usually has high probability (the model "agrees" with evolution)

  4. A mutation to an amino acid with low probability is likely deleterious

  5. A mutation to an amino acid with moderate probability may be tolerated


The score: log(P_mutant / P_wildtype) gives a variant effect score. Negative = likely deleterious. Positive = possibly beneficial or neutral.


Meier et al. (2021) showed that this zero-shot approach—without ANY training on experimental mutation data—correlates well with deep mutational scanning experiments across multiple proteins.

What PLMs Can Predict (And How Well)

Tasks Where PLMs Excel

Task

How It Works

Accuracy

Reference

Secondary structure

Fine-tune on labeled data

~85% per-residue (Q3)

Rives et al., 2021

Contact prediction

Attention maps → contacts

Competitive with coevolution methods

Rao et al., 2021

Variant effect prediction

Zero-shot or fine-tuned

Spearman ρ ~0.4–0.5 with DMS data

Meier et al., 2021

Subcellular localization

Fine-tune on labeled data

~90% accuracy

Multiple studies

Function (GO terms)

Fine-tune on labeled data

AUC ~0.8–0.9 depending on category

Multiple studies

Disorder prediction

Fine-tune or zero-shot

Comparable to dedicated tools

Ruff & Pappu, 2021

Tasks Where PLMs Struggle

Task

Why It's Hard

Better Alternative

Precise 3D structure

Sequence alone lacks 3D information

AlphaFold (uses MSAs + structure modules)

Absolute binding affinity

Affinity depends on 3D geometry, electrostatics, solvation

Physics-based methods, experimental measurement

Protein-protein interaction prediction

Requires modeling two sequences jointly

AlphaFold-Multimer, coevolution methods

Small molecule interactions

PLMs see only amino acids, not ligands

Docking, molecular dynamics

Post-translational modifications

Some PTMs have weak sequence signals

Combined approaches (sequence + structure)

Protein dynamics

Single representation, no conformational sampling

Molecular dynamics, NMR

The Honest Assessment

PLMs are genuinely useful for variant scoring, function prediction, and as input features for downstream models. They are NOT magical—they capture evolutionary patterns, not physics. If a property depends on structural details, electrostatics, or dynamics, PLMs alone won't get you there.

PLMs vs AlphaFold: Complementary, Not Competing

This is the most common source of confusion. PLMs and AlphaFold are different tools that solve different problems:

Feature

Protein Language Models (ESM-2)

AlphaFold2

Input

Single sequence

Sequence + MSA + templates

Output

Per-residue embeddings (1,280D vectors)

3D atomic coordinates

Speed

Seconds

Minutes to hours

What it captures

Evolutionary patterns, sequence-level features

3D structure, spatial relationships

MSA required?

No

Yes (critical for accuracy)

Best for

Variant scoring, function prediction, search

Structure prediction, docking, interface analysis

Limitations

No 3D structure, no physics

Slow, requires MSA, single conformation

How they work together:

  • ESM-2 provides fast, universal sequence representations

  • AlphaFold provides detailed 3D structural models

  • Many modern tools combine both: ESM-2 embeddings as features + AlphaFold structure as context

  • ESMFold uses ESM-2 embeddings to predict structure directly (faster than AlphaFold but less accurate)

Comparison: ESM-2 vs ProtTrans vs AlphaFold Embeddings

Model

Source

Key Strengths

Common Uses

ESM-2

Meta AI

Largest training set, best overall performance, open source

Variant effect, function prediction, stability

ProtTrans (ProtT5)

TU Munich

Competitive performance, T5 architecture

Function prediction, localization

AlphaFold embeddings

DeepMind

Structure-aware, captures 3D relationships

Structure-dependent predictions

ESM-1v

Meta AI

Specifically designed for variant effects

Mutation scanning, clinical variant interpretation

For most applications, ESM-2 (650M) is the default choice. It's well-validated, widely supported, and open source. ProtT5 is a reasonable alternative with similar performance. AlphaFold embeddings add structural information but are slower to compute.

How PLMs Are Changing Practical Protein Science

Variant Effect Prediction

The old way: Run Rosetta or FoldX on every possible mutation. Requires a structure. Slow. Physics assumptions may be wrong.


The PLM way: Run ESM-2 zero-shot scoring on every possible mutation. No structure needed. Fast (minutes for entire saturation mutagenesis). Captures evolutionary constraints that physics-based methods miss.


The best way: Combine both. Use PLM scores as one feature alongside ΔΔG from FoldX/Rosetta. Combined models outperform either alone.

Zero-Shot Mutation Scanning

For any protein, you can generate a "mutation landscape" in minutes:

  1. Feed the sequence to ESM-2

  2. For every position, get the model's predictions for all 20 amino acids

  3. Score each possible mutation by log-likelihood ratio

  4. Visualize as a heatmap: positions (rows) × amino acids (columns)


This gives you an immediate sense of which positions are mutationally tolerant and which are constrained—without any experimental data.

Transfer Learning for Small Datasets

The problem: You have 50 experimentally measured ΔTm values for your protein. Training a machine learning model on 50 data points usually gives terrible results.


The PLM solution: Use ESM-2 embeddings as input features. The embeddings already encode evolutionary information from 250 million sequences. Your small dataset just needs to teach the model how to map from those rich features to ΔTm. This is transfer learning, and it works remarkably well.


Studies show that ESM-2-based models can achieve useful accuracy with as few as 50–100 training examples—a fraction of what's needed for models trained from scratch.

Protein Engineering with ProteinMPNN

ProteinMPNN (Dauparas et al., 2022) takes the language model concept further: given a protein backbone structure, it designs sequences that would fold into that structure. This is structure-conditioned language modeling—the model generates "protein language" that is consistent with a given 3D arrangement.


Practical applications:

  • Redesign protein surfaces for improved solubility

  • Design sequences for de novo protein structures

  • Generate diverse sequences for directed evolution starting libraries

The Hype vs Reality Check

What the Preprints Claim vs What Works in the Lab

Claim

Reality

Verdict

"PLMs predict mutation effects"

Zero-shot scores correlate with DMS data (ρ ~0.4–0.5). Useful for ranking, not for precise ΔΔG prediction.

Partially true. Good for prioritization, not precise quantification.

"ESM-2 understands protein structure"

Embeddings encode structural features implicitly. ESMFold predicts structure but less accurately than AlphaFold.

Partially true. Captures structural patterns but isn't a structure predictor.

"PLMs replace sequence alignments"

For some tasks (variant scoring), single-sequence PLMs match MSA-based methods. For structure prediction, MSAs are still superior.

Context-dependent. Faster, comparable for some tasks, worse for others.

"PLMs will design better proteins"

ProteinMPNN designs have high experimental success rates (~50–70% fold correctly). But design is not the same as optimization.

Promising but early. Laboratory validation still required.

"Bigger models are always better"

ESM-2 (650M) to ESM-2 (15B) shows diminishing returns on most benchmarks.

Mostly false. The 650M model is the sweet spot.

How to Evaluate Whether a PLM-Based Tool Is Trustworthy

Questions to Ask

  1. What was it trained on? Tools trained on data similar to your protein of interest will perform better. Ask about the training set composition.

  2. Was it validated on held-out data? Any tool can overfit its training set. Look for performance on proteins the model never saw during training.

  3. Does it report uncertainty? Good models tell you when they're confident and when they're guessing. If a tool gives you a single number with no error bar, be skeptical.

  4. Has it been tested experimentally? Computational benchmarks are necessary but not sufficient. Look for papers where PLM predictions were validated in the lab.

  5. Is it appropriate for your protein? PLMs trained primarily on globular, soluble proteins may perform poorly on membrane proteins, IDPs, or viral proteins. Check whether your protein type was represented in training.

Red Flags

  • Claims of >90% accuracy without specifying what "accuracy" means

  • No comparison to simple baselines (e.g., conservation-based methods)

  • Trained and tested on the same dataset

  • No error bars or uncertainty estimates

  • Works "on all proteins" with no caveats

The Bottom Line

Question

Answer

"What is a protein language model?"

A neural network trained on millions of sequences to learn evolutionary patterns, without manual labels

"What is ESM-2?"

Meta's 650M-parameter PLM, the most widely used model for protein representation

"What are embeddings?"

1,280-number fingerprints per residue that encode everything the model learned about that position

"What is masked language modeling?"

The training task: hide residues, predict them back. Forces the model to learn sequence patterns.

"Do PLMs replace AlphaFold?"

No. PLMs give sequence-level representations; AlphaFold gives 3D structures. They're complementary.

"Should I use PLM-based tools?"

Yes, for variant prioritization and function prediction. No, as a replacement for experimental validation.

"Are bigger models better?"

Marginally. ESM-2 (650M) is the practical sweet spot.

Protein language models are the most significant methodological advance in computational protein science since AlphaFold. They won't replace your experiments, but they'll help you design better ones.

References

  1. Rives A, et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 118(15):e2016239118. Link

  2. Lin Z, et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123-1130. Link

  3. Meier J, et al. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS, 35. PMC8886682

  4. Dauparas J, et al. (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615):49-56. Link

  5. Vaswani A, et al. (2017). Attention is all you need. NeurIPS, 30. Link

  6. Rao RM, et al. (2021). MSA Transformer. ICML, 139:8844-8856. Link

  7. Ruff KM & Pappu RV. (2021). AlphaFold and implications for intrinsically disordered proteins. Journal of Molecular Biology, 433(20):167208. Link

  8. Elnaggar A, et al. (2022). ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112-7127. Link

  9. Notin P, et al. (2022). Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. ICML. PMC9477541

  10. Hsu C, et al. (2022). Learning inverse folding from millions of predicted structures. ICML. Link

  11. Frazer J, et al. (2021). Disease variant prediction with deep generative models of evolutionary data. Nature, 599:91-95. Link