Publication
Çağlar Bozkurt
The Same Sequence Signal Can Mean Opposite Things

One of the easiest mistakes in de novo binder design is assuming that a useful signal in one setting should also help in another.
That sounds reasonable. It is also often wrong.
In our latest preprint, we re-analysed two public binder benchmarks and found that some biology-informed sequence signals do transfer across settings — but some of the most interesting ones flip direction depending on how the binder is deployed.
That is the core idea of this post: the same sequence signal can help in one context and hurt in another.
This matters because binder evaluation is often treated like a universal ranking problem. A design looks structurally plausible, seems likely to bind, gets a strong score, and is pushed forward. But experimental success depends on more than that. A binder is not just trying to bind a target. It is being asked to function in a specific architecture, assay, and biological setting.
Two Contexts, Two Different Jobs
In the paper, we looked at two public datasets with very different constraints.
The first was the Bits to Binders CAR-T CD20 benchmark, where the binder is part of a membrane-displayed CAR construct in primary human T cells. In that setting, the binder has to work inside a larger receptor system and pass multiple experimental gates, including expression-related recovery, enrichment, and depletion.
The second was the Adaptyv EGFR benchmark, where the binder is tested as a standalone protein using cell-free expression and bio-layer interferometry. There, the binder must fold independently and engage a large extracellular target surface without the same receptor-context constraints.
These are not minor differences. They change what success looks like.
A binder embedded in a CAR is solving a different problem from a binder that must stand alone as a compact folded unit.

Not Everything Transfers Cleanly
Some signals did transfer.
Lower aggregation propensity was the most robust shared signal across both benchmarks. Predicted PTM-site density also showed the same univariate direction in both datasets, although that result was less clean in EGFR because of sequence-length confounding.
But the more interesting finding was that several feature families were significant in both datasets and still reversed direction. The paper grouped these as architecture-dependent signals. The main examples were topology-like character, disorder, and disulfide-related features.
That is where universal screening logic starts to break down.
Example 1: Disorder
In the EGFR benchmark, low disorder was one of the strongest signals associated with binding success. Successful binders there looked more ordered and more compatible with independently folded standalone structures.
In the CAR-T benchmark, the story was different. Disorder-related effects did not point in the same direction. The paper specifically highlights a context where C-terminal disorder is consistent with the different architectural logic of a hinge-linked, membrane-displayed binder rather than a compact standalone fold.
So disorder is not simply “good” or “bad.”
Its meaning depends on the job the sequence is being asked to do.
Example 2: Topology-like Character
A similar flip appeared in topology-like descriptors.
In the CAR-T setting, outside-like character aligned better with enrichment success. In the EGFR setting, successful binders were more consistent with compact, globular, standalone fold logic.
This is exactly the sort of thing a universal ranking system can miss. A feature that looks favorable in one deployment setting can become neutral or actively harmful in another.
Example 3: Disulfide-related Signals
Disulfide-related features also changed meaning across contexts.
In the EGFR benchmark, they were associated with success, which is consistent with stabilizing standalone architectures. In the CAR-T context, the same class of feature looked more like a risk factor, consistent with a possible misfolding liability in a multi-domain receptor format.
Again, the point is not that disulfides are universally helpful or harmful.
The point is that the interpretation changes with context.
Why This Matters
A lot of binder evaluation still assumes that stronger-looking features can be ranked globally.
More compact-looking sequence. Better.
More stability-like signal. Better.
More of some favorable descriptor. Better.
But this paper suggests a more realistic framing: the same descriptor can carry different implications in different deployment contexts.
That is why we grouped the signals into three layers:
transferable
architecture-dependent
context-specific
That framework is more useful than pretending every binder problem can be solved with one universal score.
A Better Screening Mindset
This does not mean structure should be ignored.
It does not mean affinity is irrelevant.
And it does not mean every context-specific feature should be overinterpreted.
It means screening should be layered.
Start with signals that look broadly transferable, such as aggregation-related risk. Then adapt the next layer to the architecture. Then consider context-specific features tied to the assay or deployment environment. That is the screening logic outlined in the paper.
In the CAR-T benchmark, a retrospective biology-informed filter stack increased hit rate from 13.8% to 38.6%, a 2.8× lift. But importantly, that full stack did not transfer cleanly to EGFR. Only the more universal layer did. That is not a flaw in the result. It is the point of the result.

The Broader Takeaway
The main message of the paper is simple: binder success depends on deployment context.
Not just target.
Not just affinity.
Not just folding confidence.
Context.
How the binder is expressed, what architecture it sits inside, what kind of fold it needs, what experimental gate it is passing through, and what biological compatibility it needs — all of that shapes whether a signal helps, hurts, or means nothing at all.
That is why we think de novo binder evaluation should move away from universal ranking logic and toward context-aware, multi-gate screening.
Important Limitation
This study is a retrospective re-analysis of public benchmark data. The proposed framework is hypothesis-generating, not a finished production rulebook, and still needs prospective validation.
But even at this stage, the message is useful: before asking whether a signal is good, ask what context it is good in.
You can read the full preprint here: https://www.biorxiv.org/content/10.64898/2026.04.13.718094v1