Designing molecules for targets lacking abundant data or 3D structures remains a bottleneck, as most models rely on known binders or pocket geometries. To address this, we present PharMistral, a sequence-conditioned ligand generator based on Mistral-7B. Unlike DeepTarget (task-specific) or ChemGPT (SMILES-only), PharMistral adapts a single 7B-parameter model via a two-stage process: (1) joint pre-training on more than 300,000 unpaired human proteins and more than 1 million drug-like SMILES to learn a shared biochemical language; and (2) end-to-end fine-tuning on about 300,000 high-affinity human protein–ligand pairs curated from ChEMBL-26. At inference, PharMistral autoregressively generates ligands from raw protein sequences without target-specific retraining or structural input. On unseen proteins, the model achieved 99.5% chemical validity. Of valid molecules, 57% were unique, and 36% of those were novel compared to the training set. After drug-likeness and toxicity filtering, 3383 unique molecules were retained (55% of the unique valid set and 31% of all valid molecules).
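The three generation metrics above are nested: validity is measured over all generated strings, uniqueness over the valid ones, and novelty over the unique ones. The sketch below illustrates that nesting under stated assumptions and is not the authors' evaluation code. The names `generation_metrics` and `toy_is_valid` are hypothetical, and the parenthesis check is only a stand-in for a real chemistry parser (in practice a toolkit such as RDKit would decide validity, and uniqueness would usually be judged on canonical SMILES).

```python
from typing import Callable, Dict, Iterable


def toy_is_valid(smiles: str) -> bool:
    """Stand-in validity check (assumption): non-empty, balanced parentheses.

    A real pipeline would parse the string with a chemistry toolkit instead.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute nested validity / uniqueness / novelty fractions."""
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]   # valid among all generated
    unique = set(valid)                             # distinct among valid
    novel = unique - set(training_set)              # unseen among unique
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data, invented purely for illustration.
generated = ["CCO", "CCO", "C(C", "c1ccccc1", "CCN"]
training = {"CCO"}
metrics = generation_metrics(generated, training, toy_is_valid)
print(metrics)
```

Because the fractions are nested, the reported figures can be cross-checked: 55% of the unique set, which is itself 57% of the valid set, is about 0.55 × 0.57 ≈ 0.31, matching the stated 31% of all valid molecules.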
Bendjeddou, A., Zeinalipour, K., Bardou, D., Maggini, M., Scarselli, F., Bianchini, M. (2026). PharMistral: Drug generation for novel protein targets by Mistral Large Language Model. Computer Methods and Programs in Biomedicine Update, 10 [10.1016/j.cmpbup.2026.100257].
PharMistral: Drug generation for novel protein targets by Mistral Large Language Model
Bendjeddou, Asma;Maggini, Marco;Scarselli, Franco;Bianchini, Monica
2026-01-01
| File | Size | Format |
|---|---|---|
| PharMistral.pdf (open access; type: publisher's PDF; license: Creative Commons) | 3.17 MB | Adobe PDF |
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/11365/1320174
