BERT Feature Embeddings Fused With Physiochemical Feature Representations Yields More Accurate and Versatile Model for Prediction of Blood Brain Barrier Penetrating Peptides

MYTHREYA DHARANI; Iosif Vaisman

doi:10.13021/jssr2023.3983

Authors

MYTHREYA DHARANI School of Systems Biology, George Mason University, Fairfax, VA
Iosif Vaisman School of Systems Biology, George Mason University, Fairfax, VA

Abstract

As the largest site for blood-brain interchange, the blood-brain barrier (BBB) is structured to limit fenestrations, which lets it regulate the entry of neurotransmitters, plasma macromolecules, neurotoxins, and other unwanted substances into the central nervous system (CNS). Although peptides were historically thought unable to cross the BBB, recent research has established that several can, and they are now of interest as drugs and treatment conjugates. In silico and laboratory methods for assessing permeability can be inaccurate, expensive, or time-consuming. Machine learning methods are a promising alternative, but current approaches neglect the contextual information in the peptide sequences themselves. Thus, a novel algorithm that considers both the sequence embeddings and the physicochemical features of the peptides was developed. Sequence embeddings were constructed using ProteinBERT, a bidirectional transformer model, and five physicochemical feature vectors were generated with multiple scripts and database extractions. A multi-layer perceptron was trained on a dataset of around 500 peptides, filtered for sequence similarity with CD-HIT. The model achieved 84% accuracy, outperforming existing machine learning algorithms for BBB peptide permeability prediction. This model can therefore be used for efficient screening during peptide development, supporting the creation of effective drugs for neurological conditions.
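The feature-fusion step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The five descriptors in `physchem_features` are illustrative stand-ins, since the abstract does not name the five feature vectors used. `toy_embedding` is a hypothetical placeholder (plain amino-acid composition) standing in for a ProteinBERT embedding, and `fuse` is a name invented for this example. The sketch only shows how a sequence embedding and physicochemical descriptors might be concatenated into one input vector for a classifier such as a multi-layer perceptron.

```python
from typing import Callable, List

# The 20 standard amino acids (one-letter codes); other symbols are not handled.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index for each standard residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: a simple hydrophobic residue set chosen for illustration.
HYDROPHOBIC = set("AILMFVW")


def physchem_features(seq: str) -> List[float]:
    """Five illustrative descriptors (NOT the feature set used in the study)."""
    n = len(seq)
    if n == 0:
        raise ValueError("empty peptide sequence")
    return [
        float(n),                                     # peptide length
        sum(seq.count(r) for r in "KR") / n,          # fraction of K/R residues
        sum(seq.count(r) for r in "DE") / n,          # fraction of D/E residues
        sum(1 for r in seq if r in HYDROPHOBIC) / n,  # hydrophobic fraction
        sum(KYTE_DOOLITTLE[r] for r in seq) / n,      # mean hydropathy
    ]


def toy_embedding(seq: str) -> List[float]:
    """Hypothetical stand-in for a learned embedding: amino-acid composition."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def fuse(seq: str, embed: Callable[[str], List[float]]) -> List[float]:
    """Concatenate a sequence embedding with the physicochemical descriptors."""
    seq = seq.upper()
    return list(embed(seq)) + physchem_features(seq)


fused = fuse("KRDA", toy_embedding)
print(len(fused))  # 20 composition values + 5 descriptors
```

In a real pipeline, the `embed` argument would wrap whatever model produces the sequence embedding, and the fused vectors for all peptides would form the training matrix for the classifier.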