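ESMBind (ESMB) Ensemble Models
Summary: In this post we discuss how to build a basic ensemble from ESMBind (ESMB) models, using both "hard" and "soft" voting strategies. We show how to compute train and test metrics on a preprocessed protein-sequence dataset that has already been split into train and test sets. Note that everything below is for demonstration purposes only. These models have not been thoroughly tested and appear to overfit (see the precision, F1 score, and MCC below).
Introduction
Please note that, because of the memory constraints that come with running an ensemble, you may need to run this code example locally or on a Google Colab Pro instance. You could also try a Kaggle notebook with a P100 GPU. Alternatively, you can follow our previous post to train two or more smaller ESMB models with esm2_t6_8M_UR50D as the base model. Recall that in that post we showed how to fine-tune a binding site predictor with Low-Rank Adaptation (LoRA). We review some of that material here, but unless you are already familiar with the basics of LoRA and ensembling, it is best to read that post before continuing. Also note that this post is for demonstration purposes only. For a better ensemble, train your own models with different hyperparameters, using the example from the previous post.
ESMBind (or ESMB) is a collection of models fine-tuned with Low-Rank Adaptation (LoRA) on top of the base model ESM-2 to predict a protein's binding sites from its single sequence alone. It requires no multiple sequence alignment (MSA) and no structural information about the protein's 3D fold or backbone. This makes ESMB models accessible and simple to use, and they demand less domain knowledge to apply and understand, which makes them more interpretable. However, this may come at the cost of performance.
Recall that the post linked above showed how to apply Low-Rank Adaptation (LoRA) to the protein language model (pLM) ESM-2. LoRA is a technique that has been shown to substantially reduce overfitting in the pLM esm2_t12_35M_UR50D (see also ESM on Hugging Face). It also lets us fine-tune larger models in a parameter-efficient way. Below we provide code both for computing train/test metrics on the preprocessed dataset whose train/test split was used for the individual models in the ensemble, and for running inference on your own protein sequences.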
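To make the parameter-efficiency claim concrete, here is a back-of-the-envelope sketch of how LoRA shrinks the number of trainable weights for a single square projection layer. The hidden size of 480 is that of esm2_t12_35M_UR50D. The rank of 8 is an illustrative assumption and not necessarily the value used for the ESMB checkpoints.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights updated by ordinary fine-tuning of one linear layer (bias ignored)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights when the d_out x d_in matrix W is frozen and only the
    low-rank update B @ A is trained (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_in + d_out)


hidden = 480  # hidden size of esm2_t12_35M_UR50D
rank = 8      # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
print(f"full fine-tuning: {full} weights per projection")
print(f"LoRA (r={rank}): {lora} weights per projection ({lora / full:.1%} of full)")
```

Because the trainable count grows linearly in the hidden size rather than quadratically, the saving becomes larger for the bigger ESM-2 checkpoints.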
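Train/Test Datasets
Before you begin, download the following pickle files, then adjust the paths in the code below to match your local file paths.
Getting Train/Test Metrics on a Large Preprocessed Dataset
Note that this code can run on a Google Colab or Kaggle instance. However, on the free GPU that Colab provides, the first part of the code takes several hours to finish. The inference code, which tests the ensemble on a single protein sequence or a handful of proteins, runs in just a few seconds. So if you only want to try the models on a few protein sequences, you can skip to the final section, "Inference".
Step 0: Installs and Imports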
!pip install transformers -q
!pip install datasets -q
!pip install accelerate -q
!pip install scipy -q
!pip install scikit-learn -q
!pip install peft -q
import os
import pickle
import numpy as np
from scipy import stats
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score, matthews_corrcoef
from transformers import AutoModelForTokenClassification, Trainer, AutoTokenizer, DataCollatorForTokenClassification
from datasets import Dataset, concatenate_datasets
from accelerate import Accelerator
from peft import PeftModel
import gc
Step 1: Load the Data
In this step you load the sequences and labels of the train and test datasets from the pickle files. These are the splits used to train and evaluate the models; here they are used to compute train and test metrics.
# Step 1: Load train/test data and labels from pickle files
with open("/content/drive/MyDrive/train_sequences_chunked_by_family.pkl", "rb") as f:
    train_sequences = pickle.load(f)
with open("/content/drive/MyDrive/test_sequences_chunked_by_family.pkl", "rb") as f:
    test_sequences = pickle.load(f)
with open("/content/drive/MyDrive/train_labels_chunked_by_family.pkl", "rb") as f:
    train_labels = pickle.load(f)
with open("/content/drive/MyDrive/test_labels_chunked_by_family.pkl", "rb") as f:
    test_labels = pickle.load(f)
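Before spending hours on evaluation, it may be worth confirming that every sequence has exactly one label per residue. The helper below is a hypothetical sanity check (`check_alignment` is not part of the original code). It assumes each label entry is a list or array of 0/1 values with the same length as its sequence, which the pickle files may or may not satisfy exactly.

```python
def check_alignment(sequences, labels):
    """Return the indices of entries whose label length differs from the sequence length."""
    if len(sequences) != len(labels):
        raise ValueError(f"{len(sequences)} sequences but {len(labels)} label lists")
    return [
        i for i, (seq, lab) in enumerate(zip(sequences, labels))
        if len(seq) != len(lab)
    ]


# Toy data standing in for the unpickled lists
toy_sequences = ["MAVPE", "TRPN"]
toy_labels = [[0, 0, 1, 1, 0], [0, 1, 1]]  # the second entry is one label short

mismatched = check_alignment(toy_sequences, toy_labels)
print("Misaligned entries:", mismatched)
```

On the real data the call would be `check_alignment(train_sequences, train_labels)`, and an empty list would mean every entry lines up.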
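Step 2: Batch Tokenization and Dataset Creation
In this step the sequences are tokenized with a pretrained tokenizer. Tokenization converts the input text into tokens, represented as integer IDs. The tokenized sequences and labels are then used to build the datasets.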
# Step 2: Define the Tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
max_sequence_length = tokenizer.model_max_length
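As a rough picture of what the tokenizer produces, the toy function below maps each residue to an integer and wraps the sequence in start and end tokens, as the ESM-2 tokenizer does with `<cls>` and `<eos>`. The vocabulary and ID values here are made up for illustration. The real mapping comes from `AutoTokenizer`.

```python
# Hypothetical vocabulary: the real ESM-2 tokenizer assigns different IDs.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
toy_vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def toy_tokenize(sequence):
    """One token per residue, with <cls> prepended and <eos> appended."""
    tokens = ["<cls>"] + list(sequence) + ["<eos>"]
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]


ids = toy_tokenize("MAVP")
print(ids)
print(len(ids))  # sequence length plus two special tokens
```

Because of the single leading start token, token index i lines up with residue i in 1-based numbering. This is why the inference code later in this post can print residue positions directly from `enumerate`.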
Step 3: Compute Metrics in Batches to Save Memory
# Step 3: Define a `compute_metrics_for_batch` function.
def compute_metrics_for_batch(sequences_batch, labels_batch, models, voting='hard'):
    # Tokenize batch
    batch_tokenized = tokenizer(sequences_batch, padding=True, truncation=True, max_length=max_sequence_length, return_tensors="pt", is_split_into_words=False)
    batch_dataset = Dataset.from_dict({k: v for k, v in batch_tokenized.items()})
    batch_dataset = batch_dataset.add_column("labels", labels_batch[:len(batch_dataset)])

    # Convert labels to numpy array of shape (1000, 1002), padding with the ignore index -100
    labels_array = np.array([np.pad(label, (0, 1002 - len(label)), constant_values=-100) for label in batch_dataset["labels"]])

    # Initialize a trainer for each model
    data_collator = DataCollatorForTokenClassification(tokenizer)
    trainers = [Trainer(model=model, data_collator=data_collator) for model in models]

    # Get the predictions (logits) from each model
    all_predictions = [trainer.predict(test_dataset=batch_dataset)[0] for trainer in trainers]

    if voting == 'hard':
        # Hard voting: majority class per residue across models
        hard_predictions = np.stack([np.argmax(predictions, axis=2) for predictions in all_predictions])
        # The reshape keeps a (batch, length) result whether or not this SciPy version drops the reduced axis
        ensemble_predictions = np.asarray(stats.mode(hard_predictions, axis=0)[0]).reshape(hard_predictions.shape[1:])
    elif voting == 'soft':
        # Soft voting: average the logits across models, then take the argmax
        avg_predictions = np.mean(all_predictions, axis=0)
        ensemble_predictions = np.argmax(avg_predictions, axis=2)
    else:
        raise ValueError("Voting must be either 'hard' or 'soft'")

    print("Shape of ensemble_predictions:", ensemble_predictions.shape)  # Debug print

    # Boolean mask that is True wherever a real (non-padding) label exists
    mask_2d = labels_array != -100

    # Filter true labels and predictions using the mask
    true_labels_list = [label[mask_2d[idx]] for idx, label in enumerate(labels_array)]
    true_labels = np.concatenate(true_labels_list)
    flat_predictions_list = [ensemble_predictions[idx][mask_2d[idx]] for idx in range(ensemble_predictions.shape[0])]
    flat_predictions = np.concatenate(flat_predictions_list).tolist()

    # Compute the metrics
    accuracy = accuracy_score(true_labels, flat_predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, flat_predictions, average='binary')
    auc = roc_auc_score(true_labels, flat_predictions)
    mcc = matthews_corrcoef(true_labels, flat_predictions)  # Compute MCC

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1, "auc": auc, "mcc": mcc}
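The two voting strategies can disagree, which is easiest to see on a single residue. The sketch below uses made-up class probabilities from three hypothetical models. It is written in plain Python rather than NumPy so that it runs without dependencies, and it averages probabilities for readability, whereas the function above averages raw logits.

```python
from collections import Counter

# Toy scores [P(non-binding), P(binding)] for one residue from three hypothetical models
per_model_probs = [
    [0.55, 0.45],  # model 1 leans slightly towards non-binding
    [0.60, 0.40],  # model 2 leans slightly towards non-binding
    [0.05, 0.95],  # model 3 is confident the residue is a binding site
]


def argmax(values):
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(probs):
    """Each model casts one vote for its top class; the majority wins."""
    votes = [argmax(p) for p in probs]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probs):
    """Average the scores across models, then take the top class."""
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return argmax(mean)


print("hard vote:", hard_vote(per_model_probs))
print("soft vote:", soft_vote(per_model_probs))
```

Hard voting ignores how confident each model is, while soft voting lets one confident model outweigh two uncertain ones. With an even number of models, hard voting can tie. `scipy.stats.mode` returns the smallest of the tied values, so a tie between two models resolves to class 0 (non-binding).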
Step 4: Define a Function to Evaluate in Batches
# Step 4: Evaluate in Batches
# `dataset_name` ("train" or "test") labels the printed lines; the calls in Step 6 pass it as the fourth argument.
def evaluate_in_batches(sequences, labels, models, dataset_name, voting='hard', batch_size=1000):
    num_batches = len(sequences) // batch_size + int(len(sequences) % batch_size != 0)
    metrics_list = []

    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        batch_metrics = compute_metrics_for_batch(sequences[start_idx:end_idx], labels[start_idx:end_idx], models, voting)

        # Print metrics for the first five batches
        if i < 5:
            print(f"{dataset_name} - Batch {i+1}/{num_batches} metrics: {batch_metrics}")

        metrics_list.append(batch_metrics)

    # Average metrics over all batches
    avg_metrics = {key: np.mean([metrics[key] for metrics in metrics_list]) for key in metrics_list[0]}
    return avg_metrics
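Two details of the batching logic are easy to miss. The batch count is a ceiling division, and the mean of per-batch metrics is not in general equal to the metric computed over all residues at once. The sketch below illustrates both with precision on made-up counts. `count_batches` and the numbers are illustrative and not taken from the dataset.

```python
import math


def count_batches(n_items, batch_size):
    """Same value as `n // b + int(n % b != 0)` in the evaluation loop."""
    return math.ceil(n_items / batch_size)


def precision(tp, fp):
    return tp / (tp + fp)


# Made-up (true positive, false positive) counts for two batches
batches = [(90, 10), (1, 9)]

# Averaging per-batch precision weights every batch equally
mean_of_batches = sum(precision(tp, fp) for tp, fp in batches) / len(batches)

# Pooling the counts weights every prediction equally
pooled = precision(sum(tp for tp, _ in batches), sum(fp for _, fp in batches))

print("batches needed for 2500 items:", count_batches(2500, 1000))
print("mean of per-batch precision:", round(mean_of_batches, 3))
print("pooled precision:", round(pooled, 3))
```

When batches contain similar numbers of positives, the two numbers stay close. For a single global figure, the alternative would be to collect the flattened labels and predictions from every batch and compute the metrics once at the end.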
Step 5: Define the Ensemble
# Load pre-trained base model and fine-tuned LoRA models
accelerator = Accelerator()
base_model_path = "facebook/esm2_t12_35M_UR50D"
lora_model_paths = [
    "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_cp1",
    "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp1",
    # Add more models or swap out for your own models
]
# Load a fresh copy of the base model for each adapter: PEFT injects the LoRA layers
# into the base model it is given, so sharing one instance would mix the adapters.
models = [
    PeftModel.from_pretrained(AutoModelForTokenClassification.from_pretrained(base_model_path), path)
    for path in lora_model_paths
]
models = [accelerator.prepare(model) for model in models]
Step 6: Ensemble Voting and Metric Computation
# Step 6: Compute and print the metrics
train_metrics_hard = evaluate_in_batches(train_sequences, train_labels, models, "train", voting='hard')
test_metrics_hard = evaluate_in_batches(test_sequences, test_labels, models, "test", voting='hard')
train_metrics_soft = evaluate_in_batches(train_sequences, train_labels, models, "train", voting='soft')
test_metrics_soft = evaluate_in_batches(test_sequences, test_labels, models, "test", voting='soft')
train_metrics_hard, test_metrics_hard, train_metrics_soft, test_metrics_soft
This will print something like the following:
train - Batch 1/451 metrics: {'accuracy': 0.9907783025067246, 'precision': 0.7792440817271516, 'recall': 0.9714265098491954, 'f1': 0.8647867420349434, 'auc': 0.9814053346312887, 'mcc': 0.8656769123429833}
train - Batch 2/451 metrics: {'accuracy': 0.9906862419735746, 'precision': 0.7686626071267478, 'recall': 0.9822046109510086, 'f1': 0.8624114372469636, 'auc': 0.9865753167670478, 'mcc': 0.8645747724704963}
train - Batch 3/451 metrics: {'accuracy': 0.9907034630406232, 'precision': 0.7662082514734774, 'recall': 0.9884141926140478, 'f1': 0.8632411067193676, 'auc': 0.9895938451445732, 'mcc': 0.8659743174909746}
train - Batch 4/451 metrics: {'accuracy': 0.991028787153535, 'precision': 0.7751964275620372, 'recall': 0.9881115354132142, 'f1': 0.8687994931897371, 'auc': 0.9896153675458282, 'mcc': 0.871052392709521}
train - Batch 5/451 metrics: {'accuracy': 0.9901174908557153, 'precision': 0.7585922916437905, 'recall': 0.9865762227775794, 'f1': 0.8576926658183058, 'auc': 0.988401969496207, 'mcc': 0.8605718730416185}
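After that there is a long wait while the remaining training batches finish. The metrics for the first five test batches are then printed in the same format: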
test - Batch 1/114 metrics: {'accuracy': 0.9410464672512716, 'precision': 0.37514282087088996, 'recall': 0.8439481350317016, 'f1': 0.5194051887787388, 'auc': 0.8944018149939027, 'mcc': 0.5392923907809524}
test - Batch 2/114 metrics: {'accuracy': 0.938214353140821, 'precision': 0.361414131305044, 'recall': 0.8304587788892721, 'f1': 0.5036435270736724, 'auc': 0.886450001724052, 'mcc': 0.5233747173742583}
test - Batch 3/114 metrics: {'accuracy': 0.9411384591024733, 'precision': 0.3683750578316969, 'recall': 0.8300225864365552, 'f1': 0.5102807398572268, 'auc': 0.8877119446522322, 'mcc': 0.5294666106367614}
test - Batch 4/114 metrics: {'accuracy': 0.9403683315585174, 'precision': 0.369614054572532, 'recall': 0.8394290300389818, 'f1': 0.5132402166102942, 'auc': 0.8918623875782199, 'mcc': 0.5334084101768152}
test - Batch 5/114 metrics: {'accuracy': 0.9400765476285562, 'precision': 0.37219051467245823, 'recall': 0.8356296422294041, 'f1': 0.514999563204333, 'auc': 0.8899200984461443, 'mcc': 0.5337721026971387}
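The process then repeats for the soft voting strategy. After another long wait for every train and test batch under soft voting, you should get the train and test metrics averaged over all batches.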
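The printed test batches combine accuracy around 0.94 with precision below 0.4. That pattern is what one would expect when binding residues are a small minority of all residues. The sketch below computes the same family of metrics from a confusion matrix. The counts are invented to mimic that kind of imbalance and are not taken from the dataset.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Invented counts: 50 true binding residues out of 1000, with 60 false alarms
metrics = binary_metrics(tp=40, fp=60, tn=890, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy stays high because the many true negatives dominate it, while precision, F1, and MCC expose the false positives. This is why the summary at the top of the post points to those three metrics when judging overfitting.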
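Inference
Finally, we can run inference on a protein of interest, as in the code below. This code can be run independently of the rest of the code in this post and should take only a few seconds.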
from transformers import AutoModelForTokenClassification, AutoTokenizer, DataCollatorForTokenClassification, Trainer
from datasets import Dataset
from peft import PeftModel
import numpy as np
from scipy import stats
# ESM-2 base model
base_model_path = "facebook/esm2_t12_35M_UR50D"
# Paths to the saved LoRA models
lora_model_paths = [
    "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3",
    "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_cp1",
    "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp1",
    # add paths to other models
]
# Load the models, each on its own fresh copy of the base model: PEFT injects the LoRA
# layers into the base model it is given, so sharing one instance would mix the adapters.
models = [
    PeftModel.from_pretrained(AutoModelForTokenClassification.from_pretrained(base_model_path), path)
    for path in lora_model_paths
]
# Define the new protein sequence
new_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"
# Step 1 and 2: Tokenization and Dataset creation
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
tokenized_inputs = tokenizer(new_sequence, return_tensors="pt", truncation=True, padding=True, is_split_into_words=False)
new_dataset = Dataset.from_dict({k: v for k, v in tokenized_inputs.items()})
# Step 3: Create trainer objects for each model in the ensemble
data_collator = DataCollatorForTokenClassification(tokenizer)
trainers = [Trainer(model=model, data_collator=data_collator) for model in models]
# Step 4: Getting predictions from each model and applying voting strategies
all_predictions = [trainer.predict(test_dataset=new_dataset)[0] for trainer in trainers]
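# Hard voting: majority class per residue across models
hard_predictions = np.stack([np.argmax(predictions, axis=2) for predictions in all_predictions])
# The reshape keeps a (batch, length) result whether or not this SciPy version drops the reduced axis
ensemble_predictions_hard = np.asarray(stats.mode(hard_predictions, axis=0)[0]).reshape(hard_predictions.shape[1:])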
# Soft voting
avg_predictions = np.mean(all_predictions, axis=0)
ensemble_predictions_soft = np.argmax(avg_predictions, axis=2)
# Print the final predictions obtained using hard and soft voting
print("Hard voting predictions:", ensemble_predictions_hard)
print("Soft voting predictions:", ensemble_predictions_soft)
This will print something like the following:
Hard voting predictions: [0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0]
Soft voting predictions: [[0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0]]
Here, 1 marks a residue the ensemble predicts to be a binding site and 0 marks a residue it predicts to be a non-binding site. Next, to get output better suited to designing binders for your protein, run the following code:
# Convert token IDs back to amino acid residues
residues = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"][0])
# Print the amino acid residues and their positions for binding sites using hard voting
binding_sites_hard = [(idx, residue) for idx, (label, residue) in enumerate(zip(ensemble_predictions_hard[0], residues)) if label == 1]
print("Binding sites (Hard voting):")
for position, residue in binding_sites_hard:
    print(f"{residue}{position}")
# Print the amino acid residues and their positions for binding sites using soft voting
binding_sites_soft = [(idx, residue) for idx, (label, residue) in enumerate(zip(ensemble_predictions_soft[0], residues)) if label == 1]
print("\nBinding sites (Soft voting):")
for position, residue in binding_sites_soft:
    print(f"{residue}{position}")
This will print something like the following:
Binding sites (Hard voting):
P8
N9
H10
I12
Y13
I14
N15
N16
L17
N18
E19
K20
K22
F34
G38
L41
L44
V45
S46
R47
S48
L49
K50
M51
R52
G53
Q54
A55
F59
Q73
G74
Y78
D79
K80
P81
M82
I84
Q85
Y86
A87
K88
T89
D90
Binding sites (Soft voting):
P8
N9
H10
I12
Y13
I14
N15
N16
L17
N18
E19
K20
K22
F34
G38
L41
L44
V45
S46
R47
S48
L49
K50
M51
R52
G53
Q54
A55
F59
Q73
G74
Y78
D79
K80
P81
M82
I84
Q85
Y86
A87
K88
T89
D90
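Downstream tools often expect chain-prefixed residue numbers rather than `(position, residue)` pairs. The RFDiffusion settings later in this post, for example, use a `hotspot` string such as "A41,A44,A45,A46". The helper below is a hypothetical convenience (`to_hotspot_string` is not part of the original code). It assumes the protein is on chain "A" and that the PDB file numbers residues from 1 in sequence order, both of which should be checked against your own PDB file.

```python
def to_hotspot_string(positions, chain="A"):
    """Join 1-based residue positions into a chain-prefixed, comma-separated string."""
    return ",".join(f"{chain}{pos}" for pos in sorted(set(positions)))


# A subset of the predicted sites printed above: L41, L44, V45, S46
selected = [(41, "L"), (44, "L"), (45, "V"), (46, "S")]

hotspot = to_hotspot_string(pos for pos, _ in selected)
print(hotspot)  # A41,A44,A45,A46
```

In practice `selected` would be a slice or filter of `binding_sites_hard` or `binding_sites_soft`, chosen so that the hotspots form a contiguous patch on the folded structure.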
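Designing Binders for Your Protein with RFDiffusion
RFDiffusion is a diffusion model that generates 3D protein structures. It is conceptually similar to diffusion models such as Stable Diffusion and DALL-E, but for proteins. Its architecture differs from that of Stable Diffusion: it uses RoseTTAFold as its backbone model rather than the UNet used in Stable Diffusion.
Once you have binding site predictions, head to the RFDiffusion notebook and use some subset of the predicted binding sites as "hotspots" to design a binder for your protein. You will first need a PDB file for your protein. To get one, go to the ESMFold tool on the ESM Metagenomic Atlas website. Select "Fold Sequence", paste in your protein sequence, and press Enter. Once your protein has been folded, you should get a 3D structure.
You can now download your PDB file. After downloading it, upload it to the RFDiffusion Google Colab notebook and use the path of the uploaded file in the notebook to design a binder for your protein. Use the following settings: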
%%time
#@title run **RFdiffusion** to generate a backbone
name = "test" #@param {type:"string"}
contigs = "100" #@param {type:"string"}
pdb = "/content/unnamed.pdb" #@param {type:"string"}
iterations = 50 #@param ["25", "50", "100", "150", "200"] {type:"raw"}
hotspot = "A41,A44,A45,A46" #@param {type:"string"}
num_designs = 1 #@param ["1", "2", "4", "8", "16", "32"] {type:"raw"}
visual = "interactive" #@param ["none", "image", "interactive"]
#@markdown ---
#@markdown **symmetry** settings
#@markdown ---
symmetry = "cyclic" #@param ["none", "auto", "cyclic", "dihedral"]
order = 3 #@param ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"] {type:"raw"}
chains = "" #@param {type:"string"}
add_potential = True #@param {type:"boolean"}
#@markdown - `symmetry='auto'` enables automatic symmetry detection with [AnAnaS](https://team.inria.fr/nano-d/software/ananas/).
#@markdown - `chains="A,B"` filter PDB input to these chains (may help auto-symm detector)
#@markdown - `add_potential` to discourage clashes between chains
# determine where to save
path = name
while os.path.exists(f"outputs/{path}_0.pdb"):
    path = name + "_" + ''.join(random.choices(string.ascii_lowercase + string.digits, k=5))

flags = {"contigs":contigs,
         "pdb":pdb,
         "order":order,
         "iterations":iterations,
         "symmetry":symmetry,
         "hotspot":hotspot,
         "path":path,
         "chains":chains,
         "add_potential":add_potential,
         "num_designs":num_designs,
         "visual":visual}

for k,v in flags.items():
    if isinstance(v,str):
        flags[k] = v.replace("'","").replace('"','')

contigs, copies = run_diffusion(**flags)
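The result should be a cyclic protein design.
You can run the rest of the RFDiffusion Colab notebook to obtain a sequence that folds into the structure you generated and to validate it. That's it! You have designed a protein that is predicted to bind your protein of interest along the "hotspots", that is, along the sites of interest given by a subset of the binding sites predicted by an ESMBind model or ensemble. Be sure to read the RFDiffusion paper linked from the RFDiffusion GitHub, and show your support for the RFDiffusion developers by starring their GitHub repository. They have built an amazing protein diffusion model! You can also interact with more protein-related models, including RFDiffusion, on Neurosnap.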