DeepLPI: a novel deep learning-based model for protein-ligand interaction prediction for drug repurposing

Student: Bomin Wei
Table: COMP2
Experimentation location: School, Home
Regulated Research (Form 1c): No
Project continuation (Form 7): No


Abstract:

Bibliography/Citations:

Abbasi, K., Razzaghi, P., Poso, A., Amanlou, M., Ghasemi, J. B., & Masoudi-Nejad, A. (2020). DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics, 36(17). https://doi.org/10.1093/bioinformatics/btaa544

Ballester, P. J., & Mitchell, J. B. O. (2010). A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics, 26(9). https://doi.org/10.1093/bioinformatics/btq112

Bepler, T., & Berger, B. (2021). Learning the protein language: Evolution, structure, and function. Cell Systems, 12(6). https://doi.org/10.1016/j.cels.2021.05.017

Giorgi, G., Lecca, L. I., Alessio, F., Finstad, G. L., Bondanini, G., Lulli, L. G., Arcangeli, G., & Mucci, N. (2020). COVID-19-Related Mental Health Effects in the Workplace: A Narrative Review. International Journal of Environmental Research and Public Health, 17(21), 7857. https://doi.org/10.3390/ijerph17217857

He, K., Zhang, X., Ren, S., & Sun, J. (2015a). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

He, K., Zhang, X., Ren, S., & Sun, J. (2015b). Deep Residual Learning for Image Recognition.

He, T., Heidemeyer, M., Ban, F., Cherkasov, A., & Ester, M. (2017). SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. Journal of Cheminformatics, 9(1). https://doi.org/10.1186/s13321-017-0209-z

Jaeger, S., Fulle, S., & Turk, S. (n.d.). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition.

Li, H., Leung, K.-S., Wong, M.-H., & Ballester, P. (2015). Low-Quality Structural and Interaction Data Improves Binding Affinity Prediction via Random Forest. Molecules, 20(6). https://doi.org/10.3390/molecules200610947

Liu, T., Lin, Y., Wen, X., Jorissen, R. N., & Gilson, M. K. (2007). BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research, 35(Database). https://doi.org/10.1093/nar/gkl999

Liu, Z., Li, Y., Han, L., Li, J., Liu, J., Zhao, Z., Nie, W., Liu, Y., & Wang, R. (2015). PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31(3). https://doi.org/10.1093/bioinformatics/btu626

Özçelik, R., Öztürk, H., Özgür, A., & Ozkirimli, E. (2021). ChemBoost: A Chemical Language Based Approach for Protein – Ligand Binding Affinity Prediction. Molecular Informatics, 40(5). https://doi.org/10.1002/minf.202000212

Öztürk, H., Özgür, A., & Ozkirimli, E. (2018). DeepDTA: Deep drug-target binding affinity prediction. Bioinformatics, 34(17), i821–i829. https://doi.org/10.1093/bioinformatics/bty593

Öztürk, H., Ozkirimli, E., & Özgür, A. (2019). WideDTA: prediction of drug-target binding affinity. http://arxiv.org/abs/1902.04166

Pushpakom, S., Iorio, F., Eyers, P. A., Escott, K. J., Hopper, S., Wells, A., Doig, A., Guilliams, T., Latimer, J., McNamee, C., Norris, A., Sanseau, P., Cavalla, D., & Pirmohamed, M. (2019). Drug repurposing: progress, challenges and recommendations. Nature Reviews Drug Discovery, 18(1). https://doi.org/10.1038/nrd.2018.168

Singh, T. U., Parida, S., Lingaraju, M. C., Kesavan, M., Kumar, D., & Singh, R. K. (2020). Drug repurposing approach to fight COVID-19. In Pharmacological Reports (Vol. 72, Issue 6, pp. 1479–1508). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/s43440-020-00155-6

Sun, P., Guo, J., Winnenburg, R., & Baumbach, J. (2017). Drug repurposing by integrated literature mining and drug–gene–disease triangulation. Drug Discovery Today, 22(4). https://doi.org/10.1016/j.drudis.2016.10.008

Thafar, M. A., Olayan, R. S., Ashoor, H., Albaradei, S., Bajic, V. B., Gao, X., Gojobori, T., & Essack, M. (2020). DTiGEMS+: Drug-target interaction prediction using graph embedding, graph mining, and similarity-based techniques. Journal of Cheminformatics, 12(1). https://doi.org/10.1186/s13321-020-00447-2

The AlphaFold team. (2020, November 30). AlphaFold: a solution to a 50-year-old grand challenge in biology. https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology

The Drug Development Process. (2018, January 4). https://www.fda.gov/patients/learn-about-drug-and-device-approvals/drug-development-process

Wallach, I., Dzamba, M., & Heifets, A. (2015). AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery.

Wang, S., Liu, D., Ding, M., Du, Z., Zhong, Y., Song, T., Zhu, J., & Zhao, R. (2021). SE-OnionNet: A Convolution Neural Network for Protein-Ligand Binding Affinity Prediction. Frontiers in Genetics, 11. https://doi.org/10.3389/fgene.2020.607824

Weininger, D. (1990). SMILES. 3. DEPICT. Graphical depiction of chemical structures. Journal of Chemical Information and Modeling, 30(3). https://doi.org/10.1021/ci00067a005

Wouters, O. J., McKee, M., & Luyten, J. (2020). Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009-2018. JAMA, 323(9). https://doi.org/10.1001/jama.2020.1166


Additional Project Information

Project website: -- No project website --
Presentation files:
Research paper:
Additional Resources: -- No resources provided --
Project files:
 

Research Plan:

Rationale:
By the end of 2021, more than 285 million confirmed COVID-19 cases and more than 5 million deaths had been recorded globally. With waning immunity from vaccines and the emergence of new variants, there is still a huge need for effective drugs against the virus (Giorgi et al., 2020; Singh et al., 2020). Drug companies have prioritized the development of such drugs, but their progress is still not fast enough (The Drug Development Process, 2018). After a prospective drug is found, researchers must identify and minimize the risk of possible side effects, as well as maximize the drug’s disease-fighting potency. Candidate drugs must go through several rounds of animal and human trials, and even after they are proven safe and effective, more time is needed before mass production (Pushpakom et al., 2019; The Drug Development Process, 2018). This entire process costs great amounts of time and resources. According to a 2018 report, bringing a new drug to market requires 10 to 20 years and more than 2 billion US dollars (Pushpakom et al., 2019; Wouters et al., 2020).

Drug repurposing is possibly a game-changer. Following this strategy, researchers look for a cure among existing drugs that have already been approved for other diseases (Pushpakom et al., 2019). Since it involves the use of de-risked compounds, drug repurposing is expected to lower the overall development costs and shorten the development timelines (Pushpakom et al., 2019; Sun et al., 2017). However, it remains a challenge to test a large pool of approved drugs for their effectiveness against the virus.

When researchers search for repurposable drug candidates, a key measure is binding affinity, which represents the strength of the binding interaction between a viral protein and an antiviral drug (Ballester & Mitchell, 2010; Singh et al., 2020). A stronger binding affinity means that the drug can inhibit the target protein more effectively than other compounds, suggesting a better chance of curing the infection (Pushpakom et al., 2019). Thanks to breakthroughs in deep learning algorithms, drug-target interaction (DTI) prediction models are widely used to automate the search by scanning a large pool of existing drugs and predicting the interaction between each drug molecule and the protein sequence, promising to further shorten the development process (Pushpakom et al., 2019).

The DTI models currently used for drug repurposing strategies can be classified into two categories. The first category includes SimBoost (T. He et al., 2017), DTiGEMS+ (Thafar et al., 2020), and models based on random forest (RF) and support vector machine (SVM) (Li et al., 2015). These models rely on features extracted by human experts in pre-processing stages and are typically fast in making DTI predictions. However, because feature extraction depends on human expertise, valuable information about the drug molecules and protein sequences may be lost, resulting in low prediction precision (Özçelik et al., 2021).

The second category of models, such as AtomNet (Wallach et al., 2015) and SE-OnionNet (Wang et al., 2021), uses 3-dimensional (3D) spatial structures of proteins and molecules for DTI predictions. While these models dramatically reduce the loss of information, they make predictions at a much slower rate, typically 1 drug-target pair per day. Moreover, in most databases, 3D structure data are available for only a small fraction (0.2%-0.5%, depending on the database) of drug-protein pairs (Z. Liu et al., 2015). Due to these data limitations, the models can only be applied to make DTI predictions within the specified database, whereas any real-world situation requires a model to be applied across different interaction mechanisms (Wang et al., 2021). Low transferability is the main drawback of these models.

In this project, I intend to construct a model, which I call DeepLPI, that achieves high precision and high transferability while producing predictions at a fast rate. The model will be tested in classification tasks to predict binding or non-binding between drug-protein pairs. Instead of 3D structure data, which are rarely available in databases, I propose to use 1D protein sequences and molecular SMILES (Simplified Molecular Input Line Entry System) strings (Weininger, 1990). Using only the 1D protein sequence may seem a questionable approach in medical research, but recent progress in computational models supports it with strong outcomes. AlphaFold2 (The AlphaFold team, 2020), a model published by DeepMind, allows researchers to predict protein structure from sequence. The success of AlphaFold2 suggests that we can make predictions based solely on the protein sequence. With this choice, my model can be trained with 20 to 500 times more data points for higher prediction accuracy, and the accessibility of new data points also makes my model applicable to targets previously ignored by state-of-the-art models, delivering higher transferability.
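To make the idea of 1D inputs concrete, here is a minimal sketch of turning a SMILES string and a protein sequence into fixed-length integer sequences. It is an illustration only: the character-level vocabulary is a simple stand-in for the learned embeddings the project considers (such as Mol2vec or ProSE), and the function names, the `max_len` values, and the short protein fragment are hypothetical.

```python
# Illustrative only: a character-level vocabulary standing in for learned
# embeddings. Names and lengths here are assumptions, not the project's code.

def build_vocab(sequences):
    """Map every distinct character to a positive integer id (0 is reserved)."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(sequence, vocab, max_len):
    """Turn a string into a fixed-length id list, truncating or zero-padding.

    Characters missing from the vocabulary also map to 0.
    """
    ids = [vocab.get(ch, 0) for ch in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))


smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin written as a SMILES string
protein = "MKTAYIAKQR"               # hypothetical short amino-acid fragment

smiles_vocab = build_vocab([smiles])
protein_vocab = build_vocab([protein])

smiles_ids = encode(smiles, smiles_vocab, max_len=32)
protein_ids = encode(protein, protein_vocab, max_len=16)
```

Both inputs end up as plain sequences of tokens, which is what allows language-model-style embedding methods to be applied to them.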

 

Research Questions: 
How can we build a model that accurately predicts drug-target interactions from 1-dimensional molecular and protein sequence data and achieves both high precision and high transferability?
1) Which databases are suitable for model training and testing? Can I retrieve high-quality data?
2) Which binding affinity labels are best for model training and transferability?
3) There are multiple state-of-the-art NLP methods for embedding. Which method is optimal for embedding protein sequences to achieve high prediction accuracy and avoid overfitting?
4) After embedding, how many Convolutional Neural Network layers are best for feature extraction? Will the advanced ResNet method help speed up the extraction and achieve higher accuracy?
5) LSTM has been a powerful method in language processing. Will it be useful in dealing with drug-protein interactions? Should a bi-directional LSTM be used to avoid single-direction information loss? What are the optimal hyperparameters, such as the number of layers and the number of cells?
6) Will our model work both for a classification task and a regression task?
7) Which hyperparameters are best for the MLP fully connected layers, such as the number of layers and the number of neurons? Where in the network should I place the ReLU and dropout layers to optimize the model?
8) Will the model perform better than the current state-of-the-art model DeepCDA?
9) Will the model produce meaningful predictions on COVID-19 data?
10) What performance metrics should I adopt in evaluating the model? How should I analyze the performance of the model on different scenarios of the test database?

 

Procedures:
First, I needed to learn the basics of machine learning methods and how to implement them with popular Python packages. I studied college-level course materials with my research advisor and used online videos and classes. I learned a number of important methods and their implementations, including traditional Logistic Regression, SVM, Decision Tree, and Random Forest, as well as neural network implementations in PyTorch, including MLP, CNN, and LSTM.

  1. Prepare the GPU computing resources and environment. This includes:
    1. Build an Anaconda environment and install the Python packages needed for the deep neural networks, such as PyTorch, NumPy, SciPy, Matplotlib, Pandas, and RDKit.
  2. Find the appropriate database. Candidates are BindingDB, ChEMBL, Davis, and KIBA. This includes:
    1. Find out proper ways of downloading the data.
    2. Learn to process the data format.
    3. Choose proper binding affinity labels and clean them for high-quality data input. Also, convert the labels into binary 0-1 values for classification tasks, using the threshold suggested in the literature.
  3. Research and implement embedding methods that convert drug molecule strings and protein sequences into numeric vector representations.
    1. Research the candidate methods, including SeqVec, Prot2Vec, Mol2vec, ProSE, and AllenNLP.
    2. Learn how to use the models to obtain the vector representations. Adjust the parameters of the embedding according to our needs. Compute the embeddings and save the embedded vectors separately for drugs and proteins.
  4. Build the first version of the prediction network
    1. Start from simple CNN feature extraction and add MLP layers to obtain prediction values.
    2. Train and test the model performance on one example dataset. The choice was BindingDB.
  5. Revise the model network.
    1. Add ResNet blocks, add LSTM modules, adjust the number of layers in every module and optimize the network performance.
  6. Optimize the model hyperparameters.
    1. Optimize ResNet kernel size
    2. Adjust pooling method choice and size
    3. Adjust biLSTM layers and size
    4. Adjust dropout size
    5. Adjust model parameter initialization methods
    6. Adjust optimizer choices (Adam, SGD parameters) and adjust regularization methods
    7. Adjust learning rate and stopping criteria, to obtain the best model.
  7. Repeat steps 5 to 6 to obtain various versions of the model and compare their performance on the test set.
    1. Use AUROC for classification tasks to determine the optimal network architecture.
  8. Evaluate performance
    1. Divide the independent test set into four different scenarios according to whether the protein or the drug has been seen by the network in the training set.
    2. Calculate and collect all classification metrics on all of the test sets and analyze the model performance. The choices were AUROC, sensitivity, specificity, PPV, and NPV.
    3. After evaluating and comparing all models on the classification task, evaluate model performance on regression tasks using MSE-related metrics, if time permits.
  9. Model comparison.
    1. Apply the same evaluation method to the DeepCDA model downloaded from GitHub.
    2. Record the model performance and compare it with my model performance.
  10. Apply the trained model to an external dataset, such as the COVID-19 dataset, to evaluate transferability and try to find a possible effective drug.
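The classification metrics named in step 8 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's evaluation code: the function names, the 0.5 decision cutoff, and the toy labels and scores are all hypothetical. AUROC is computed here through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive receives the higher score, counting ties as one half).

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption

metrics = confusion_metrics(y_true, y_pred)
area = auroc(y_true, scores)
```

The pairwise loop is quadratic in the number of samples, so for large test sets a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.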
     

Risk and Safety:
This project involves only the use of computers and publicly accessible databases, and therefore the project does not have any risk or safety issues.

 

 

Questions and Answers

1. What was the major objective of your project and what was your plan to achieve it? 

The major objective was to help speed up the drug development process and find effective cures for the SARS-CoV-2 virus.

1) Drug repurposing is a new strategy to reuse drugs that have been approved for other diseases, to lower the development costs and shorten the development timeline.

2) Among all the drug repurposing methods, computational DTI (drug-target interaction) models can predict binding affinities between drugs and proteins to find novel candidates, and thus are promising to further shorten the development process.

I planned to build a fast and accurate machine learning method to predict drug-protein binding affinity for use in drug repurposing, hoping the method can help find an effective COVID-19 drug.

To achieve this goal, I first researched the current machine learning methods and found that they either relied on features selected by human experts or used complex 3D spatial structures for predictions. As a result, the methods were either fast but inaccurate due to information loss during human feature selection, or accurate but limited in applicability due to the lack of 3D protein information.

Then I proposed the idea to use the drug molecule SMILES string and protein amino acid sequence as inputs, which are widely available information, and designed the modeling methods based on Natural Language Processing techniques to embed the input information, followed by a network composed of ResNet, LSTM and MLP deep neural networks.
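As a rough picture of how such a pipeline can be wired together, the sketch below chains a 1D residual convolution block, a bi-directional LSTM, and an MLP head in PyTorch. It is a minimal sketch under stated assumptions, not the actual DeepLPI architecture: the class names, layer sizes, pooling choice, and the embedding dimensions (300 for drugs, 100 for proteins) are all hypothetical, and the inputs are assumed to be pre-computed embedding vectors.

```python
import torch
from torch import nn


class ResBlock1d(nn.Module):
    """Two same-padding 1D convolutions with a skip connection."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the length unchanged for odd kernels
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # residual (skip) connection


class Encoder(nn.Module):
    """Extracts features from one pre-computed embedding vector."""

    def __init__(self, channels=16, pooled_len=32):
        super().__init__()
        self.stem = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.block = ResBlock1d(channels)
        self.pool = nn.AdaptiveMaxPool1d(pooled_len)

    def forward(self, x):   # x: (batch, embed_dim)
        x = x.unsqueeze(1)  # (batch, 1, embed_dim)
        return self.pool(self.block(self.stem(x)))  # (batch, channels, pooled_len)


class InteractionNet(nn.Module):
    """Hypothetical drug-protein classifier: ResNet-style encoders, biLSTM, MLP."""

    def __init__(self, channels=16, pooled_len=32, hidden=32, dropout=0.2):
        super().__init__()
        self.drug_encoder = Encoder(channels, pooled_len)
        self.protein_encoder = Encoder(channels, pooled_len)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 1),
        )

    def forward(self, drug_vec, protein_vec):
        d = self.drug_encoder(drug_vec)
        p = self.protein_encoder(protein_vec)
        # Concatenate along the length axis, then put features last for the LSTM.
        seq = torch.cat([d, p], dim=2).transpose(1, 2)  # (batch, 2*pooled_len, channels)
        out, _ = self.lstm(seq)                         # (batch, 2*pooled_len, 2*hidden)
        return self.head(out.mean(dim=1)).squeeze(1)    # one logit per pair


torch.manual_seed(0)
model = InteractionNet()
model.eval()  # disable dropout for this demonstration
with torch.no_grad():
    logits = model(torch.randn(4, 300), torch.randn(4, 100))
probs = torch.sigmoid(logits)  # binding probabilities for 4 drug-protein pairs
```

For the binary binding/non-binding task, logits like these would typically be trained with `nn.BCEWithLogitsLoss`; a regression variant could keep the same head and switch to a mean-squared-error loss.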

In brief, my plan was as follows:
1) The network model will be trained and tested on data from publicly accessible databases, namely BindingDB and Davis.
2) The interaction data will first be treated as binary binding/non-binding labels in a classification task and later, if time permits, as continuous values in a regression task.
3) The BindingDB-trained model will also be applied to the Davis and COVID-19 datasets to test performance on external data.
4) The model performance will also be compared with a recently published state-of-the-art model found in the literature, namely DeepCDA.

 

       a. Was that goal the result of any specific situation, experience, or problem you encountered?  

After the COVID-19 pandemic broke out in the US, the schools had to close, and I had to stay home and take online classes. I longed to get back to school with friends and in-person classes, and I started to apply what I had learned in computational modeling to forecast future trends.

Vaccination was a major hope of returning to normal life at that moment. I worked on a computational model to predict the virus mutation probability to help with vaccine development. I had great fun with the project, and it won an honorable mention award at the Mercer Science and Engineering Fair 2021. During the interview session, one judge challenged me to think big about how to use the predictive power to find better cures for COVID-19.

Following the advice, I started to search for ideas and encountered the research papers about using machine learning techniques to speed up drug development for COVID-19.

I was surprised to learn how long drug development takes and how much it costs. Drug companies have prioritized the development of such drugs, but their progress is still not fast enough. After a prospective drug is found, researchers must identify and minimize the risk of possible side effects, as well as maximize the drug’s disease-fighting potency. Candidate drugs must go through several rounds of animal and human trials, and even after they are proven safe and effective, more time is needed before mass production. This entire process costs great amounts of time and resources. As shown in Figure 1, according to a 2018 report, bringing a new drug to market requires 10 to 20 years and more than 2 billion US dollars.

I learned about the interesting concept of drug repurposing, which can accelerate safety trials by reusing approved drugs. But the strategy still has to test a large number of candidates, and the current computational and machine learning methods are not effective enough in virtual drug screening because they rely too much on complex spatial structures.

A previous experience from a 2020 computation competition reminded me of some interesting concepts in Natural Language Processing. Why not treat the drug molecules and protein sequences as sentences in a language? Their sequences in fact carry chemical and biological meaning that determines the spatial structures. At that time, the reported success of AlphaFold2 in using sequence to predict protein structure gave me great confidence. Therefore, I took on the project and started to learn how to build deep neural networks and integrate NLP techniques to predict drug-protein interactions for drug development.

 

       b. Were you trying to solve a problem, answer a question, or test a hypothesis?

I was trying to solve a problem and test a hypothesis.

I was trying to solve the problem of slow and costly drug development by aiding drug repurposing with new machine learning computational methods.

I was testing my hypothesis that combining deep learning-based embedding techniques, which capture contextual information from the one-dimensional drug and protein sequences, with a deep learning model that automatically extracts useful features makes it possible to predict drug-target interactions faster and more accurately.

 

2. What were the major tasks you had to perform in order to complete your project?

  1. First, I needed to learn the basics of machine learning methods and how to implement them with popular Python packages. I studied college-level course materials with my research advisor and used online videos and classes. I learned a number of important methods and their implementations, including traditional Logistic Regression, SVM, Decision Tree, and Random Forest, as well as neural network implementations in PyTorch, including MLP, CNN, and LSTM.
  2. Prepare the GPU computing resources and environment. This includes:
    1. Build an Anaconda environment and install the Python packages needed for the deep neural networks, such as PyTorch, NumPy, SciPy, Matplotlib, Pandas, and RDKit.
  3. Find the appropriate database. Candidates are BindingDB, ChEMBL, Davis, and KIBA. This includes:
    1. Find out proper ways of downloading the data.
    2. Learn to process the data format.
    3. Choose proper binding affinity labels and clean them for high-quality data input. Also, convert the labels into binary 0-1 values for classification tasks, using the threshold suggested in the literature.
  4. Research and implement embedding methods that convert drug molecule strings and protein sequences into numeric vector representations.
    1. Research the candidate methods, including SeqVec, Prot2Vec, Mol2vec, ProSE, and AllenNLP.
    2. Learn how to use the models to obtain the vector representations. Adjust the parameters of the embedding according to our needs. Compute the embeddings and save the embedded vectors separately for drugs and proteins.
  5. Build the first version of the prediction network
    1. Start from simple CNN feature extraction and add MLP layers to obtain prediction values.
    2. Train and test the model performance on one example dataset. The choice was BindingDB.
  6. Revise the model network.
    1. Add ResNet blocks, add LSTM modules, adjust the number of layers in every module and optimize the network performance.
  7. Optimize the model hyperparameters.
    1. Optimize ResNet kernel size
    2. Adjust pooling method choice and size
    3. Adjust biLSTM layers and size
    4. Adjust dropout size
    5. Adjust model parameter initialization methods
    6. Adjust optimizer choices (Adam, SGD parameters) and adjust regularization methods
    7. Adjust learning rate and stopping criteria, to obtain the best model.
  8. Repeat steps 6 to 7 to obtain various versions of the model and compare their performance on the test set.
    1. Use AUROC for classification tasks to determine the optimal network architecture.
  9. Evaluate performance
    1. Divide the independent test set into four different scenarios according to whether the protein or the drug has been seen by the network in the training set.
    2. Calculate and collect all classification metrics on all of the test sets and analyze the model performance. The choices were AUROC, sensitivity, specificity, PPV, and NPV.
    3. After evaluating and comparing all models on the classification task, evaluate model performance on regression tasks using MSE-related metrics, if time permits.
  10. Model comparison.
    1. Apply the same evaluation method to the DeepCDA model downloaded from GitHub.
    2. Record the model performance and compare it with my model performance.
  11. Apply the trained model to an external dataset, such as the COVID-19 dataset, to evaluate transferability and try to find a possible effective drug.
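The label cleaning and binarization described in the database step could look roughly like the sketch below. It assumes the affinity is a concentration-type measurement in nM (such as Kd or IC50), where a smaller value means stronger binding. The function names, the record layout, and the 100 nM cutoff are illustrative assumptions; the threshold actually used in the project comes from the literature and is not stated here.

```python
import math


def to_binary_label(affinity_nm, threshold_nm):
    """Return 1 (binding) if the affinity is at or below the threshold, else 0.

    Missing, non-finite, or non-positive measurements return None so the
    caller can drop them during cleaning.
    """
    if affinity_nm is None or not math.isfinite(affinity_nm) or affinity_nm <= 0:
        return None
    return 1 if affinity_nm <= threshold_nm else 0


def clean_and_label(records, threshold_nm):
    """Keep only records with a usable measurement and attach a 0/1 label."""
    labeled = []
    for drug, protein, affinity_nm in records:
        label = to_binary_label(affinity_nm, threshold_nm)
        if label is not None:
            labeled.append((drug, protein, label))
    return labeled


# Hypothetical records: (drug id, protein id, affinity in nM).
records = [
    ("drugA", "protX", 12.0),
    ("drugB", "protX", 5000.0),
    ("drugC", "protY", float("nan")),  # unusable measurement, dropped
    ("drugD", "protY", None),          # missing measurement, dropped
]

# 100 nM is an illustrative cutoff, not the project's threshold.
labeled = clean_and_label(records, threshold_nm=100.0)
```

Keeping the threshold as an explicit parameter makes it easy to rerun the labeling with a different literature value for each affinity type.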

 

       a. For teams, describe what each member worked on.

This was not a team project.

 

3. What is new or novel about your project?

1. My work adopted Natural Language Processing techniques to embed the input drug and protein information for interaction prediction. The results showed that the NLP embedding was successful in increasing the prediction accuracy.

2. The network model integrated ResNet for feature extraction. ResNet has been a fast-developing method in image processing. The results showed that integrating ResNet increased the model's accuracy.

 

       a. Is there some aspect of your project's objective, or how you achieved it that you haven't done before?

I had never studied the interaction between drugs and proteins before.

I had never processed drug-protein interaction data before.

I had never designed, built, or tested any machine learning models before, especially deep neural network models.

I had never learned or used ResNet, Convolutional Neural Networks, or LSTM before.

All of the neural networks were built on the PyTorch framework. I had used Python before, but not the PyTorch framework.

 

       b. Is your project's objective, or the way you implemented it, different from anything you have seen?

Using NLP embedding to represent drug and protein information for drug-protein interaction prediction is a proposal I have never seen before.

I have also not seen a prediction neural network that integrates NLP embedding, ResNet, and LSTM together.

Usually, in machine learning, people test their model performance on a random split of the obtained data. I proposed a new way to split the test data into four different test sets according to whether the drug or the protein has been seen in the training set. These test sets better reflect the model performance in real-life applications. I have not seen this testing method proposed before.
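The four-way test split can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the scenario names, and the toy identifiers are hypothetical, and "seen" is taken to mean that the exact drug or protein identifier appears in at least one training pair.

```python
def split_by_novelty(train_pairs, test_pairs):
    """Group test pairs by whether their drug and protein occur in training."""
    seen_drugs = {drug for drug, _ in train_pairs}
    seen_proteins = {protein for _, protein in train_pairs}

    scenarios = {
        "both_seen": [],
        "drug_unseen": [],
        "protein_unseen": [],
        "both_unseen": [],
    }
    for drug, protein in test_pairs:
        drug_seen = drug in seen_drugs
        protein_seen = protein in seen_proteins
        if drug_seen and protein_seen:
            key = "both_seen"
        elif protein_seen:
            key = "drug_unseen"      # new drug, known protein
        elif drug_seen:
            key = "protein_unseen"   # known drug, new protein
        else:
            key = "both_unseen"
        scenarios[key].append((drug, protein))
    return scenarios


# Toy identifiers, for illustration only.
train_pairs = [("d1", "p1"), ("d2", "p2")]
test_pairs = [("d1", "p2"), ("d3", "p1"), ("d2", "p3"), ("d4", "p4")]

scenarios = split_by_novelty(train_pairs, test_pairs)
```

Reporting metrics separately for each group shows how much of the performance depends on the model having already encountered the drug or the protein during training.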

 

       c. If you believe your work to be unique in some way, what research have you done to confirm that it is?

I searched Google Scholar for "drug-target interaction" and similar keywords, using "ligand" as a synonym for "drug", "target" as a synonym for "protein", and "(binding) affinity" as a synonym for "interaction".

I searched for NLP embedding methods and their application to drug molecule information and protein sequence information.

I also followed the references in the papers I found and checked the papers that cited them. I found many useful papers that enriched my knowledge of the field, and as far as I could find, no similar proposal or similar work has been done before.

 

4. What was the most challenging part of completing your project?

Designing and optimizing a proper model architecture was the most challenging part. Deep neural network models have very complex architectures. A complex architecture may capture more features but at the same time lead to overfitting.

 

      a. What problems did you encounter, and how did you overcome them?

  1. The most challenging part of the project was building a model that has prediction precision competitive with previous models that use high-dimensional inputs, and that also has high transferability by using only 1D sequences for molecules and proteins as input. I built nine versions of the model in total: I tested many embedding methods, such as SeqVec, AllenNLP, and Prot2Vec, and compared the performance of different model architectures, such as adding a transformer in the head module and using LSTM on the molecule and the protein separately after the CNN module. In fact, most of the approaches didn't work, but each helped me take one step closer to the best model architecture. By working through these failures, I eventually arrived at the architecture that the model currently uses.
  2. Once, when I discussed my project with a senior school friend majoring in biology, she told me that predicting without the protein 3D structure is questionable: for a single protein sequence, there are many different possible 3D structures. For each structure, the position and way in which the molecular ligand binds to the protein are different, so this prediction is unrealistic in theory. However, I believe this can be solved. DeepMind published a protein structure prediction model called AlphaFold2, which allows people to predict protein structure from sequence. The success of AlphaFold2 told me that we could predict with only the protein sequence. Even though the theoretical basis for using a 1D protein sequence to capture 3D spatial structure has not yet been solidly established, it gave me confidence that I could find a way to extract, from the protein sequence string alone, most of the information needed for making the prediction.

 

      b. What did you learn from overcoming these problems?

Firstly, I learned that in solving a research problem, taking the time and effort to run the test is more effective than hesitating while waiting for a better design. Plus, it is very rewarding and exciting to implement and test out an idea.

Secondly, I realized that in research work, a lot of proposals might not be solidly supported by a rigorous theory. On one hand, we should try our best to justify the design with existing theories to make sure it won’t fail because of obvious mistakes, and on the other hand, we should be courageous enough to take on the challenge and test out our proposal with careful implementations.

 

5. If you were going to do this project again, are there any things you would do differently the next time?

The embedding method is the key to the success of this model proposal. Recently, from communication with an NLP expert, I learned about a newly emerging method called dual-learning that can translate between different languages more effectively. If I were going to do the project again, I would talk to NLP experts and try to apply state-of-the-art NLP embedding methods to my network model, which could possibly enhance the performance.

Meanwhile, the database I am currently using, BindingDB, does not provide access to commercial datasets, which contain more data relevant to finding a cure for COVID-19. This limitation makes the database less representative of the full body of knowledge on drug candidate molecules and COVID-19 proteins. To gain access to more data, especially from COVID-19 studies, I hope to cooperate with drug-developing companies and test my model on their lab results.

 

6. Did working on this project give you any ideas for other projects? 

Undoubtedly yes! This was the first time I had learned machine learning techniques, and their power impressed me. Machine learning techniques can draw great insights from large publicly accessible databases, not only in biology or biomedicine, but potentially also in social studies, economics, and psychology.

I am also fascinated by the fundamental principles behind machine learning algorithms and plan to work on designing faster and more accurate low-level algorithms for embedding sequence-like information.

 

7. How did COVID-19 affect the completion of your project?

COVID-19 has been devastating to everyone around the globe. It is a battleground where I can put what I have learned into real life, especially in computational modeling; and I am determined to keep studying to help in the pandemic, which was the motivation of this project.

My project is completely based on computer programming and online platforms such as cloud computing clusters and publicly accessible databases. Therefore, COVID-19 did not affect the implementation of the project.

For most of the project, I received advice from my research advisor through online meetings. The online communication was effective, but it could have been better if we had been able to meet in person. I believe we could have generated more innovative ideas through in-person meetings.