Mercer Science and Engineering Fair | Fostering STEM Education in Mercer County, NJ

Affiliated with Regeneron ISEF and Thermo Fisher Junior Innovators Challenge

RARE: Machine Learning Approach for Binning Rare Variant Features to Detect Association with Disease

Fair: 2022 Mercer Science and Engineering Fair

Event: Senior Division 2022

Category: Software and Embedded Systems

Student: Satvik Dasariraju

Table: COMP1

Experimentation location: Research Institution, Home. Research conducted at home during remote internship at UPenn

Regulated Research (Form 1c): No

Project continuation (Form 7): No

Abstract:

Bibliography/Citations:

Bibliography from Literature Review:

(Note: references include sources cited in this document as well as other journal articles read prior to the start of this project)

Ali, A. A., Almukhtar, S. E., Abd, K. H., Saleem, Z. S. M., Sharif, D. A., & Hughson, M. D. (2021). The causes and frequency of kidney allograft failure in a low-resource setting: Observational data from Iraqi Kurdistan. BMC Nephrology, 22(1), 272. https://doi.org/10.1186/s12882-021-02486-9

Bicalho, P. R., Requião-Moura, L. R., Arruda, É. F., Chinen, R., Mello, L., Bertocchi, A. P. F., Lamkowski Naka, E., Tonato, E. J., & Pacheco-Silva, A. (2019). Long-term outcomes among kidney transplant recipients and after graft failure: A single-center cohort study in Brazil. BioMed Research International, 2019, 1–10. https://doi.org/10.1155/2019/7105084

Continuous distribution. (n.d.). Retrieved December 20, 2021, from https://optn.transplant.hrsa.gov/policies-bylaws/a-closer-look/continuous-distribution/

Davis, S., & Mohan, S. (2021). Managing patients with failing kidney allograft: Many questions remain. Clinical Journal of the American Society of Nephrology, CJN.14620920. https://doi.org/10.2215/CJN.14620920

Fiorentino, M., Gallo, P., Giliberti, M., Colucci, V., Schena, A., Stallone, G., Gesualdo, L., & Castellano, G. (2021). Management of patients with a failed kidney transplant: What should we do? Clinical Kidney Journal, 14(1), 98–106. https://doi.org/10.1093/ckj/sfaa094

Gómez, E. J., Jungmann, S., & Lima, A. S. (2018). Resource allocations and disparities in the Brazilian health care system: Insights from organ transplantation services. BMC Health Services Research, 18(1), 90. https://doi.org/10.1186/s12913-018-2851-1

Kamoun, M., McCullough, K. P., Maiers, M., Fernandez Vina, M. A., Li, H., Teal, V., Leichtman, A. B., & Merion, R. M. (2017). HLA amino acid polymorphisms and kidney allograft survival. Transplantation, 101(5), e170–e177. https://doi.org/10.1097/TP.0000000000001670

Lee, S., Abecasis, G. R., Boehnke, M., & Lin, X. (2014). Rare-variant association analysis: Study designs and statistical tests. The American Journal of Human Genetics, 95(1), 5–23. https://doi.org/10.1016/j.ajhg.2014.06.009

Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J., Nickerson, D. A., Christiani, D. C., Wurfel, M. M., & Lin, X. (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. The American Journal of Human Genetics, 91(2), 224–237. https://doi.org/10.1016/j.ajhg.2012.06.007

Li, B., & Leal, S. M. (2008). Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. The American Journal of Human Genetics, 83(3), 311–321. https://doi.org/10.1016/j.ajhg.2008.06.024

Liu, Y., Huang, J., Urbanowicz, R. J., Chen, K., Manduchi, E., Greene, C. S., Moore, J. H., Scheet, P., & Chen, Y. (2020). Embracing study heterogeneity for finding genetic interactions in large-scale research consortia. Genetic Epidemiology, 44(1), 52–66. https://doi.org/10.1002/gepi.22262

Madsen, B. E., & Browning, S. R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics, 5(2), e1000384. https://doi.org/10.1371/journal.pgen.1000384

Moore, C. B., Wallace, J. R., Frase, A. T., Pendergrass, S. A., & Ritchie, M. D. (2013). BioBin: A bioinformatics tool for automating the binning of rare variants using publicly available biological knowledge. BMC Medical Genomics, 6(2), S6. https://doi.org/10.1186/1755-8794-6-S2-S6

Morgenthaler, S., & Thilly, W. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1–2), 28–56. https://doi.org/10.1016/j.mrfmmm.2006.09.003

Pritchard, J. K. (2001). Are rare variants responsible for susceptibility to complex diseases? The American Journal of Human Genetics, 69(1), 124–137. https://doi.org/10.1086/321272

Reich, D. E., & Lander, E. S. (2001). On the allelic spectrum of human disease. Trends in Genetics, 17(9), 502–510. https://doi.org/10.1016/S0168-9525(01)02410-6

Van Loon, E., Bernards, J., Van Craenenbroeck, A. H., & Naesens, M. (2020). The causes of kidney allograft failure: More than alloimmunity. A viewpoint article. Transplantation, 104(2), e46–e56. https://doi.org/10.1097/TP.0000000000003012

Zhang, X., Basile, A. O., Pendergrass, S. A., & Ritchie, M. D. (2019). Real world scenarios in rare variant association analysis: The impact of imbalance and sample size on the power in silico. BMC Bioinformatics, 20(1), 46. https://doi.org/10.1186/s12859-018-2591-6

Additional Project Information

Project website: https://github.com/UrbsLab/RARE

Additional Resources: -- No resources provided --

Research Plan:

Research Plan/Project Summary

Initial plan completed on 4/1/21; Addendum completed 12/20/21 (see end of document)

RARE: Machine Learning Approach for Binning Rare Variant Features to Detect Association with Disease

Student Researcher: Satvik Dasariraju

Adult Sponsor: Dr. Ryan J. Urbanowicz

Rationale:

In numerous biomedical fields such as genetics, features with rare variant states pose challenges to analysis. While genome-wide association studies have demonstrated the role of common genetic variants in disease etiology, the effects of rare genetic variants (defined as having a minor allele frequency below 0.05) remain largely unexplained (Pritchard, 2001; Reich & Lander, 2001; Lee et al., 2014). Rare genetic variants are hypothesized to hold the key to explaining missing heritability in complex diseases, yet traditional univariate analysis methods are underpowered to detect rare variant association due to the low frequency of rare genetic variants (Li & Leal, 2008). Numerous alternatives have been presented in the literature, including the cohort allelic sum test (Morgenthaler & Thilly, 2007), nonparametric weighted sum test (Madsen & Browning, 2009), sequence kernel association test (Lee et al., 2012), and BioBin (Moore et al., 2013); these methods aim to overcome the challenges of rare variant association analysis by binning (i.e., grouping) rare variant features. Yet, all previous rare variant binning methods depend on some form of expert knowledge for initialization. Previous methods also have not considered interactions between features and/or interactions between features and bins despite epistasis and other forms of feature interactions being recognized as contributors to complex disease heritability (Liu et al., 2020). Finally, previous methods focused exclusively on genetics without developing algorithms that are versatile for a wide range of applications on biomedical data.
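The rarity threshold mentioned above can be illustrated with a minimal sketch. It assumes each feature is stored as a list of 0/1/2 allele counts per individual; the function names and the example data are hypothetical and are not taken from the RARE code base.

```python
def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one variant.

    Assumes `genotypes` holds 0, 1, or 2 copies of the alternate allele
    for each diploid individual (an illustrative encoding).
    """
    n_alleles = 2 * len(genotypes)
    alt_freq = sum(genotypes) / n_alleles
    # The minor allele is whichever allele is less frequent.
    return min(alt_freq, 1 - alt_freq)


def is_rare(genotypes, threshold=0.05):
    """Flag a variant as rare using the 0.05 cutoff quoted in the text."""
    return minor_allele_frequency(genotypes) < threshold


# Hypothetical variant: 4 alternate alleles among 100 individuals (200 alleles).
rare_variant = [0] * 97 + [1, 1, 2]
# Hypothetical variant: 50 alternate alleles among 100 individuals.
common_variant = [0] * 50 + [1] * 50
```

With so few carriers per feature, a test on any single rare variant has little data to work with, which is the motivation for grouping several such features into a bin.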

Kidney transplant failure is a serious condition that poses a high burden of psychological and medical morbidity, affecting about one in five kidney transplant recipients (Davis & Mohan, 2021). Transplantation failure poses an especially high burden on third world countries like Brazil and Iraq (Ali et al., 2021; Gómez et al., 2018). There is an urgent need to better understand the genetic and molecular mechanisms of transplantation failure. In kidney transplantation, features with rare variant states are particularly critical given the importance of donor-recipient HLA amino acid mismatches in kidney failure and the extremely low frequency of donor-recipient mismatches at many amino acid positions. Accordingly, an algorithm capable of binning donor-recipient HLA amino acid mismatches to detect association to kidney graft failure could elucidate relationships between HLA amino acids, overcome difficulties posed by the low frequency and multicollinearity of HLA amino acids, aid in prediction of kidney graft failure, and guide clinical and organizational efforts such as development of a new calculated panel reactive antibody test for HLA sensitization and the Organ Procurement and Transplantation Network’s new continuous distribution initiative (Continuous distribution, n.d.).

This research plan proposes an evolutionary algorithm (a type of machine learning strategy) that constructs bins (i.e., groups) of features with rare variants and iteratively modifies bins to reach the highest possible association with outcome. The algorithm will be tested with simulation studies and applied to bin HLA amino acid mismatches to detect association to kidney graft failure.

Research Questions for Objective 1:

What is the capability and efficiency of the proposed evolutionary algorithm to construct bins with univariate association to outcome?
What is the capability and efficiency of the proposed evolutionary algorithm to construct bins with epistatic interactions predictive of outcome?
What are the best parameters for the proposed evolutionary algorithm for binning rare variant features?
How does expert knowledge input make the algorithm more reliable and/or efficient?

Research Questions for Objective 2:

What is the capability and efficiency of a machine learning approach (specifically, a genetic/evolutionary algorithm) to optimize bins of HLA amino acids to detect association of amino acid mismatches to kidney graft failure?
At which locus of HLA amino acids (A, B, C, DR, or DQ) are mismatches most predictive of kidney graft failure?
How do the proposed algorithm’s bins compare to expert knowledge sequence feature variant type bins of amino acid positions?

Hypothesis for Objective 1:

If an evolutionary algorithm for constructing bins of rare variant features is developed, then it will be able to reliably construct bins with univariate and epistatic effects (the latter of which will be achieved by scoring bins with methods sensitive to interactions), thus overcoming the limitations of previous rare variant association analysis methods by (1) being able to detect interactions, (2) not requiring but allowing expert knowledge input for initialization of bins, and (3) considering groups of features additively to reach the statistical power required to analyze rare variant features.

Hypothesis for Objective 2:

If the proposed algorithm is applied to HLA amino acid mismatches and kidney graft failure, then it will be able to successfully construct bins of HLA amino acid mismatches with higher association to kidney graft failure compared to individual amino acid mismatches or expert knowledge sequence feature variant type bins because the algorithm adopts a data-driven approach and eliminates assumptions about relationships between amino acid positions across HLA loci.

Engineering Goals:

Objective 1: Development and validation of an evolutionary algorithm capable of constructing bins of rare variant features with optimal association to outcome
- Sub-objective 1-1: The algorithm should be able to construct bins with univariate association to outcome
- Sub-objective 1-2: The algorithm should be able to construct bins with epistatic interactions that are predictive of outcome
- Sub-objective 1-3: Development of a synthetic data simulator to validate sub-objectives 1-1 and 1-2
- Sub-objective 1-4: The algorithm should be able to either start/initialize without expert knowledge or use expert knowledge bins as input for initialization
Objective 2: Application of the algorithm developed in objective 1 to construct bins of HLA amino acid positions with association to kidney graft failure
- Sub-objective 2-1: Compare bins constructed by algorithm to expert knowledge sequence feature variant type bins
- Sub-objective 2-2: Use algorithm to bin amino acids within each HLA locus separately and compare bins from each locus (especially DR vs. DQ locus)

Expected Outcomes:

Objective 1: An evolutionary algorithm (a type of machine learning algorithm) that can construct bins with both univariate and epistatic effects with above 90% accuracy when tested against simulated datasets with known/predetermined optimal bins

Objective 2: When applied to HLA amino acid positions and kidney graft failure, the algorithm should construct bins with higher association to graft failure compared to individual amino acid positions and sequence feature variant type bins

Materials:

Personal computer (PC)
Connection to computing cluster at Penn Medicine
Donor-recipient amino acid position mismatch and graft failure data from Scientific Registry of Transplant Recipients, obtained through collaborator Dr. Malek Kamoun

(Note: all the data is anonymous and de-identified)

Jupyter notebooks for documentation and coding in the Python programming language
Various open source Python packages including sklearn, numpy, and pandas.

Procedures:

Steps of Proposed Algorithm:

Initialize bins with either random grouping of features or expert knowledge input
Repeated evolutionary cycles (mimics natural selection process)
1. Bin fitness evaluation with chi-square or MultiSURF
2. Preserve elite bins for next generation of bins, remove poor bins
3. Genetic operations to choose pairs of parent bins and then create offspring bins through crossover of parent bin pair and mutation
4. Add offspring to next generation of bins
Final bin evaluation and summary
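The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the RARE implementation: bins are represented as sets of feature names, parents are drawn at random from the elite bins, and `fitness` is any caller-supplied scoring function standing in for chi-square or MultiSURF. All parameter names and default values are hypothetical.

```python
import random


def evolve_bins(feature_names, fitness, n_bins=20, bin_size=5, cycles=50,
                elitism=0.4, mutation_prob=0.1, seed=0, initial_bins=None):
    """Evolve bins (sets of feature names) toward higher fitness."""
    rng = random.Random(seed)  # fixed seed for reproducible runs

    # Initialization: expert-knowledge bins if given, otherwise random groups.
    if initial_bins is not None:
        bins = [set(b) for b in initial_bins]
    else:
        bins = [set(rng.sample(feature_names, bin_size)) for _ in range(n_bins)]
    n_elite = max(2, int(elitism * len(bins)))

    for _ in range(cycles):
        # 1. Fitness evaluation: rank bins from best to worst.
        ranked = sorted(bins, key=fitness, reverse=True)
        # 2. Elitism: keep the top bins, drop the rest.
        elite = ranked[:n_elite]
        next_gen = [set(b) for b in elite]
        # 3. Genetic operations: crossover of a parent pair, then mutation.
        while len(next_gen) < len(bins):
            p1, p2 = rng.sample(elite, 2)
            child = set()
            for f in sorted(p1 | p2):
                # Features shared by both parents are kept; features held by
                # only one parent are inherited with probability 0.5.
                if (f in p1 and f in p2) or rng.random() < 0.5:
                    child.add(f)
            for f in feature_names:
                if rng.random() < mutation_prob:
                    # Mutation toggles the membership of a feature.
                    if f in child:
                        child.remove(f)
                    else:
                        child.add(f)
            if not child:  # never emit an empty bin
                child.add(rng.choice(feature_names))
            # 4. Add the offspring to the next generation.
            next_gen.append(child)
        bins = next_gen

    # Final evaluation: return bins sorted from best to worst.
    return sorted(bins, key=fitness, reverse=True)
```

Because the elite bins are copied unchanged into each new generation, the best fitness in the population can never decrease from one cycle to the next in this sketch.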

Objective 1:

Develop and code the proposed evolutionary algorithm for rare variant feature binning
Develop a rare variant data simulator to test algorithm’s ability to construct univariate bins
Develop a rare variant data simulator to test algorithm’s ability to construct epistatic bins
Run 30 trials of each of the following experiments (to validate the algorithm); for each experiment, one of the rare variant data simulators coded in the above steps will be used, so the features that belong in the optimal bin will be known, and the accuracy of the bin constructed by the algorithm can be measured against the optimal bin
1. Test algorithm with MultiSURF scoring on univariate data simulator with 0 noise
2. Test algorithm with MultiSURF scoring on univariate data simulator with 0.05 noise
3. Test algorithm with MultiSURF scoring on univariate data simulator with 0.1 noise
4. Test algorithm with MultiSURF scoring on univariate data simulator with 0.5 noise (negative control)
5. Test algorithm with Chi-square scoring on univariate data simulator with 0 noise
6. Test algorithm with MultiSURF scoring on univariate data simulator with 0 noise with bins forced to maintain fixed bin size
7. Test algorithm with MultiSURF scoring on univariate data simulator with 0 noise, expert knowledge input for initialization
8. Test algorithm with Chi-square scoring on epistatic data simulator with 0 noise
9. Test algorithm with MultiSURF scoring on epistatic data simulator with 0 noise
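The nine experiments and the accuracy measure can be written down compactly. In the sketch below, the experiment table mirrors the list above, while the definition of "percentage of correct features" (the share of features in the constructed bin that belong to the known optimal bin) is one plausible reading and is an assumption; the function and key names are hypothetical.

```python
from statistics import mean

# The nine validation experiments, transcribed from the list above.
EXPERIMENTS = [
    {"scoring": "MultiSURF", "simulator": "univariate", "noise": 0.0},
    {"scoring": "MultiSURF", "simulator": "univariate", "noise": 0.05},
    {"scoring": "MultiSURF", "simulator": "univariate", "noise": 0.1},
    {"scoring": "MultiSURF", "simulator": "univariate", "noise": 0.5,
     "note": "negative control"},
    {"scoring": "chi-square", "simulator": "univariate", "noise": 0.0},
    {"scoring": "MultiSURF", "simulator": "univariate", "noise": 0.0,
     "note": "fixed bin size"},
    {"scoring": "MultiSURF", "simulator": "univariate", "noise": 0.0,
     "note": "expert knowledge initialization"},
    {"scoring": "chi-square", "simulator": "epistatic", "noise": 0.0},
    {"scoring": "MultiSURF", "simulator": "epistatic", "noise": 0.0},
]
N_TRIALS = 30  # replicate trials per experiment


def percent_correct(constructed_bin, optimal_bin):
    """Percentage of features in the constructed bin that are truly predictive.

    Assumed definition; the source does not spell out the exact formula.
    """
    constructed, optimal = set(constructed_bin), set(optimal_bin)
    if not constructed:
        return 0.0
    return 100.0 * len(constructed & optimal) / len(constructed)


def mean_percent_correct(trial_bins, optimal_bin):
    """Average the per-trial accuracy across replicate trials."""
    return mean(percent_correct(b, optimal_bin) for b in trial_bins)
```

For example, a constructed bin holding two predictive features and one random feature would score about 66.7% under this definition.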

Objective 2:

Apply algorithm to construct bins of HLA amino acid mismatches with association to kidney graft failure, where all amino acids are eligible for inclusion in bins
Repeat step 1, but run the algorithm five separate times (once for each HLA locus: A, B, C, DR, DQ)
Evaluate bins in comparison to expert knowledge sequence feature variant type bins

Data Analysis:

Due to the multicomponent nature of the study, there will be numerous sets of variables and data analyses.

Objective 1 Part A: Testing Algorithm on Univariate Simulated Data with Different Levels of Noise

Independent Variable: Level of noise in data (0, 0.05, 0.1, 0.5)

Dependent Variable: Performance of algorithm, measured by percentage of correct features averaged across 30 replicate trials

Control Variables:

Parameters of algorithm
Characteristics of dataset (number of correct features, number of instances, etc.)
The same computer and computing cluster will be used for all runs in this project

Objective 1 Part B: Testing Algorithm on Univariate Simulated Data with Different Configurations

Independent Variable: Configurations of algorithm

Dependent Variable: Performance of algorithm, measured by percentage of correct features averaged across 30 replicate trials

Control Variables:

Characteristics of dataset (number of correct features, number of instances, etc.)
The same computer and computing cluster will be used for all runs in this project

Objective 1 Part C: Testing Algorithm on Univariate Simulated Data with Different Initializations

Independent Variable: Initialization (random vs. expert knowledge)

Dependent Variable: Performance of algorithm, measured by percentage of correct features averaged across 30 replicate trials

Control Variables:

Characteristics of dataset (number of correct features, number of instances, etc.)
The same computer and computing cluster will be used for all runs in this project
Parameters of algorithm aside from initialization

Objective 1 Part D: Testing Algorithm on Simulated Data with Epistasis

Independent Variable: Metric used for scoring bins in cycles of algorithm (MultiSURF vs. chi-square)

Dependent Variable: Performance of algorithm, measured by percentage of correct features averaged across 30 replicate trials

Control Variables:

Characteristics of dataset (number of correct features, number of instances, etc.)
The same computer and computing cluster will be used for all runs in this project

Objective 2 Part A: Comparing Algorithm’s Bins vs. Expert Knowledge Bins

Independent Variable: Whether bins were generated from algorithm or sequence feature variant type categories

Dependent Variable: Association between bin and kidney graft failure, measured by chi-square

Control Variables:

The same computer and computing cluster will be used for all runs in this project

Objective 2 Part B: Comparing Bins from Different HLA Loci

Independent Variable: HLA locus (A, B, C, DR, DQ)

Dependent Variable: Association between bin and kidney graft failure, measured by chi-square

Control Variables:

The same computer and computing cluster will be used for all runs in this project
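The chi-square association between a bin and graft failure, used as the dependent variable in Objective 2, can be sketched in plain Python. The sketch assumes a bin's value for each instance is the sum of its member features and computes Pearson's chi-square statistic on the resulting contingency table, with no continuity correction and no p-value. The function names and the dictionary-per-instance layout are illustrative assumptions.

```python
from collections import Counter


def bin_values(rows, bin_features):
    """Additive bin value per instance: the sum of the member features."""
    return [sum(row[f] for f in bin_features) for row in rows]


def chi_square_statistic(x, y):
    """Pearson chi-square statistic for two categorical sequences.

    `x` could be the bin values and `y` the outcome labels
    (for example 1 = graft failure, 0 = no failure).
    """
    n = len(x)
    observed = Counter(zip(x, y))  # cell counts; missing cells count as 0
    x_totals = Counter(x)          # row totals
    y_totals = Counter(y)          # column totals
    stat = 0.0
    for x_value, x_count in x_totals.items():
        for y_value, y_count in y_totals.items():
            expected = x_count * y_count / n
            stat += (observed[(x_value, y_value)] - expected) ** 2 / expected
    return stat
```

A larger statistic indicates a stronger departure from independence between the bin and the outcome, so the same function can score an algorithm-built bin, an expert knowledge bin, or a single amino acid position for comparison.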

Risk and Safety Evaluation:

There are no major hazards associated with conducting the project because all of the setup and experimentation will be done on a computer. Potential risks associated with using a computer are the common hazards of the internet. I will protect myself from these hazards by only downloading data and journal articles from secure sources, not communicating with strangers, and not revealing any personal information on the internet.

Addendum:

No major changes were made to the procedure or data analysis. The sole addition was the calculation of statistics to measure association between individual amino acid positions and kidney graft failure (which served as an additional point of comparison to demonstrate the utility of bins constructed by the algorithm). The name of the proposed algorithm was finalized as RARE (Relevant Association Rare-variant-bin Evolver).

It is worth noting that I am the first author of a peer-reviewed research paper based on the algorithm described in this project (i.e., objective 1 of this research plan). The manuscript was accepted to and presented at the Genetic and Evolutionary Computation Conference as a workshop paper and has been published in GECCO '21: Proceedings of the Genetic and Evolutionary Computation Conference (link: https://doi.org/10.1145/3449726.3463174). The application of the algorithm, also described in this project, resulted in an abstract of which I am first author. The abstract was accepted to and presented at the American Society for Histocompatibility and Immunogenetics 47th Annual Meeting and also published in Human Immunology. The code has been made open source at https://github.com/UrbsLab/RARE.

Abstract for Project:

Features with rare states, such as rare genetic variants and low frequency amino acid mismatches, pose a challenge for statistical and machine learning analyses due to limited detection power and uncertainty surrounding their role in predicting outcomes such as disease phenotype. Bins of rare variant features hold the potential to increase association detection power. However, previous binning approaches were wholly dependent on prior-knowledge assumptions, instead of data-driven techniques, and ignored multivariate interactions. Previous binning methods were also confined to genetics despite the pertinence of rare variant features across biomedicine and other scientific fields. This study develops the Relevant Association Rare-variant-bin Evolver (RARE), the first evolutionary algorithm for automatically constructing and evaluating rare variant bins with either univariate or epistatic associations, offering flexibility to initialize bins either with or without expert knowledge. RARE’s ability to correctly bin simulated rare-variant associations is evaluated over a variety of algorithmic and dataset scenarios. Specifically, this study examines (1) ability to detect rare variant bins of univariate effect (with varying levels of noise), (2) using fixed vs. adaptable bin sizes, (3) employing expert knowledge to initialize bins, and (4) ability to detect rare variant bins interacting with a separate common variant. Results demonstrating the feasibility and efficacy of RARE are presented alongside an application of the algorithm in constructing bins of donor-recipient HLA amino acid positions associated with kidney graft failure in order to elucidate the role of HLA mismatches in transplantation.

Questions and Answers

1. What was the major objective of your project and what was your plan to achieve it?

The core objective of the project was to develop a method for association analysis of rare variant features (e.g., rare genetic variants or low frequency donor-recipient amino acid mismatches). Central considerations included developing a method (1) sensitive to interactions such as epistasis, (2) not dependent on expert/domain knowledge like previous methods, and (3) versatile for applications outside of just genetics.

To achieve this goal, the first step was to conceive of, map out, and design the algorithm, which I did with the guidance of Dr. Ryan J. Urbanowicz (the PI of the UPenn lab where I was employed as a research intern). It was critical to have pseudocode with clear algorithmic steps for the evolutionary algorithm (the specific machine learning strategy for constructing/optimizing the bins of rare variant features). After coding the project in Python, there was a significant period of time for debugging, working to improve the algorithm, and cleaning the code before testing it and making it open source. To validate the algorithm and demonstrate that it fulfilled the objectives, I developed rare variant data simulators.

Having a fully fleshed out research plan and simulation study prior to the computational experiments and data collection was critical for being organized and successful. The team of professors and researchers I was working with had biweekly meetings (and I had weekly meetings with my PI) to review progress and discuss ideas.

a. Was that goal the result of any specific situation, experience, or problem you encountered?

Yes, I first noticed the gap in rare variant association analysis methods when approaching the problem of analyzing the relationship between HLA amino acid position mismatches and kidney transplantation failure. Simple univariate methods lacked statistical power (they essentially failed to provide informative results on the data) and all previous methods for binning rare variant features relied on expert knowledge and were restricted to genetics.

Discussions with my PI and further exploration of the literature underscored the need for a tool for rare variant association analysis that overcame the limitations of previous methods.

b. Were you trying to solve a problem, answer a question, or test a hypothesis?

I think all three. I was trying to solve the problem of overcoming the limitations of previous methods for rare variant association analysis by developing a method that not only could detect interactions and was flexible in terms of operating with or without expert knowledge but also could translate to fields beyond genetics. I was also trying to solve the problem of understanding the role of HLA amino acid mismatches in kidney transplantation (given the difficulty of doing so due to the multicollinearity and low frequency of mismatches at many HLA amino acid positions). Understanding kidney transplantation is especially important given the prevalence of transplantation failure and the increased risk of transplantation failure in third world countries.

I was trying to answer the following questions. (1) How can a rare variant association analysis tool take into account interactions between features? (2) How does a software developer balance reducing computational time/expense with algorithm accuracy in the context of evolutionary algorithms? (3) What can machine learning uncover in HLA and transplantation, and specifically, what are the most important amino acid positions and combinations/bins of amino acid positions for kidney transplantation failure? Other questions were specifically posed by collaborators who are immunologists, clinicians, and researchers studying HLA and transplantation. For example, what is the role of mismatches at the HLA-DRB1 locus vs. the HLA-DQB1 locus in graft failure (this is a very pertinent question with a lot of uncertainty)?

I was testing various hypotheses; I believed that the data-driven approach of our proposed RARE evolutionary algorithm would outperform expert knowledge methods, and I also sought to evaluate our immunologist and transplantation scientist collaborators’ hypothesis about the importance of HLA-DRB1 (over HLA-DQB1 and Class 1 HLA loci) for transplantation.

2. What were the major tasks you had to perform in order to complete your project?

The first steps were the conception and mapping out of the project. A crucial component of the project was the pseudocode I wrote out and got approved (this was akin to an outline for an essay). Next, I coded the RARE algorithm and then went through a debugging phase to shore up any errors or weaknesses in the algorithm. While starting to test the algorithm, I coded two rare variant data simulators and then set up a simulation study to validate various pertinent characteristics of the RARE algorithm (e.g., ability to construct bins with univariate association, ability to construct bins with epistatic interaction, and ability to use expert knowledge). After RARE was validated, I drafted the workshop paper for the Genetic and Evolutionary Computation Conference, and then proceeded to focus on applying RARE to the HLA amino acid and transplantation data. At this point, I also cleaned up the code, worked on code interpretability, and made the code open source on the lab’s GitHub. Various discussions with collaborators framed the best approach for analyzing transplantation failure with RARE. After applying RARE to the transplantation data, I wrote a draft of the abstract for the American Society for Histocompatibility and Immunogenetics annual meeting.

a. For teams, describe what each member worked on.

I’m submitting this project as an individual, so it’s not a team project. However, the project was done with the collaboration and guidance of a team of professors, including Dr. Ryan J. Urbanowicz (Penn Medicine, and now moving to Cedars-Sinai), Dr. Malek Kamoun (Penn Medicine), and Dr. Loren Gragert (Tulane). I also worked with Grace L. Wager, a PhD student at Tulane. I was the only high school student or undergraduate involved in this project.

I led the conception of the algorithm with the guidance of my PI (Dr. Ryan Urbanowicz). I led all the development, coding, validation, and data analysis parts of the project (i.e., I did the vast majority of the work by myself and had support/guidance from my PI and collaborators). I also wrote the full drafts and prepared the figures for the two peer-reviewed publications from this project (a paper in the Genetic and Evolutionary Computation Conference and an abstract in the American Society for Histocompatibility and Immunogenetics annual meeting) and was at the heart of the revision/edits/submission for these publications (I am first author of both publications).

3. What is new or novel about your project?

The project’s aims (especially in overcoming the limitations of previous methods), the algorithm’s approach, and the application of the algorithm are all new and novel. The novelty is explained further below.

a. Is there some aspect of your project's objective, or how you achieved it that you haven't done before?

The development of an evolutionary algorithm is something I hadn’t done before; although I had used different machine learning methods, developing this type of algorithm with optimization inspired by natural selection was a great learning opportunity. I also was completely new to the HLA space and the field of transplantation. Finally, the use of statistics in this project (as well as ongoing further work) was new to me and I learned a lot about statistical inference throughout the process.

b. Is your project's objective, or the way you implemented it, different from anything you have seen?

The project’s use of an evolutionary algorithm as a binning strategy for rare variants is novel; this evolutionary algorithm approach takes a data-driven approach that constructs bins to optimize their association to outcome rather than making assumptions based on domain knowledge like previous methods do.

The algorithm’s options for univariate (chi-square) scoring and Relief-based scoring (the MultiSURF algorithm, which is sensitive to interactions) are also different from most previous binning approaches and rare variant association analysis methods. This is because previous methods did not consider interactions (e.g., epistasis) between features or bins of rare variant features.

RARE’s flexibility for random initialization to start the construction of rare variant bins is also novel; all previous rare variant binning methods require expert knowledge input for initialization.

Finally, RARE is the first machine learning tool that can bin rare variant features from any problem domain, as long as feature values are encoded as 0s, 1s, and 2s. All previous methods were confined to genetics, but RARE expands rare variant feature binning to all fields given that there are features with rare variant states in nearly all sources of data. This project demonstrates the versatility of RARE to bin features beyond just rare genetic variants by applying RARE on donor-recipient HLA amino acid mismatches.

c. If you believe your work to be unique in some way, what research have you done to confirm that it is?

A literature review of 39 papers was conducted to inform the paper written on RARE, and I conducted an additional extensive search of various rare variant association methods and numerous evolutionary algorithms to confirm the novelty of the RARE concept. All notable previous rare variant association analysis methods were studied rigorously and numerous reviews on rare genetic variants were studied to further confirm the uniqueness of the ideas in this project. The novelty of a rare variant binning method being applied to HLA amino acid mismatches and transplantation was confirmed by an extensive literature review of prediction and association analysis methods.

4. What was the most challenging part of completing your project?

The most challenging part of the project was probably debugging RARE after I initially coded it; evolutionary algorithms rely on a lot of moving parts, and any small error, whether a conceptual mistake or a code bug, compounds across evolutionary cycles because each generation is built from the previous one. The rare variant data simulators used to validate RARE were also particularly difficult to conceive, plan, and execute.

a. What problems did you encounter, and how did you overcome them?

After developing RARE, the debugging phase presented various complications and hurdles to overcome. There were various issues with the ‘evolutionary pressure’ in RARE’s evolutionary algorithm that led me to change the probabilities of crossover and mutation (parts of the genetic operators carried out during each evolutionary cycle of RARE). There was also an issue with losing strong bins across evolutionary cycles that led me to integrate elitism (top bins in each generation are preserved for the next generation while poor bins are automatically eliminated). I also upgraded the type of crossover from one-point crossover to uniform crossover. Another change I made during this debugging phase was adding a random seed to ensure reproducibility (for both RARE and its rare variant data simulators).
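The pieces named above (elitism, uniform crossover, mutation, and a random seed) can be sketched as one evolutionary cycle. This is a minimal sketch under stated assumptions, not RARE's actual implementation: bins are represented as frozensets of feature names, parents are drawn uniformly at random, mutation flips each feature's membership independently, and every function and parameter name here is hypothetical.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Features shared by both parents stay in both children; every other
    feature goes to one child or the other with equal probability."""
    child_a, child_b = set(parent_a & parent_b), set(parent_a & parent_b)
    for feature in sorted(parent_a ^ parent_b):
        (child_a if rng.random() < 0.5 else child_b).add(feature)
    return child_a, child_b


def mutate(bin_features, all_features, mutation_prob, rng):
    """Flip each feature's membership in the bin with probability mutation_prob."""
    mutated = set(bin_features)
    for feature in sorted(all_features):
        if rng.random() < mutation_prob:
            mutated.symmetric_difference_update({feature})
    if not mutated:  # assumption: never hand back an empty bin
        mutated.add(rng.choice(sorted(all_features)))
    return frozenset(mutated)


def next_generation(population, score_fn, all_features, rng,
                    elitism=0.2, crossover_prob=0.8, mutation_prob=0.1):
    """Build the next population: keep the top bins, fill the rest with offspring."""
    ranked = sorted(population, key=score_fn, reverse=True)
    n_elite = max(1, int(elitism * len(ranked)))
    new_population = ranked[:n_elite]  # elitism: top bins survive unchanged
    while len(new_population) < len(population):
        parent_a, parent_b = rng.sample(ranked, 2)  # assumed selection scheme
        if rng.random() < crossover_prob:
            children = uniform_crossover(parent_a, parent_b, rng)
        else:
            children = (parent_a, parent_b)
        for child in children:
            new_population.append(mutate(child, all_features, mutation_prob, rng))
    return new_population[:len(population)]


# Toy usage: score a bin by its size purely to exercise the cycle.
features = ["f1", "f2", "f3", "f4", "f5", "f6"]
population = [
    frozenset({"f1"}),
    frozenset({"f2", "f3"}),
    frozenset({"f4"}),
    frozenset({"f1", "f5", "f6"}),
    frozenset({"f2"}),
]
generation = next_generation(population, len, features, random.Random(42))
```

Passing a seeded random.Random instance, and iterating over features in sorted order, makes a run repeatable: the same seed and the same starting population produce the same next generation.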

My computer lacked the computational power to carry out the many replicates required to fully validate RARE (30 replicates of 9 experiments). I overcame this issue by remotely connecting to a supercomputer at Penn Medicine and running trials there.

The rare variant data simulators were quite “meta” and difficult to code because I had to develop methods to indirectly create bins with either univariate or epistatic associations to outcome. The bins were additive combinations of features, so I had to find a way to assign feature values such that additive bins had univariate or epistatic associations to outcome (this indirect approach required a lot of thinking and planning; I found it to be a great computational learning experience).
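One way to picture the indirect approach for the univariate case is sketched below. This is an assumed, simplified scheme and not the simulator used in the project: outcome is assigned first, each case then receives a nonzero value in exactly one randomly chosen predictive feature, and noise features are set at a low frequency independent of outcome. All names and parameter values are hypothetical.

```python
import random


def simulate_univariate_bin(n_samples, predictive, noise, rare_freq, seed):
    """Simulate rows whose additive bin of `predictive` features tracks outcome,
    while each predictive feature alone stays rare."""
    rng = random.Random(seed)  # seeded for reproducibility
    rows, outcomes = [], []
    for i in range(n_samples):
        outcome = i % 2  # balanced cases and controls
        row = {f: 0 for f in predictive + noise}
        if outcome == 1:
            # Spread the signal: only one predictive feature is nonzero per case.
            row[rng.choice(predictive)] = 1
        for f in noise:
            if rng.random() < rare_freq:
                row[f] = 1  # rare state unrelated to outcome
        rows.append(row)
        outcomes.append(outcome)
    return rows, outcomes


predictive = ["p%d" % i for i in range(10)]
noise = ["n%d" % i for i in range(40)]
rows, outcomes = simulate_univariate_bin(200, predictive, noise, 0.05, seed=7)
bin_sums = [sum(row[f] for f in predictive) for row in rows]
```

Under this scheme the sum over the predictive features equals the outcome for every sample, yet any single predictive feature is nonzero in only a fraction of the cases, which is the situation a binning method is meant to recover.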

b. What did you learn from overcoming these problems?

One thing I learned is that it’s important to keep referring back to the scientific literature while working on a project. Often, after completing a literature review and a research plan, one may feel that the background study is complete, but continuing to consult primary literature and review articles can help guide approaches to problem solving during a project (particularly a computational one).

I also learned about the importance of reproducibility (especially when dealing with a large number of trials such as the replicates of experiments carried out during validation of RARE) and documentation (especially when making code open source).

5. If you were going to do this project again, are there any things you would do differently the next time?

If I were to do the RARE project again, I’d look into other forms of optimization to construct bins. Bayesian optimization and gradient descent would be options. I believe evolutionary algorithm optimization was the most suitable because of the clear units making up components to be optimized (in this case rare variant features being parts of bins of rare variant features), but it might be worth considering other forms of optimization.

From what I learned over the course of the project, I would probably start with a scikit-learn framework. A future goal that my collaborators and I have is to make the RARE algorithm easy to use as a scikit-learn package. Currently the RARE algorithm is open source on GitHub and the goal is to transform it into a scikit-learn package, but perhaps it could have been initially coded as a scikit-learn package (though this would have made initial testing slightly cumbersome).

I would also improve the selection of RARE’s hyperparameters (mutation probability, crossover probability, and elitism proportion). These parameters were selected based on what is considered normal in evolutionary algorithms (0.2-0.4 for elitism, 0.8 for crossover, 0.1 for mutation), but if I could do the project again, I’d implement a hyperparameter grid search to ensure the strongest hyperparameters are chosen for the algorithm.
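A hyperparameter grid search of the kind described could look like the sketch below. The evaluate callback, the parameter names, and the grid values are assumptions for illustration; in practice evaluate would run the algorithm with the given settings and return a quality measure, whereas the stand-in used here is a toy function with no connection to real results.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best-scoring parameters."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(elitism, crossover_prob, mutation_prob):
    """Stand-in objective (hypothetical): peaks at 0.3 / 0.8 / 0.1."""
    return -((elitism - 0.3) ** 2
             + (crossover_prob - 0.8) ** 2
             + (mutation_prob - 0.1) ** 2)


grid = {
    "elitism": [0.2, 0.3, 0.4],
    "crossover_prob": [0.6, 0.8, 1.0],
    "mutation_prob": [0.05, 0.1, 0.2],
}
best_params, best_score = grid_search(toy_evaluate, grid)
```

The cost grows multiplicatively with the grid: three values for each of three hyperparameters already means 27 runs, before accounting for replicates.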

6. Did working on this project give you any ideas for other projects?

Working on this project gave me many ideas for further projects with the RARE algorithm as well as HLA amino acid mismatches and transplantation. Many of these future ideas are currently under development. First, I am working on integrating additional scoring metrics to RARE (mutual information and rank-biserial correlation) in addition to chi-square scoring and MultiSURF scoring. Different scoring metrics are relevant in different problem domains as they each have pros/cons in interpretability, efficiency, ability to capture directionality, and ability to capture nonlinear relationships. Also, I’m currently running RARE analyses on the transplantation data for transplantation failure within one year and beyond one year, as it is hypothesized that different amino acid positions (and thus different bins of amino acid positions) are important when looking at transplantation survival within one year vs. beyond one year. My collaborators have recently expressed further interest in interactions in HLA amino acid mismatches, so I have been developing and testing methods such as Association Rule Mining, Multifactor Dimensionality Reduction, and various new statistical methods I’ve conceived with my PI.
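For the mutual information metric mentioned above, a minimal sketch of how a bin could be scored against outcome follows. It uses the standard plug-in estimate in bits for discrete values; the function name and the toy inputs are assumptions, and this is not the implementation under development.

```python
from collections import Counter
from math import log2


def mutual_information(values, outcomes):
    """Plug-in estimate of mutual information (in bits) between two discrete lists."""
    n = len(values)
    joint = Counter(zip(values, outcomes))
    value_counts = Counter(values)
    outcome_counts = Counter(outcomes)
    mi = 0.0
    for (v, o), count in joint.items():
        p_joint = count / n
        p_independent = (value_counts[v] / n) * (outcome_counts[o] / n)
        mi += p_joint * log2(p_joint / p_independent)
    return mi


# Toy inputs (made up): one bin that tracks outcome, one that is unrelated.
informative = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
uninformative = mutual_information([1, 0, 1, 0], [1, 1, 0, 0])
```

Unlike a correlation coefficient, this score carries no direction, which is one of the trade-offs between metrics that the answer above alludes to.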

7. How did COVID-19 affect the completion of your project?

The coronavirus pandemic shifted my employment as a research intern at the Department of Biostatistics, Epidemiology, and Informatics at the Perelman School of Medicine at the University of Pennsylvania to a virtual format. Since the project primarily involved computer programming, this was not a major issue, though the loss of opportunities for in-person work confined the project to a certain degree by limiting my involvement in wet lab work (visiting students and researchers were, and still are, not allowed to be in person at the vast majority of universities).