Mercer Science and Engineering Fair | Fostering STEM Education in Mercer County, NJ | Affiliated with Regeneron ISEF and Thermo Fisher Junior Innovators Challenge

Star, galaxy, quasar and star spectral types classification with broadband photometry

Fair: 2021 Mercer Science and Engineering Fair

Event: Senior Division 2021

Category: Mathematics, Physics and Astronomy

Student: Zhixin Wang

Table: MATH1704

Experimentation location: No experiment

Regulated Research (Form 1c): No

Project continuation (Form 7): No

Abstract:

We combine data from the Stellar Abundance and Galactic Evolution (SAGE) survey and the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, 704k sources in total, to train a random forest classifier that separates stars, galaxies, and quasars and assigns spectral types to stars. We select all sources in the photometric catalog up to g<19 that are matched to a spectrum in the LAMOST survey. The model is evaluated on the validation set to choose the optimal hyperparameters. Overall, the most important feature for separating stars from galaxies and quasars is the color $v$-$g$, and the most important feature for the classification of stars is $g$-$i$. We further investigate the misclassified samples by inspecting their spectra from the LAMOST survey and their images from the Sloan Digital Sky Survey. The accuracy of the final model on the test set, after removing contaminated sources with incorrect labels, is 99.54\%. The $F_1$ scores for stars, galaxies, and quasars are 0.998, 0.935, and 0.770, respectively. For the classification of stars, the overall accuracy is 82.09\%. Potential limitations of the model result from the lack of training samples for particular classes of stars, the proximity of stars and certain quasars in color space, the lack of morphological information, the limitations of a 2-dimensional dust map, redshifting of the SED, and sample selection bias.

Bibliography/Citations:

[1] Bolton, A., Schlegel, D., Aubourg, É., Bailey, S., Bhardwaj, V., Brownstein, J., Burles, S., Chen, Y., Dawson, K., Eisenstein, D., Gunn, J., Knapp, G., Loomis, C., Lupton, R., Maraston, C., Muna, D., Myers, A., Olmstead, M., Padmanabhan, N., Pâris, I., Percival, W., Petitjean, P., Rockosi, C., Ross, N., Schneider, D., Shu, Y., Strauss, M., Thomas, D., Tremonti, C., Wake, D., Weaver, B. and Wood-Vasey, W., 2012. Spectral classification and redshift measurement for the SDSS-III Baryon Oscillation Spectroscopic Survey. The Astronomical Journal, 144(5), p.144.

[2] Logan, C. and Fotopoulou, S., 2020. Unsupervised star, galaxy, QSO classification. Astronomy & Astrophysics, 633, p.A154.

[3] Kovács, A. and Szapudi, I., 2015. Star–galaxy separation strategies for WISE-2MASS all-sky infrared galaxy catalogues. Monthly Notices of the Royal Astronomical Society, 448(2), pp.1305-1313.

[4] Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp.5-32.

[5] Zheng, J., Zhao, G., Wang, W., Fan, Z., Tan, K., Li, C. and Zuo, F., 2019. Test area of the SAGE survey. Research in Astronomy and Astrophysics, 19(1), p.003.

Additional Project Information

Project website: -- No project website --

Presentation files:

Zhixin Wang - Scientific Project.pptx

Research paper:

Research paper - Zhixin Wang.pdf

Additional Resources: -- No resources provided --

Research Plan:

A. Rationale

Astronomical surveys create massive volumes of data, which require efficient and robust algorithms for automated identification, classification, and parameter estimation of sources. Classifying sources is a fundamental step for further studies of the data, such as stellar parameter estimation or photometric redshift calculation. There are multiple well-developed methods for classifying spectroscopic data by fitting emission and absorption lines (e.g. [1]). The classification of multi-wavelength photometric data, however, is a more challenging problem, since the magnitude measurements capture only the overall shape of the spectrum. Previous studies, such as [2] and [3], have used various unsupervised and supervised machine learning methods to classify sources. This project will focus on testing machine learning algorithms such as random forest [4] on photometric data from the Stellar Abundance and Galactic Evolution (SAGE) survey [5].

B. Research Questions

0. How should training samples be collected for the supervised learning algorithm?

1. Which machine learning algorithm is most suited for the classification of stars, galaxies, and quasars?

2. What pre-processing of the raw data (magnitudes) is necessary to build a model that accurately predicts the labels (e.g. correcting for extinction)?

3. For which type of object would the model have higher/lower predictive power? What are the explanations for this in astronomy? (i.e. related to the physical nature of those objects)

4. What are the predictors that can be included in a model, including but not limited to those given in the dataset?
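
For question 2, one common pre-processing step is to deredden the magnitudes before forming colors. The relation below is the standard form of that correction, not necessarily the exact procedure used in this project. The extinction coefficients $R_x$ for each band are assumed inputs whose values are not given here, and $E(B-V)$ would come from a dust map.

```latex
% Extinction-corrected magnitude in band x, with R_x = A_x / E(B-V)
m_{x,0} = m_x - R_x \, E(B-V)

% Extinction-corrected color, e.g. for the v and g bands
(v-g)_0 = (v-g) - (R_v - R_g) \, E(B-V)
```

Because only the difference $R_v - R_g$ enters the color, an error in $E(B-V)$ affects colors less than it affects individual magnitudes.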

C. Description in detail of method or procedures

1. Retrieve data from the SAGE survey and one spectroscopic survey (e.g. SDSS or LAMOST).

2. Cross-match the sources and select the SAGE sources that match a labeled source in the spectroscopic dataset.

3. Train, tune, and optimize models using the labels and photometric data from SAGE.

4. Test the final model on the test set and measure its effectiveness.

5. Analyze the outputs of the model (labels, probabilities, etc.) and identify the possible causes of misclassification.
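
Measuring effectiveness in step 4 usually means reporting overall accuracy together with per-class precision, recall, and $F_1$, because accuracy alone is dominated by the most common class. The following is a minimal sketch in Python (the project itself was done in R). The function name `per_class_scores` and the toy labels in the usage note are illustrative assumptions, not the project's code or data.

```python
from collections import Counter


def per_class_scores(true_labels, predicted_labels):
    """Return (accuracy, {label: (precision, recall, f1)}) for two label lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    true_count = Counter(true_labels)       # sources per true class
    pred_count = Counter(predicted_labels)  # sources per predicted class
    true_pos = Counter()                    # correct predictions per class
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            true_pos[truth] += 1

    scores = {}
    for label in sorted(set(true_count) | set(pred_count)):
        tp = true_pos[label]
        precision = tp / pred_count[label] if pred_count[label] else 0.0
        recall = tp / true_count[label] if true_count[label] else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = (precision, recall, f1)

    accuracy = sum(true_pos.values()) / len(true_labels)
    return accuracy, scores
```

Calling `per_class_scores(["star", "star", "quasar"], ["star", "star", "star"])` gives a high accuracy but an $F_1$ of zero for the quasar class, which is the kind of imbalance effect that per-class scores are meant to expose.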

D. Bibliography

[2] Logan, C. and Fotopoulou, S., 2020. Unsupervised star, galaxy, QSO classification. Astronomy & Astrophysics, 633, p.A154.

[4] Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp.5-32.

[5] Zheng, J., Zhao, G., Wang, W., Fan, Z., Tan, K., Li, C. and Zuo, F., 2019. Test area of the SAGE survey. Research in Astronomy and Astrophysics, 19(1), p.003.

Questions and Answers

1. What was the major objective of your project and what was your plan to achieve it?

The major objective was to explore the use of machine learning methods for classifying photometric data. My plan was to retrieve data from two sky surveys and analyze them.

a. Was that goal the result of any specific situation, experience, or problem you encountered?

No.

b. Were you trying to solve a problem, answer a question, or test a hypothesis?

I was trying to solve the problem of classifying stars, galaxies, and quasars.

2. What were the major tasks you had to perform in order to complete your project?

a. For teams, describe what each member worked on.

1) Retrieving and combining different datasets

2) Processing and visualizing data

3) Fitting different machine learning models to the data and tuning the hyperparameters

4) Testing the performance of the final model and analyzing its outputs

5) Using other tools to study the misclassified samples and find possible improvements for the model

3. What is new or novel about your project?

a. Is there some aspect of your project's objective, or how you achieved it that you haven't done before?

Before this project, when I found an object in a sky survey (e.g. on the SDSS SkyServer), I would classify it manually based on its shape or measured spectrum. In this project I learned to use machine learning algorithms for this task, which is usually more efficient and accurate, and my algorithm does not require any spectroscopic measurements.

b. Is your project's objective, or the way you implemented it, different from anything you have seen?

My original plan was to compare different algorithms (supervised, unsupervised, etc.) on the dataset, but I ran into several difficulties, so I changed the objective to using only one method. Another creative aspect of my method is that I closely analyzed several factors that might lead to misclassification and highlighted possible improvements.

c. If you believe your work to be unique in some way, what research have you done to confirm that it is?

I have read several previous studies on this topic (including those cited above), and I have to say that some of them are very similar to mine. For that reason, my original objective was to compare multiple methods and possibly use a more novel deep learning method. However, as I will explain in the next section, I was only able to implement one method. Still, I think my work is unique in two ways: it combines star, galaxy, and quasar classification with stellar spectral type classification, which provides more information on the stars observed in the survey; and it further investigates most of the misclassified samples using other data sources to verify the model.

4. What was the most challenging part of completing your project?

a. What problems did you encounter, and how did you overcome them?

1) The first problem, which I encountered immediately after I began working on this project, was source cross-matching. As explained in the previous sections, I needed to match each source in the SAGE catalog with its corresponding spectrum (and label) in the LAMOST catalog, based on the coordinates. Initially, I wrote a simple algorithm that compares the coordinates of every pair of points, resulting in an O(N^2) runtime (where N is the approximate number of sources in both catalogs). This turned out to be too slow for the large dataset, so I used a data structure known as an R-tree to query and match the points efficiently in O(N log N) runtime.

2) Another problem I encountered was also related to runtime efficiency. Of the three machine learning algorithms I tried on the dataset, light gradient boosting (lightgbm), extreme gradient boosting (xgboost), and random forest, none could finish running in a reasonable amount of time (<=8 hours) on the entire dataset. After doing more research into R packages, I found ranger, a very efficient R implementation of the random forest algorithm that could finish running in around one hour. However, I could not find more efficient implementations of the boosting algorithms, so my research used only one machine learning method: random forest.
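
The cross-matching speed-up in item 1) can be illustrated with a short sketch. This is not the R-tree used in the project. It swaps in a simpler spatial index, a uniform grid over 3D unit vectors, which also avoids comparing every pair of sources. The function names, the default 1 arcsecond match radius, and the input format (lists of `(ra_deg, dec_deg)` tuples) are assumptions made for illustration.

```python
import math
from collections import defaultdict


def unit_vector(ra_deg, dec_deg):
    """Convert sky coordinates in degrees to a 3D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def cross_match(catalog_a, catalog_b, radius_arcsec=1.0):
    """For each source in catalog_a, return the index of the nearest
    catalog_b source within radius_arcsec, or None if there is none."""
    # Express the angular radius as a chord length between unit vectors.
    radius_rad = math.radians(radius_arcsec / 3600.0)
    chord = 2.0 * math.sin(radius_rad / 2.0)

    def cell_of(vec):
        return tuple(int(math.floor(c / chord)) for c in vec)

    # Bin catalog_b into cubic cells whose side equals the match radius.
    vectors_b = [unit_vector(ra, dec) for ra, dec in catalog_b]
    grid = defaultdict(list)
    for j, vec in enumerate(vectors_b):
        grid[cell_of(vec)].append(j)

    matches = []
    for ra, dec in catalog_a:
        vec = unit_vector(ra, dec)
        cx, cy, cz = cell_of(vec)
        best_j, best_d2 = None, chord * chord
        # A source within the radius must lie in this cell or one of its
        # 26 neighbours, so only those cells are searched.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        d2 = sum((p - q) ** 2
                                 for p, q in zip(vec, vectors_b[j]))
                        if d2 <= best_d2:
                            best_j, best_d2 = j, d2
        matches.append(best_j)
    return matches
```

Working with unit vectors rather than raw right ascension and declination means the match behaves the same near RA = 0/360 degrees and near the poles. Each query inspects only a handful of cells, so for catalogs without extreme crowding the total work grows roughly linearly with the number of sources.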

b. What did you learn from overcoming these problems?

I learned about the applications of algorithms and data structures in astronomy for processing big data. In fact, I found one article specifically about the application of a similar algorithm in astronomy (https://arxiv.org/abs/0801.2004). I also learned that the efficiency of computer programs is very important for research, especially in fields like astronomy that often produce massive amounts of data. It could also be a good idea to create machine learning packages specifically for astronomical data.

5. If you were going to do this project again, are there any things you would do differently the next time?

If I were to do this project again, I would add more novel elements to it. There are two possible ways to do this. First, I could compare different methods (ML or non-ML) for classifying photometric data and include more features in the model, such as morphological parameters. To do so, I would need to match the SAGE catalog with more spectroscopically labeled datasets to increase the amount of training data. I could also investigate the importance of those parameters for separating different classes of objects. The second way is to focus on the original data (images) instead of the extracted magnitude values. For example, I could implement a convolutional neural network (CNN) to classify objects based on their photometric images. Compared with traditional ML methods, this deep learning method removes the need for source extraction or parameter estimation and could exploit more information from the original images.

6. Did working on this project give you any ideas for other projects?

Yes. Through working on this project, I learned about photometry, spectroscopy, and some basic observational astrophysics, which encourages me to work on other projects in astrophysics. Currently, I'm working on a project related to photometric redshifts (i.e. estimating redshifts from photometric data) and large-scale structure (clusters, filaments, etc.). This is also related to this classification project, since it has been shown that classifying sources prior to photometric redshift estimation can improve the redshift estimates (https://arxiv.org/pdf/1808.04977).

7. How did COVID-19 affect the completion of your project?

COVID-19 did not affect the completion of my project very much. The data for my research were downloaded over the internet and analyzed on my laptop.