Available Projects for RET
Research Experiences in Big Data and Machine/Deep Learning for OK STEM Teachers.
Dr. Rittika Shamsuddin
Work/Research accomplished by group during last summer:
To identify the key features driving high or low case values, several interpretability methods were utilized: forward feature selection, backward feature selection, logistic regression (LR) and decision tree (DT) weights on the original data set and on model predictions, and SHAP (Shapley) and LIME values on two black-box models, an XGBoost classifier (XGB) and a ResNet. Each classifier was trained and then run on the pre-processed data. XGB and ResNet were trained on a portion of the data and then run over the full data set to generate the predictions used for additional LR and DT models. All of these results were used to determine whether the methods agree on the features of importance, or whether current interpretability methods fail when faced with a large number of features.
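The agreement check described above can be sketched as follows. This is a minimal illustration, not the study's code: the data is synthetic, and the "top-3 overlap" comparison stands in for the fuller comparison across all the methods listed.

```python
# Sketch: comparing feature rankings from two interpretable models (LR and
# DT) to see whether they agree on the important features. Synthetic data
# stands in for the COVID-19 county data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X, y)
dt = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Rank features by |coefficient| (LR) and impurity importance (DT),
# then check how much the two top-3 lists agree.
lr_rank = np.argsort(-np.abs(lr.coef_[0]))
dt_rank = np.argsort(-dt.feature_importances_)
agreement = len(set(lr_rank[:3]) & set(dt_rank[:3]))
print(f"top-3 overlap between LR and DT: {agreement}/3")
```

The same ranking comparison extends naturally to SHAP or LIME scores on the black-box models.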
The dataset was collected from Kaggle, where it was generated by a notebook that combined the New York Times county-level COVID-19 case and fatality counts, the 2016 CDC Social Vulnerability data, the 2020 Community Health Rankings data, the NOAA Global Surface Summary of the Day for weather data, and Kaiser Family Foundation data for state-level stay-at-home orders. A row was entered for each day on which each of the 3,142 US counties had recorded COVID-19 cases. The features can be broken down into three main categories: socio-economic, weather, and personal health. For the classification labels, 174 cases was chosen as the cutoff value: 174 was the median of all the case counts and almost perfectly balanced the data, with a 49.9%/50.1% split between at most 174 cases (0, or "low") and greater than 174 cases (1, or "high").
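The median-cutoff labeling above is simple to reproduce. A small sketch with synthetic stand-in case counts (the real cutoff, 174, comes from the actual data):

```python
# Sketch of median-cutoff labeling: rows at or below the median case count
# become class 0 ("low"), above it class 1 ("high"). Case counts here are
# synthetic; in the study the median worked out to 174.
import numpy as np

rng = np.random.default_rng(0)
cases = rng.integers(0, 5000, size=3142)  # one value per county-day row

cutoff = np.median(cases)
labels = (cases > cutoff).astype(int)

low_frac = (labels == 0).mean()
print(f"cutoff={cutoff}, low/high split = {low_frac:.1%}/{1 - low_frac:.1%}")
```

Using the median as the cutoff is what makes the resulting classes nearly balanced.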
Sample Projects:
Project 1: Track Mutations to Predict Trajectory of Virus Evolution and Virulence
Project Classification: Healthcare Informatics, Data Mining, Interdisciplinary Research
Project Description: Since women already participate strongly in the biological and psychological sciences, combining healthcare research with computer science can encourage female high school students to explore computer science and data science as viable career options.
The methods used in this work will involve data collection from the public domain and an introduction to computational tools such as BLAST (for local sequence alignment), the RCSB Protein Data Bank (to explore protein structures and conformations), and other exploratory tools from the literature such as GenomeTools (http://genometools.org/index.html). The use of these tools will be complemented with hands-on computational exercises, such as understanding (and possibly implementing) edit distance metrics for string comparison.
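As an example of the kind of hands-on exercise mentioned, a minimal edit-distance (Levenshtein) implementation for comparing short DNA strings could look like this; the example sequences are illustrative:

```python
# Minimal edit (Levenshtein) distance between two strings, using a
# rolling one-row dynamic-programming table.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))        # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# Two same-length DNA strings differing by two point substitutions:
print(edit_distance("GATTACA", "GACTATA"))  # prints 2
```

The same metric underlies the scoring ideas in alignment tools such as BLAST, though those use more elaborate scoring schemes.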
The purpose of this research project is to i) understand DNA (deoxyribonucleic acid) mutations in viruses and their subsequent effects on RNA (ribonucleic acid) sequences and protein structures, and ii) explore the parameters of virulence and use them to quantify virulence. The end result will be an in-depth study of the mutation and virulence history of the chosen virus. Understanding the series of mutations that allows a virus to change its virulence (and sometimes its hosts) will enrich our ability to predict epidemics and pandemics and to update prevention mechanisms against viral global threats.
Project 2: Automating Analysis of Electronic Health Records (EHR)
Project Classification: Data Analysis, Analytical Method, Data Collection
Project Description: The methods of this project will involve cleaning and preprocessing a very large, real-world EHR dataset. It will also include the use of analytical tools such as unsupervised learning (kmeans++, hierarchical clustering) to develop a patient profiling system/framework. The validity of the framework will be verified by studying the improvement in prediction accuracies of simple machine learning techniques when trained on the patient profiles generated by the system.
The first few weeks will be spent on brief lessons/training on the basics of supervised
and unsupervised machine learning. The next few weeks will focus on handling missing
data, and then the remaining weeks will focus on putting the building blocks from
the previous weeks together to create the analytical system and the validation toolset.
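The validation idea described above can be sketched briefly. This is an illustrative outline only: synthetic data stands in for the EHR data set, and the cluster count and classifier are arbitrary choices.

```python
# Sketch: cluster patients with k-means++ to form "profiles", then check
# whether adding the profile label as a feature helps a simple classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

profiles = KMeans(n_clusters=4, init="k-means++", n_init=10,
                  random_state=1).fit_predict(X)

base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
X_aug = np.column_stack([X, profiles])
aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()
print(f"baseline accuracy={base:.3f}, with profiles={aug:.3f}")
```

An accuracy gain from the augmented features would be evidence that the unsupervised profiles capture real structure in the patients.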
Project 3: Understanding Deep Neural Networks
Project classification: Data Collection, Data Analytics, Deep Networks
Project Description: In this project, we will use simple deep neural networks (mainly dense and convolutional networks)
to visualize the concepts and feature maps they have learnt. We will then develop
an analytical simulation system in which systematic data perturbation techniques are used to study the correlation
between the learnt concepts and the data variables.
Despite their great performance, state-of-the-art deep learning models are so complex
that their inner workings are not comprehensible to humans. However, some industries,
such as healthcare, require interpretable solutions that translate easily
into treatment strategies. Thus, in this project we develop a simulation-based technique
to understand how the concepts learnt by simple deep network models relate to the
raw, understandable variables in the data.
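One way to picture the systematic-perturbation idea: shift one input variable at a time and measure how much the network's predictions move. The sketch below uses a small sklearn MLP as a stand-in for a dense deep network, on synthetic data; variables that move the output most are the ones the model's learnt concepts depend on.

```python
# Sketch: per-variable sensitivity of a small dense network, measured by
# perturbing one input column at a time and tracking the probability shift.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

base_prob = net.predict_proba(X)[:, 1]
sensitivity = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] += X[:, j].std()          # systematic shift of one variable
    shift = np.abs(net.predict_proba(Xp)[:, 1] - base_prob).mean()
    sensitivity.append(shift)

print("mean probability shift per variable:", np.round(sensitivity, 3))
```

More elaborate perturbation schemes (scaling, shuffling, masking) fit the same loop.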
Project 4: Preserving Healthcare Time-Series Data Privacy through Data Synthesis
Project Classification: Data Analysis, Privacy Preservation
Project Description: In this project, we will develop a system for synthesizing healthcare time-series data that relies on evolutionary algorithms, such as the genetic algorithm. The developed system can then facilitate research in clinical informatics while preserving patient privacy.
Patient data privacy is an important factor that plays a crucial role in data analysis.
Various data perturbation/encryption/sanitization techniques have been used to promote
privacy, and the trade-off between accuracy and privacy is a major research topic.
Thus, in this project we will use a genetic algorithm-based data synthesis technique
to explore how synthetic data can be used to preserve patient privacy with reduced
loss in accuracy.
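A toy sketch of the genetic-algorithm idea: evolve synthetic series whose summary statistics match a "real" series without copying its values. The fitness function, operators, and signal here are all illustrative stand-ins, not the project's actual design.

```python
# Toy genetic algorithm: evolve 100-point synthetic series whose mean and
# std match a target series. Elitism keeps the best candidates each round.
import numpy as np

rng = np.random.default_rng(42)
real = np.cumsum(rng.normal(0, 1, 100))     # stand-in for a patient signal
target = np.array([real.mean(), real.std()])

def fitness(series):
    stats = np.array([series.mean(), series.std()])
    return -np.abs(stats - target).sum()     # higher (closer to 0) is better

pop = [rng.normal(0, 1, 100) for _ in range(30)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                       # elitism: keep the fittest
    children = []
    for _ in range(20):
        i, j = rng.choice(10, size=2, replace=False)
        cut = int(rng.integers(1, 99))       # one-point crossover
        child = np.concatenate([parents[i][:cut], parents[j][cut:]])
        # mutation: random rescale and shift of the whole series
        child = child * rng.normal(1.0, 0.05) + rng.normal(0.0, 0.5)
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print("target (mean, std):", np.round(target, 2),
      "| best synthetic:", np.round([best.mean(), best.std()], 2))
```

A realistic system would score candidates on richer properties (autocorrelation, trends, clinical plausibility) and on a privacy measure, but the evolve-select loop is the same.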
Dr. Arunkumar Bagavathi
General Description of his projects:
(a) Learning fair representations of news articles: Machine learning models trained on news articles become biased toward the political leaning of news domains, editors, or writing styles. In this research, we examine ways to eliminate such learning/algorithmic bias by incorporating knowledge collected from external trusted data sources such as congressional speeches and political debates. We extract entity-level information from our trusted resources, and global information about news domains from Wikipedia articles, to add more contextual insight to news article representations. We show with our experiments that the proposed model outperforms existing methods, and we also show that it can mitigate algorithmic bias in the representation learning process.
(b) Characterizing animal genome sequences with graph representation learning: Identifying bacterial and viral sequences within a genome sequence using machine learning methods is a challenging task given the scale of genome sequences. In this research, we express the given animal metagenome sequences as a De Bruijn graph and learn representations with graph representation learning algorithms. With representations learned from the graphs alone, we show that we can match the performance of state-of-the-art methods that utilize only the genome sequences.
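The De Bruijn construction itself is compact: nodes are (k-1)-mers and each k-mer in the sequence contributes one edge. A minimal sketch (the sequence and k=4 are arbitrary; graph libraries and the representation learning step are omitted):

```python
# Build a De Bruijn graph from a sequence: each k-mer becomes an edge
# from its (k-1)-length prefix to its (k-1)-length suffix.
from collections import defaultdict

def de_bruijn(seq: str, k: int = 4):
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
    return graph

g = de_bruijn("ATGGCGTGCA")
for node, nbrs in sorted(g.items()):
    print(node, "->", nbrs)
```

Graph representation learning algorithms then operate on this structure rather than on the raw sequence.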
Work/Research accomplished by group during last summer: The group, together with a Ph.D. student, wrote one conference paper, which is currently under review.
Dr. Sharmin Jahan
General Project Idea:
Enhancing security awareness in autonomous systems through the application of explainable AI models
Research in autonomous systems is gaining popularity due to their inherent advantage of improved resiliency while operating in dynamic environments. The main advantage of an autonomous system is that it adjusts its functionality with minimal or no human intervention, which we call adaptation. But adaptation has the potential to introduce new security vulnerabilities into the system, so security has become one of the primary concerns for autonomous systems. The system should know its own security profile, interpret the operational environment's security state, and assess its ability to choose the best adaptation. But a dynamic environment includes uncertainty that is not always possible to predict in advance, so pre-defined rules are not enough for interpreting the environment. This research aims to embed security awareness by interpreting a dynamic operational environment with potential uncertainty using an explainable AI model, which complements the choice of the optimal adaptation so the system can continue its operation and maintain its security capabilities. The RET teachers will explore challenges related to explainable AI in security across different application domains, along with potential solutions to those challenges.
Research Accomplishments Last year:
Published a conference paper as an outcome of the research work:
Abulfaz Hajizada and Sharmin Jahan. (2023). Feature Selections for Phishing URLs Detection Using Combination of Multiple Feature Selection Methods. In Proceedings of the 2023 15th International Conference on Machine Learning and Computing (ICMLC).