Available Projects for RET
Research Experiences in Big Data and Machine/Deep Learning for OK STEM Teachers.
Dr. Rittika Shamsuddin
Work/Research accomplished by group during last summer:
To identify the key features driving high or low case values, several interpretability methods were utilized: forward feature selection, backward feature selection, logistic regression (LR) and decision tree (DT) weights on the original data set and on model predictions, and SHAP (Shapley) and LIME values on two black-box models, an XGBoost classifier (XGB) and a ResNet. Each classifier was trained and then run on the pre-processed data. XGB and ResNet were trained on a portion of the data and then run over the full data set to generate the predictions used for additional LR and DT models. All of these results were used to determine whether the methods agree on the features of importance, or whether current interpretability methods fail when faced with a large number of features.
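The agreement check described above can be sketched as follows. This is a minimal illustration, not the study's code: the data is synthetic, and the "top-3 overlap" comparison stands in for the fuller comparison across all the methods listed.

```python
# Sketch: comparing feature rankings from two interpretable models (LR and
# DT) to see whether they agree on the important features. Synthetic data
# stands in for the COVID-19 county data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X, y)
dt = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Rank features by |coefficient| (LR) and impurity importance (DT),
# then check how much the two top-3 lists agree.
lr_rank = np.argsort(-np.abs(lr.coef_[0]))
dt_rank = np.argsort(-dt.feature_importances_)
agreement = len(set(lr_rank[:3]) & set(dt_rank[:3]))
print(f"top-3 overlap between LR and DT: {agreement}/3")
```

The same ranking comparison extends naturally to SHAP or LIME scores on the black-box models.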
The dataset was collected from Kaggle, where it was generated by a notebook that combined the New York Times county-level COVID-19 case and fatality counts, the 2016 CDC Social Vulnerability data, the 2020 Community Health Rankings data, the NOAA Global Surface Summary of the Day for weather data, and Kaiser Family Foundation data for state-level stay-at-home orders. A row was entered for each day on which each of the 3,142 US counties had recorded COVID-19 cases. The features can be broken down into three main categories: socio-economic, weather, and personal health. For the classification labels, 174 cases was chosen as the cutoff value: 174 was the median of all the case counts and almost perfectly balanced the data, with a 49.9%/50.1% split between at most 174 cases (0, or "low") and greater than 174 cases (1, or "high").
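The median-cutoff labeling above is simple to reproduce. A small sketch with synthetic stand-in case counts (the real cutoff, 174, comes from the actual data):

```python
# Sketch of median-cutoff labeling: rows at or below the median case count
# become class 0 ("low"), above it class 1 ("high"). Case counts here are
# synthetic; in the study the median worked out to 174.
import numpy as np

rng = np.random.default_rng(0)
cases = rng.integers(0, 5000, size=3142)  # one value per county-day row

cutoff = np.median(cases)
labels = (cases > cutoff).astype(int)

low_frac = (labels == 0).mean()
print(f"cutoff={cutoff}, low/high split = {low_frac:.1%}/{1 - low_frac:.1%}")
```

Using the median as the cutoff is what makes the resulting classes nearly balanced.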
Sample Projects:
Project 1: Track Mutations to Predict Trajectory of Virus Evolution and Virulence
Project Classification: Healthcare Informatics, Data Mining, Interdisciplinary Research
Project Description: Since women already participate strongly in the biological and psychological sciences, combining healthcare research with computer science can encourage female high school students to explore computer science and data science as viable career options.
The methods used in this work will involve data collection from the public domain and an introduction to computational tools such as BLAST (for local sequence alignment), the RCSB Protein Data Bank (to explore protein structures and conformations), and other exploratory tools from the literature such as GenomeTools (http://genometools.org/index.html). The use of these tools will be complemented with hands-on computational exercises, such as understanding (and possibly implementing) edit distance metrics for string comparison.
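As an example of the kind of hands-on exercise mentioned, a minimal edit-distance (Levenshtein) implementation for comparing short DNA strings could look like this; the example sequences are illustrative:

```python
# Minimal edit (Levenshtein) distance between two strings, using a
# rolling one-row dynamic-programming table.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))        # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# Two same-length DNA strings differing by two point substitutions:
print(edit_distance("GATTACA", "GACTATA"))  # prints 2
```

The same metric underlies the scoring ideas in alignment tools such as BLAST, though those use more elaborate scoring schemes.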
The purpose of this research project is to i) understand DNA (deoxyribonucleic acid) mutations in viruses and their subsequent effects on RNA (ribonucleic acid) sequences and protein structures, and ii) explore the parameters of virulence and use them to quantify virulence. The end result will be an in-depth study of the mutation and virulence history of the chosen virus. Understanding the series of mutations that allows a virus to change its virulence (and sometimes its hosts) will enrich our ability to predict epidemics and pandemics and to update prevention mechanisms against viral global threats.
Project 2: Automating Analysis of Electronic Health Records (EHR)
Project Classification: Data Analysis, Analytical Method, Data Collection
Project Description: The methods of this project will involve cleaning and preprocessing a very large, real-world EHR dataset. It will also include the use of analytical tools such as unsupervised learning (kmeans++, hierarchical clustering) to develop a patient profiling system/framework. The validity of the framework will be verified by studying the improvement in prediction accuracies of simple machine learning techniques when trained on the patient profiles generated by the system.
The first few weeks will be spent on brief lessons/training on the basics of supervised
and unsupervised machine learning. The next few weeks will focus on handling missing
data, and then the remaining weeks will focus on putting the building blocks from
the previous weeks together to create the analytical system and the validation toolset.
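The validation idea described above can be sketched briefly. This is an illustrative outline only: synthetic data stands in for the EHR data set, and the cluster count and classifier are arbitrary choices.

```python
# Sketch: cluster patients with k-means++ to form "profiles", then check
# whether adding the profile label as a feature helps a simple classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

profiles = KMeans(n_clusters=4, init="k-means++", n_init=10,
                  random_state=1).fit_predict(X)

base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
X_aug = np.column_stack([X, profiles])
aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()
print(f"baseline accuracy={base:.3f}, with profiles={aug:.3f}")
```

An accuracy gain from the augmented features would be evidence that the unsupervised profiles capture real structure in the patients.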
Project 3: Understanding Deep Neural Networks
Project classification: Data Collection, Data Analytics, Deep Networks
Project Description: In this project, we will use simple deep neural networks (mainly dense and convolutional networks)
to visualize the concepts and feature maps they have learnt. We will then develop
an analytical simulation system in which systematic data perturbation techniques are used to study the correlation
between the learnt concepts and the data variables.
Despite their great performance, state-of-the-art deep learning models are so complex
that their inner workings are not comprehensible to humans. However, some industries,
such as healthcare, require interpretable solutions that translate easily
into treatment strategies. Thus, in this project we develop a simulation-based technique
to understand how the concepts learnt by simple deep network models relate to the
raw, understandable variables in the data.
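One way to picture the systematic-perturbation idea: shift one input variable at a time and measure how much the network's predictions move. The sketch below uses a small sklearn MLP as a stand-in for a dense deep network, on synthetic data; variables that move the output most are the ones the model's learnt concepts depend on.

```python
# Sketch: per-variable sensitivity of a small dense network, measured by
# perturbing one input column at a time and tracking the probability shift.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

base_prob = net.predict_proba(X)[:, 1]
sensitivity = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] += X[:, j].std()          # systematic shift of one variable
    shift = np.abs(net.predict_proba(Xp)[:, 1] - base_prob).mean()
    sensitivity.append(shift)

print("mean probability shift per variable:", np.round(sensitivity, 3))
```

More elaborate perturbation schemes (scaling, shuffling, masking) fit the same loop.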
Project 4: Preserving Healthcare Time-Series Data Privacy through Data Synthesis
Project Classification: Data Analysis, Privacy Preservation
Project Description: In this project, we will develop a system for synthesizing healthcare time-series data that relies on evolutionary algorithms, such as the genetic algorithm. The developed system can then facilitate research in clinical informatics while preserving patient privacy.
Patient data privacy is an important factor that plays a crucial role in data analysis.
Various data perturbation/encryption/sanitization techniques have been used to promote
privacy, and the trade-off between accuracy and privacy is a major research topic.
Thus, in this project we will use a genetic algorithm-based data synthesis technique
to explore how synthetic data can be used to preserve patient privacy with reduced
loss in accuracy.
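A toy sketch of the genetic-algorithm idea: evolve synthetic series whose summary statistics match a "real" series without copying its values. The fitness function, operators, and signal here are all illustrative stand-ins, not the project's actual design.

```python
# Toy genetic algorithm: evolve 100-point synthetic series whose mean and
# std match a target series. Elitism keeps the best candidates each round.
import numpy as np

rng = np.random.default_rng(42)
real = np.cumsum(rng.normal(0, 1, 100))     # stand-in for a patient signal
target = np.array([real.mean(), real.std()])

def fitness(series):
    stats = np.array([series.mean(), series.std()])
    return -np.abs(stats - target).sum()     # higher (closer to 0) is better

pop = [rng.normal(0, 1, 100) for _ in range(30)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                       # elitism: keep the fittest
    children = []
    for _ in range(20):
        i, j = rng.choice(10, size=2, replace=False)
        cut = int(rng.integers(1, 99))       # one-point crossover
        child = np.concatenate([parents[i][:cut], parents[j][cut:]])
        # mutation: random rescale and shift of the whole series
        child = child * rng.normal(1.0, 0.05) + rng.normal(0.0, 0.5)
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print("target (mean, std):", np.round(target, 2),
      "| best synthetic:", np.round([best.mean(), best.std()], 2))
```

A realistic system would score candidates on richer properties (autocorrelation, trends, clinical plausibility) and on a privacy measure, but the evolve-select loop is the same.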
Dr. Arunkumar Bagavathi
General Description of his projects:
(a) Learning fair representations of news articles: Machine learning models trained on news articles become biased toward the political leaning of news domains, editors, or writing styles. In this research, we examine ways to eliminate such learning/algorithmic bias by incorporating knowledge collected from external trusted data sources such as congressional speeches and political debates. We extract entity-level information from our trusted resources, and global information about news domains from Wikipedia articles, to add more contextual insight to news article representations. We show with our experiments that the proposed model outperforms existing methods, and we also show that it can mitigate algorithmic bias in the representation learning process.
(b) Characterizing animal genome sequences with graph representation learning: Identifying bacterial and viral sequences within a genome sequence using machine learning methods is a challenging task given the scale of genome sequences. In this research, we express the given animal metagenome sequences as a De Bruijn graph and learn representations with graph representation learning algorithms. With representations learned from the graphs alone, we show that we can match the performance of state-of-the-art methods that utilize only the genome sequences.
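The De Bruijn construction itself is compact: nodes are (k-1)-mers and each k-mer in the sequence contributes one edge. A minimal sketch (the sequence and k=4 are arbitrary; graph libraries and the representation learning step are omitted):

```python
# Build a De Bruijn graph from a sequence: each k-mer becomes an edge
# from its (k-1)-length prefix to its (k-1)-length suffix.
from collections import defaultdict

def de_bruijn(seq: str, k: int = 4):
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
    return graph

g = de_bruijn("ATGGCGTGCA")
for node, nbrs in sorted(g.items()):
    print(node, "->", nbrs)
```

Graph representation learning algorithms then operate on this structure rather than on the raw sequence.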
Work/Research accomplished by group during last summer: The group, together with a Ph.D. student, wrote one conference paper, which is currently under review.
Dr. Sharmin Jahan
General Project Idea:
Enhancing security awareness in autonomous systems through the application of explainable AI models
Research in autonomous systems is gaining popularity due to their inherent advantage of improved resiliency while operating in dynamic environments. The main advantage of an autonomous system is that it adjusts its functionality with minimal or no human intervention, which we call adaptation. But adaptation has the potential to introduce new security vulnerabilities into the system, so security has become one of the primary concerns for autonomous systems. The system should know its own security profile, interpret the operational environment's security state, and assess its ability to choose the best adaptation. But a dynamic environment includes uncertainty that is not always possible to predict in advance, so pre-defined rules are not enough for interpreting the environment. This research aims to embed security awareness by interpreting a dynamic operational environment with potential uncertainty using an explainable AI model, which complements the choice of the optimal adaptation so the system can continue its operation and maintain its security capabilities. The RET teachers will explore challenges related to explainable AI in security across different application domains, along with potential solutions to those challenges.
Research Accomplishments Last year:
Published a conference paper as an outcome of the research work:
Abulfaz Hajizada and Sharmin Jahan. (2023). Feature Selections for Phishing URLs Detection Using Combination of Multiple Feature Selection Methods. In Proceedings of the 2023 15th International Conference on Machine Learning and Computing (ICMLC).