Using Machine Learning Tools to Predict Air Quality Levels

All code used in this project can be found here

Introduction

Objective:

This research aims to uncover trends and influencing factors in pollution and air quality data, and to accurately predict the air quality level from pollutant, geographic, and climatic factors.

Research Question: How can we predict air quality levels and identify the key factors influencing air pollution?

Data and Key Insights

Input Variables

The input variables are:

•Pollutant concentrations (PM2.5, PM10, NO2, SO2, and CO)

•Temperature

•Humidity

•Population density

•Proximity to industrial areas.

The target variable classifies air quality into four levels: Good, Moderate, Poor, and Hazardous.

•Good: Clean air with low pollution.

•Moderate: Acceptable, but some pollutants are present.

•Poor: Polluted air that may affect sensitive groups.

•Hazardous: Severe pollution with serious health risks.

Exploratory Data Analysis

Proximity to industrial areas is associated with air quality: greater distance from industrial areas is associated with better air quality outcomes. We also noticed a clear trend in humidity, with high humidity associated with worse air quality outcomes.

Approach

We tested several predictive models to maximize prediction accuracy. The target was one of four air quality levels: Good, Moderate, Poor, and Hazardous.

We first tried a classification method called logistic regression, for which we grouped the air quality levels into just two classes: Acceptable and Unacceptable.

We ran this model to understand which variables have the largest effects on air quality. Although other models can predict at a more granular level, they make the effect of each individual variable harder to interpret.
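As a rough illustration of the grouped analysis, the sketch below fits a two-class logistic regression by gradient descent and reads the sign of each standardized coefficient. It is not the project's code. The grouping in `BINARY_GROUP` (Good and Moderate as Acceptable, Poor and Hazardous as Unacceptable) is an assumption, since the write-up does not state it, and the feature names, rows, learning rate, and epoch count are made-up placeholders. In practice a library implementation would normally be used instead of a hand-written fit.

```python
import math

# ASSUMPTION: the write-up does not say which levels count as "Acceptable".
BINARY_GROUP = {
    "Good": 0, "Moderate": 0,    # Acceptable
    "Poor": 1, "Hazardous": 1,   # Unacceptable
}

def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

# Made-up placeholder rows (NOT project data): CO level, distance to industry in km.
features = ["CO", "proximity_km"]
rows = [[0.9, 12.0], [1.1, 10.5], [1.3, 9.0], [1.2, 11.0],
        [2.0, 4.0], [2.4, 3.5], [2.8, 2.5], [2.2, 5.0]]
levels = ["Good", "Good", "Moderate", "Moderate",
          "Poor", "Poor", "Hazardous", "Hazardous"]

X = standardize(rows)
y = [BINARY_GROUP[level] for level in levels]
weights, intercept = fit_logistic(X, y)
coefficients = dict(zip(features, weights))
```

Because the features are standardized, the sign and relative size of each entry in `coefficients` give a rough indication of which variables push a sample toward the Unacceptable class. This kind of interpretability is what motivates starting with the grouped model.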

Next, we ran a linear discriminant analysis (LDA) model predicting the four levels of air quality. We also built a k-nearest neighbors (KNN) model at the same level of classification as a reference point for comparison with the LDA.
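To make the KNN reference model concrete, here is a minimal sketch of four-class nearest-neighbor voting. It is only an illustration of the technique, not the project's implementation: the value of `k`, the two features, and the training points are all made-up placeholders, and the write-up does not state which settings or scaling were actually used. The LDA model is not shown here.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already-scaled placeholder points (NOT project data): [PM2.5, CO].
train_X = [[0.10, 0.10], [0.20, 0.15], [0.15, 0.20],
           [0.40, 0.35], [0.45, 0.40], [0.38, 0.42],
           [0.65, 0.60], [0.70, 0.68], [0.62, 0.66],
           [0.90, 0.95], [0.95, 0.90], [0.88, 0.92]]
train_y = (["Good"] * 3 + ["Moderate"] * 3
           + ["Poor"] * 3 + ["Hazardous"] * 3)

prediction = knn_predict(train_X, train_y, [0.68, 0.63])
```

KNN depends on distances, so features on different scales (for example temperature and population density) would generally need to be rescaled before a model like this is fitted.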

Results

After testing different predictive models, we decided that the best model was the LDA, which predicted the four air quality levels with an accuracy of 93.95% and a sensitivity of 98.23%.

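The grouped analysis (logistic regression model) yielded an overall accuracy of 96.70%. Because it predicts only two classes, this figure is not directly comparable to the four-level accuracies.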

The KNN model reached 92.6% accuracy but performed poorly on the Hazardous and Poor classes.
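The metrics quoted above can be computed from true and predicted labels as in the sketch below. It shows overall accuracy alongside per-class sensitivity (the share of samples in a class that the model labels correctly). The labels are made up for illustration and are not the project's predictions, and the function names are hypothetical. The write-up does not say which class or averaging the reported sensitivity refers to, so no attempt is made to reproduce that figure.

```python
from collections import Counter

LEVELS = ["Good", "Moderate", "Poor", "Hazardous"]

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted level matches the true level."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def sensitivity_by_class(y_true, y_pred, labels=LEVELS):
    """Per-class sensitivity (recall): correct predictions / actual count."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / actual[label]
            for label in labels if actual[label]}

# Made-up labels for illustration only (NOT the project's results).
y_true = ["Good", "Good", "Good", "Good", "Moderate",
          "Moderate", "Moderate", "Poor", "Poor", "Hazardous"]
y_pred = ["Good", "Good", "Good", "Good", "Moderate",
          "Moderate", "Good", "Poor", "Moderate", "Poor"]

overall = accuracy(y_true, y_pred)
per_class = sensitivity_by_class(y_true, y_pred)
```

In this toy example the overall accuracy is 0.7 even though the Hazardous class is never predicted correctly. This is why a model with high accuracy can still be weak on the rarer, more severe classes, and why per-class sensitivity is worth reporting next to accuracy.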

Conclusion

Effective Prediction & Recommendations

Our models effectively predicted air quality levels, showcasing their reliability in identifying data patterns. Policy recommendations include reducing CO and NO2 emissions via cleaner energy, stricter vehicle standards, and improved industrial processes. Urban planning should prioritize buffer zones to separate industrial areas from residential zones, as proximity to industries strongly influences air quality.

We also recommend promoting green infrastructure and urban greening projects to mitigate pollution effects in densely populated areas.

Another potential application of this research is a predictive map interface, either a website or an app, where users could see a 7-day forecast of the predicted air quality for their location. This would help users with health concerns, or users in areas with poor air quality, stay informed and make better decisions in their daily lives.