EDA Assignment Sheet
Introduction
Now that you have seen some of the introductory data skills in R, it is time to put them to use. This assignment consists of 10 problems, each based on a quantitative dataset that you will download from the listed source. You will analyze the data by creating visualizations such as scatterplots, histograms, and bar plots, and by calculating means, medians, standard deviations, five-number summaries, confidence intervals, and regression lines. Some variables will be continuous and some will be discrete.
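As a quick refresher, the summary statistics named above can be computed with base R functions. The sketch below uses a small vector of made-up values (not taken from any of the assignment datasets), so treat it as an illustration of the function calls rather than a template for a particular problem.

```r
# Hypothetical values, invented purely for illustration
x <- c(12, 15, 9, 20, 14, 18, 11)

mean_x   <- mean(x)     # arithmetic mean
median_x <- median(x)   # middle value of the sorted data
sd_x     <- sd(x)       # sample standard deviation (n - 1 denominator)
five_num <- fivenum(x)  # minimum, lower hinge, median, upper hinge, maximum

print(c(mean = mean_x, median = median_x, sd = sd_x))
print(five_num)
```

Note that `fivenum()` reports Tukey's hinges, which can differ slightly from the quartiles returned by `summary()` or `quantile()`.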
Problem 1: Gender Pay Gap Analysis
Dataset: Gender Pay Gap Data
Description: This dataset contains information on the gender pay gap across different industries. The variables are industry (categorical), median pay for men, and median pay for women (both continuous).
Tasks:
- Calculate the mean and median of the industry-level median pay for men and for women.
- Create a barplot to visualize the median pay for men and women across different industries.
- Calculate the pay gap (men’s median pay minus women’s median pay) for each industry.
- Create a histogram of the pay gap.
- Calculate a 95% confidence interval for the average pay gap.
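One possible way to approach these tasks in base R is sketched below. The data frame `pay` and its column names (`industry`, `men_pay`, `women_pay`) are assumptions made for illustration, and the figures are invented; the downloaded dataset will likely use different names and values. The confidence interval here is the usual t-based interval from `t.test()`, which treats the industries as a sample.

```r
# Hypothetical data: industry names and pay figures are invented for illustration
pay <- data.frame(
  industry  = c("Finance", "Health", "Education", "Retail", "Tech"),
  men_pay   = c(82000, 61000, 54000, 38000, 95000),
  women_pay = c(70000, 55000, 51000, 34000, 84000)
)

# Pay gap defined here as men's median pay minus women's median pay
pay$gap <- pay$men_pay - pay$women_pay

# Side-by-side bars: one pair of bars per industry
bar_heights <- rbind(Men = pay$men_pay, Women = pay$women_pay)
barplot(bar_heights, beside = TRUE, names.arg = pay$industry,
        legend.text = TRUE, ylab = "Median pay",
        main = "Median pay by industry")

# Distribution of the gap across industries
hist(pay$gap, xlab = "Pay gap (men - women)", main = "Histogram of pay gap")

# 95% t-based confidence interval for the mean gap
gap_ci <- t.test(pay$gap, conf.level = 0.95)$conf.int
print(gap_ci)
```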
Problem 2: Racial Disparities in Incarceration Rates
Dataset: US Incarceration Rates
Description: This dataset contains the incarceration rates per 100,000 people (continuous variable) for different racial groups (categorical variable).
Tasks:
- Calculate the mean and standard deviation of incarceration rates for each racial group.
- Create a barplot to visualize the incarceration rates for each racial group.
- Calculate a 95% confidence interval for the average incarceration rate for each racial group.
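The group-wise calculations can be sketched as follows. This assumes the data are in long format, with several observations per group (for example, one per state); the data frame `rates`, its columns, the placeholder group labels, and the helper `summarise_rate()` are all hypothetical names introduced for this example, and the numbers are invented. The interval is computed by hand from the t distribution so that each step is visible.

```r
# Hypothetical long-format data: several observations per group.
# Group labels and rates are invented for illustration.
rates <- data.frame(
  group = rep(c("Group A", "Group B", "Group C"), each = 4),
  rate  = c(410, 380, 450, 400,
            900, 1100, 950, 1050,
            600, 640, 580, 620)
)

# Helper (hypothetical name): mean, SD, and t-based CI for one numeric vector
summarise_rate <- function(x, level = 0.95) {
  n      <- length(x)
  m      <- mean(x)
  s      <- sd(x)
  margin <- qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)
  c(mean = m, sd = s, lower = m - margin, upper = m + margin)
}

# Apply the helper to each group; rows of the result are groups
group_summary <- t(sapply(split(rates$rate, rates$group), summarise_rate))
print(group_summary)

# Bar plot of the group means
barplot(group_summary[, "mean"], ylab = "Incarceration rate per 100,000",
        main = "Mean incarceration rate by group")
```

Calling `t.test()` on each group's values should give the same interval as the manual formula.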
Problem 3: Access to Education by Region
Dataset: Global Education Data
Description: This dataset contains information on the average years of schooling for different regions (continuous variable).
Tasks:
- Calculate the mean and median years of schooling for each region.
- Create a histogram of the years of schooling.
- Calculate a 95% confidence interval for the average years of schooling for each region.
Problem 4: Unemployment Rates by Race and Gender
Dataset: US Unemployment Data
Description: This dataset contains unemployment rates over time (continuous variable) for different racial and gender groups (categorical variables).
Tasks:
- Calculate the mean and standard deviation of unemployment rates for each racial and gender group.
- Create a faceted scatterplot to visualize the unemployment rates for each racial and gender group over time.
- Calculate a 95% confidence interval for the average unemployment rate for each racial and gender group.
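A faceted scatterplot is most easily drawn with the ggplot2 package, which is an add-on and is assumed here to be installed. The sketch below uses a made-up long-format data frame `unemp` with hypothetical columns `year`, `race`, `gender`, and `rate`; the group labels are placeholders and the rates are random numbers, not real figures.

```r
library(ggplot2)  # assumed to be installed; not part of base R

# Hypothetical long-format data: one row per year, race, and gender combination.
# Labels and rates are invented for illustration.
unemp <- expand.grid(
  year   = 2018:2022,
  race   = c("Group A", "Group B"),
  gender = c("Men", "Women"),
  stringsAsFactors = FALSE
)
set.seed(1)  # reproducible made-up rates
unemp$rate <- round(runif(nrow(unemp), min = 3, max = 12), 1)

# One panel per race-by-gender combination, with year on the x-axis
p <- ggplot(unemp, aes(x = year, y = rate)) +
  geom_point() +
  facet_grid(gender ~ race) +
  labs(x = "Year", y = "Unemployment rate (%)",
       title = "Unemployment rate over time by race and gender")
print(p)
```

`facet_grid(gender ~ race)` puts gender in rows and race in columns; `facet_wrap(~ race + gender)` is an alternative if a single row of panels reads better.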
Problem 5: Income Inequality by State
Dataset: US Income Inequality Data
Description: This dataset contains the Gini coefficient for income inequality by state (continuous variable).
Tasks:
- Calculate the mean and median Gini coefficient for the states.
- Create a histogram of the Gini coefficients.
- Calculate a 95% confidence interval for the average Gini coefficient.
Problem 6: Food Insecurity Rates by County
Dataset: USDA Food Insecurity Data
Description: This dataset contains the percentage of households experiencing food insecurity by county (continuous variable).
Tasks:
- Calculate the mean and standard deviation of food insecurity rates for the counties.
- Create a barplot to visualize the food insecurity rates by county.
- Calculate a 95% confidence interval for the average food insecurity rate.
Problem 7: Environmental Pollution and Health Outcomes
Dataset: EPA Air Quality Data
Description: This dataset contains air quality index (AQI) values and asthma rates for different regions (continuous variables).
Tasks:
- Calculate the mean and median AQI and asthma rates for each region.
- Create a scatterplot to visualize the relationship between AQI and asthma rates.
- Calculate the correlation coefficient between AQI and asthma rates.
- Fit a linear regression model to predict asthma rates based on AQI.
- Create a residual plot for the regression model.
- Calculate a 95% confidence interval for the slope of the regression line.
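The correlation and regression steps above can be sketched with base R as shown below. The data frame `air` and its columns `aqi` and `asthma` are hypothetical, and the values are invented so the example runs on its own; substitute the columns from the downloaded data.

```r
# Hypothetical data: AQI values and asthma rates are invented for illustration
air <- data.frame(
  aqi    = c(35, 42, 50, 58, 63, 71, 80, 95),
  asthma = c(6.1, 6.8, 7.0, 7.9, 8.1, 8.8, 9.5, 10.9)
)

# Scatterplot of the raw relationship
plot(air$aqi, air$asthma, xlab = "AQI", ylab = "Asthma rate",
     main = "Asthma rate vs. AQI")

# Pearson correlation coefficient
r <- cor(air$aqi, air$asthma)

# Simple linear regression: asthma rate as a function of AQI
fit <- lm(asthma ~ aqi, data = air)
abline(fit)  # add the fitted line to the scatterplot

# Residual plot: residuals against fitted values, with a reference line at zero
plot(fitted(fit), resid(fit), xlab = "Fitted asthma rate", ylab = "Residual",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# 95% confidence interval for the slope
slope_ci <- confint(fit, "aqi", level = 0.95)
print(r)
print(slope_ci)
```

A residual plot with no visible curve or funnel shape is consistent with a linear model being reasonable; a clear pattern suggests otherwise.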
Problem 8: Access to Clean Water
Dataset: WHO/UNICEF Joint Monitoring Programme for Water Supply, Sanitation and Hygiene
Description: This dataset contains the percentage of the population with access to clean water by country (continuous variable).
Tasks:
- Calculate the mean and standard deviation of access to clean water rates for the countries.
- Create a barplot to visualize the access to clean water rates by country.
- Calculate a 95% confidence interval for the average access to clean water rate.
Problem 9: Child Mortality Rates by Region
Dataset: UNICEF Child Mortality Data
Description: This dataset contains child mortality rates (deaths per 1,000 live births) for different regions (continuous variable).
Tasks:
- Calculate the mean and median child mortality rates for each region.
- Create a histogram of the child mortality rates.
- Calculate a 95% confidence interval for the average child mortality rate.
Problem 10: Literacy Rates by Gender
Dataset: UNESCO Literacy Data
Description: This dataset contains literacy rates for males and females (continuous variables) in different countries.
Tasks:
- Calculate the mean and standard deviation of literacy rates for males and females.
- Create a faceted scatterplot to visualize the literacy rates for males and females by country.
- Calculate a 95% confidence interval for the average literacy rate for males and females.
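If the downloaded data have one column of rates per gender, faceting and grouped summaries are easier after stacking those columns into long format. The sketch below assumes a wide data frame `lit_wide` with hypothetical columns `country`, `male`, and `female`; the country labels are placeholders and the rates are invented.

```r
# Hypothetical wide-format data: country labels and rates are invented
lit_wide <- data.frame(
  country = c("Country A", "Country B", "Country C", "Country D"),
  male    = c(98.0, 85.5, 72.0, 91.0),
  female  = c(97.5, 78.0, 60.5, 89.0)
)

# Stack the two rate columns into long format: one row per country and gender
lit_long <- data.frame(
  country = rep(lit_wide$country, times = 2),
  gender  = rep(c("Male", "Female"), each = nrow(lit_wide)),
  rate    = c(lit_wide$male, lit_wide$female)
)

# Mean and standard deviation of the literacy rate for each gender
gender_mean <- tapply(lit_long$rate, lit_long$gender, mean)
gender_sd   <- tapply(lit_long$rate, lit_long$gender, sd)
print(rbind(mean = gender_mean, sd = gender_sd))
```

With the data in this shape, the `gender` column can serve as the faceting variable in the plot.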