I am interested in education data and predicting the success of students in better ways hopefully to reduce the amount of testing (sampling) that is needed for each student, allowing them and their teachers to spend more time on preductive learning instead of time test taking or preparing for said tests. The data set I chose to explore is the “Regulatory Adjusted Four-Year Cohort Graduation Rates, School Year 2011-12” datset from EDFact, an initiative set forth by the U.S. Department of Education to “put performance data at the center of policy, management, and budget decisions for all K-12 educational programs”. [1]


Loading and Cleaning The Data

The data has some values that need changing for my initial exploratory purposes. These values arise due to the need to protect student information under the Family Educational Rights and Privacy Act (FERPA). For instance, in order to maintain student anonymity, small samples (5 students or less) have their values omitted (replaced with “PS”). I will be ignoring these cells in calculations that use them. Also, medium sized samples (from 6 to 300 students) have varying degrees of “blurring”, where exact graduation percentages are not reported, but ranges are used instead. I replace these ranges with point estimates that are normally distributed, centered around the range’s midpoint, but never allowed to extend past the range’s maximum or minimum.

Additionally, since I’m only looking at the 2011-2012 school year, the 1112 identifier on many of the column names is unnecessary, so I rename those columns, removing that portion of the name. E.g., “ALL_COHORT_1112” becomes “ALL_COHORT”.

Note: All of the above changes were preprocessed and saved into the “graduation_rate_data.csv” file. To see how these calculations were performed, see my raw code file titled “P4_scratchpad.R”.

data <- read.csv("graduation_rate_data.csv")

#Some of the schools were resporting way more students than the others.
#I looked up the largest ever graduating class in the U.S. 
#and it was 1561 students, so I removed all rows with cohorts over 1600.
data <- data %>% filter(ALL_COHORT<1600)

I’m interested in what factors could be affecting the graduation rates, as well. So I found another data set[2] that has school data for pupil/teacher ratios and revenue data, which I suspect influence graduation rates.

school_districts <- read.csv("school_district_data.csv")

#Combine extra data from school_districts into main dataframe
data <- left_join(data, school_districts, by="leaid11")
data <- data %>% rename(subregion=County.Name)

#Add computed column for revenue per student since this is 
#probably more predictive that total revenue
data$Total.General.Revenue.Per.Student <- 

#Remove Infinities
data <- data %>% 

#Remove schools that have per student revenues of over the .999 quantile.
data <- data %>% 
  filter(Total.General.Revenue.Per.Student < 
           quantile(Total.General.Revenue.Per.Student, .999, na.rm=TRUE))

#Remove Schools that have over 100 pupils per teacher. Either these values 
#are mistaken, or these schools have atypical class environments and I don't 
#want them affecting my analysis.
data <- data %>% filter(Pupil.Teacher.Ratio<=100)

Defining the Hypothesis Variables

My initial hypothesis about the data is that one of three variables (or some combination thereof) has an effect of the Graduation Rate for a given school. The three variables I’m hypothesising to have an effect are

  • Number of Students per School
  • Pupil/Teacher Ratio
  • Revenue per Student

In the analysis and exploration that follows, I look primarily at these variables (henceforth referred to as the hypothesis variables). Although I will periodically look at other aspects of the data.

Univariate and Bivariate Exploration

The most important variable in my dataset is ALL_RATE, which is the percentage of students who graduated from each school (i.e., Graduation Rate). This is the most interesting variable because it’s the best way to operationalize success in terms of the U.S. education system teaching our young people. My intuition is that Number of Students Per School, Pupil/Teacher Ratio and Revenue Per Student are going to be contributing factors to the graduation rate, so I’m very interested in those, too.


The first thing I do is create histograms for these four variables. Additionally, I plot the histogram of the Revenue Per Student variable with a log tranformation on the x-axis to better observe that variable’s distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   73.00   86.00   77.94   92.76  100.00    1049

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    36.0   102.0   173.6   270.0  1550.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.89   13.21   15.18   15.66   17.52   86.10

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    9631   11330   13120   14290  129600

The Graduation Rate distribution is somewhat normal, though it’s obviously cut off at 100 since it’s impossible for a percentage to go over 100, and it has a long left tail.

The Number of Students Per School distribution appears to be exponential, which was surprising to me. I would have expected it to be normal as well, with only a few small schools, many average-sized schools, and a few large schools. However, it turns out that’s not the case. When the number of students is divided by the number of teachers to yield the Pupil Teacher Ratio, then the data does fit a normal distribution, though.

The Revenue Per Student distribution isn’t obvious from the linear histrogram, but transforming the x axis to a log scale reveals a close to normal distribution that is a little right skewed. So the Revenue Per Student distribution is approximately log-normal.


#Generate correlation matrix of hypothesis variables and Graduation Rate
data %>% 
         Total.General.Revenue.Per.Student) %>% 
         rev.per.stud=Total.General.Revenue.Per.Student) %>% 
##                grad.rate num.students   pt.ratio rev.per.stud
## grad.rate     1.00000000    0.2173667 -0.1186047  -0.03554802
## num.students  0.21736669    1.0000000  0.2879385  -0.10063393
## pt.ratio     -0.11860466    0.2879385  1.0000000  -0.39486433
## rev.per.stud -0.03554802   -0.1006339 -0.3948643   1.00000000

The correlations between the three hypothesis variables and graduation rate are not strong (less than 0.3 in magnitude).

The strongest of the three is the Number of Students per School, which was not at all my first guess for strongest predictor. A possible explanation for this is that having a lot of peers increases the pressure for doing well in school. Additionally, larger schools are more likely to be in urban areas, so there could be a lot of other confounding factors that come along with that trend.

The Pupil/Teacher Ratio is weaker still, but at least the direction of the correlation is to be expected. It’s a negative correlation, so as the number of students per teacher goes up, the graduation rate tends to go down. However, this is a very weak relationship.

Finally, the last variable on the list was the one I expected to be the strongest predictor of graduation rate. Revenue Per Student has a paltry correlation constant of -0.035. There must be other factors at play here, because there is a significant negative correlation between Revenue Per Student and Pupil/Teacher Ratio. The two variables have a -0.372 correlation, so as Revenue Per Student goes up, there are fewer students for every teacher, and having fewer students per teacher is somewhat negatively correlated with higher graduation rates, but that chain of logic isn’t strong enough for there to be a correlation between Revenue Per Student and Graduation Rate.

Data exploration is about uncovering knowledge, and that knowledge is valuable if it confronts our preconceived notions about how something works. Usually, one thinks that this means showing a relationship between features that was previously undiscovered. However, it could just as easily be a demonstration that no relationship exists where “common sense” would lead one to believe a relationship exists. Based on these low correlation values, it appears that none of my hypothesis variables are strong predictors for Graduation Rate. For the remainder of my analysis, I will work on exploring this lack of a relationship.

Boxplots (by state)

Next, I want to do some regional comparisons. I start out with box plots of the four variables we’ve looked at so far by state.