Estimating American Indian/Alaska Native Population Using R and Linear Regression

By: Cleve Davis

In this exercise we will estimate the 2020 American Indian/Alaska Native population using linear regression from existing population estimates made by the Bureau of Indian Affairs (BIA). The dataset can be found here.  Before making service population predictions using OLS regression we need to program R to read the csv file.  This code allows you read it directly from the web. Once R can read the data we will examine it with a simple scatter plot.

###Code to read csv file from web
x<- read.csv(url("https://nativesci.com/wp-content/uploads/2020/03/aian.pops_.csv"))
###Basic scatterplot
plot(x$Year,x$Pop)

Although it is nice to have a scatter plot, it needs some work. One issue with our basic plot is that it is difficult to read the y-axis values. It would be better to have the y-axis values in the millions and change the corresponding label to “Millions”. We also want to change the x label to “Year”.

###Transform y-axis values to millions and change x- and y-axis labels
plot(x$Year,x$Pop/1000000,ylab="Millions",xlab="Year")

Now we want to

###Change symbology add title and control axis tick distance
plot(x$Year,x$Pop/1000000,ylab="Millions", xlab="",
	pch=19, cex=1.5, lwd=2, col="black",cex.lab=1.5,cex.main=3,ylim=c(1.2,2.5) )
title(main="AI/AN Population", xlab="Year", line=2.5, cex.lab=1.3)
###Create space on left and add regression line 
par(oma = c(0, 2, 0, 3))
plot(x$Year,x$Pop/1000000,ylab="Millions", xlab="",
	pch=19, cex=1.5, lwd=2, col="black",cex.lab=1.5,cex.main=3,ylim=c(1.2,2.5) )
title(main="AI/AN Population", xlab="Year", line=2.5, cex.lab=1.3)
abline(lm(x$Pop/1000000~x$Year),col="red",lwd=2)

There we go! But we want to get the estimated population size using the linear regression model. That way we don’t have to guess what the actual value is for 2020 using the graphic. Using our linear model we can see that our model estimates there are 2,452,156 AI/AN.

###Get population data from lm
fit<-lm(Pop~Year,data=x)
summary(fit)
new<-data.frame(Year=seq(1995,2020,1))
pred<-predict(fit,type="response",newdata=new)
x2<-cbind(x,pred)
x2

Leave a Reply

Your email address will not be published. Required fields are marked *