# Estimating greenhouse gas emissions from EPA monitored sites with interpolation

#### Research Question

When we take measurements at certain points, we often want to know the likely values in areas where we have no measurements. Interpolation answers that question, and it rests on a second-order process. First-order and second-order point pattern analysis differ in how they treat the relationship between points. In a first-order process, we assume that points are equally likely to occur anywhere in the study area, which in turn assumes that the points are independent of each other. In a second-order process, we drop that independence assumption: if one point has a certain value, nearby points are more likely to have similar values than points farther away. Our null hypothesis is therefore that the points exert no influence on the points around them, so that we would expect a completely random distribution of values, or complete spatial randomness (CSR).

Below we analyze the methods and results of the Nearest Neighbor and Inverse Distance Weighting analyses and give an overview of a third method, Kriging. Nearest Neighbor and Inverse Distance Weighting are based on distance or area and are purely deterministic, so they offer no way of estimating the error or probability of their predictions. Kriging not only estimates values between the measured points but also provides an estimate of the error in those values. It does so by fitting a spatial statistical model to the points, which makes it a probabilistic method.

#### Measures

Our points are 2015 greenhouse gas emissions from EPA-monitored sites, where each site lists its emissions in metric tons of carbon. There are 89 sites, with emissions readings between 0 and 4,184,267. Interestingly, the refinery that caught fire this weekend is the second-highest emitter of greenhouse gases, at 3,110,037 metric tons of carbon, and it narrowly avoided a dangerous release of hydrofluoric acid during this most recent fire. Our study area covers four Mid-Atlantic states: Pennsylvania, Delaware, New Jersey, and Maryland. Most of the points we have readings for lie along the Delaware River and in the surrounding areas.

#### Nearest Neighbor

In the analysis below, the study area has been tessellated into polygons, each containing the locations that are closer to one measured point than to any other. These are called Thiessen polygons. Where the points are dense the polygons are small, because the distances between neighboring points are short. Around the edges the polygons are effectively unbounded, since there is no further known point to close them off. After the Thiessen polygons are created, we mask the surface with the outline of our states, so that we keep only the values for the geographic area we are studying. Nearest Neighbor then assigns every unmeasured location the value of its closest measured point. Below we can see the small areas closest to the points that have slightly higher estimated values, but the vast majority of our area contains so few points, and lies so far from the measured points, that the predicted greenhouse gas emissions are almost zero.
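The assignment rule behind Thiessen polygons can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used to produce the map: the three sites, their coordinates, and their values are hypothetical, the function name `nearest_neighbor_estimate` is made up for this example, and the coordinates are assumed to be in a projected (planar) system so that straight-line distance is meaningful.

```python
import math


def nearest_neighbor_estimate(sites, location):
    """Return the value of the measured site closest to `location`.

    `sites` is a list of ((x, y), value) pairs. Coordinates are assumed to be
    planar, so Euclidean distance is a reasonable measure of closeness.
    """
    _, value = min(sites, key=lambda site: math.dist(site[0], location))
    return value


# Hypothetical sites (not the EPA data): (x, y) in km, emissions in metric tons.
sites = [
    ((0.0, 0.0), 100.0),
    ((10.0, 0.0), 4000.0),
    ((0.0, 10.0), 250.0),
]

# Fill a coarse 3 x 3 grid. Every cell inherits the value of its nearest site,
# which is the same as asking which Thiessen polygon the cell falls in.
xs = (1.0, 4.0, 9.0)
ys = (1.0, 4.0, 8.0)
grid = [[nearest_neighbor_estimate(sites, (x, y)) for x in xs] for y in ys]

for row in grid:
    print(row)
```

Because each cell simply copies one site's value, the resulting surface is a patchwork of flat regions with abrupt jumps at the polygon boundaries.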

#### Inverse Distance Weighting

In this method, a raster grid is created that covers the area we want to estimate values for. Each cell of the grid is then assigned a value estimated from the known points, with each point weighted by the inverse of its distance to the cell. The assumption is that each predicted value is influenced more by nearby points than by distant ones. The result depends on the power applied to the inverse distance: a power of 4, for example, makes a point's influence decay faster with distance than a power of 2. Like Nearest Neighbor, this method isn’t probabilistic, so the uncertainty of the estimates can’t be quantified from the model itself. Their accuracy has to be checked by taking more measurements.
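To make the role of the power concrete, here is a minimal Python sketch of an inverse distance weighted estimate at a single location, where each site gets the weight 1 / distance^power and the estimate is the weighted average of the site values. The sites and values are hypothetical, the function name `idw_estimate` is invented for this example, and the code weights every site rather than a limited neighborhood, which may differ from the settings used for the map.

```python
import math


def idw_estimate(sites, location, power=2.0):
    """Inverse distance weighted average of site values at `location`.

    `sites` is a list of ((x, y), value) pairs in planar coordinates.
    Each site is weighted by 1 / distance**power.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for coords, value in sites:
        distance = math.dist(coords, location)
        if distance == 0.0:
            # The location coincides with a measured site: return its value
            # directly and avoid dividing by zero.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical sites (not the EPA data): (x, y) in km, emissions in metric tons.
sites = [
    ((0.0, 0.0), 100.0),
    ((10.0, 0.0), 4000.0),
    ((0.0, 10.0), 250.0),
]

# A location close to the low-emitting site at the origin.
target = (2.0, 1.0)
estimate_p2 = idw_estimate(sites, target, power=2.0)
estimate_p4 = idw_estimate(sites, target, power=4.0)

print(f"power 2: {estimate_p2:.1f}")
print(f"power 4: {estimate_p4:.1f}")
```

With the higher power, the distant high-emitting site loses influence faster, so the estimate at `target` sits closer to the value of the nearest site.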

#### Regression of coordinates

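Is there a significant relationship between the data and the geographical coordinates? No. The p-value for the overall regression (0.4289) is not significant, and neither coordinate is significant on its own (p = 0.571 for X and p = 0.198 for Y), so we can’t say that geography alone is predictive of the changes in metric tonnes of carbon.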

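Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -568071 | -326112 | -224745 | -42966 | 3719087 |

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 25208998.428 | 19042456.307 | 1.324 | 0.189 |
| `ghg$X` | 2.039 | 3.587 | 0.568 | 0.571 |
| `ghg$Y` | -5.853 | 4.508 | -1.298 | 0.198 |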

Residual standard error: 710500 on 86 degrees of freedom
Multiple R-squared: 0.0195, Adjusted R-squared: -0.003306
F-statistic: 0.855 on 2 and 86 DF, p-value: 0.4289
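
As a sanity check on the summary figures, the adjusted R-squared and the F-statistic can both be recovered from the multiple R-squared, the number of observations (89 sites), and the number of predictors (2 coordinates). The Python sketch below uses the standard formulas. The function names are made up for this example, and because the input R-squared is the rounded value 0.0195, the results agree with the reported ones only to about three significant digits.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    residual_df = n_obs - n_predictors - 1
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / residual_df


def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of the regression, computed from R-squared."""
    residual_df = n_obs - n_predictors - 1
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / residual_df
    return explained / unexplained


# Values taken from the regression summary: 89 sites, 2 predictors (X and Y),
# leaving 86 residual degrees of freedom, and a rounded R-squared of 0.0195.
n_obs, n_predictors, r_squared = 89, 2, 0.0195

adj_r2 = adjusted_r_squared(r_squared, n_obs, n_predictors)
f_stat = f_statistic(r_squared, n_obs, n_predictors)

print(f"adjusted R-squared: {adj_r2:.4f}")
print(f"F-statistic: {f_stat:.3f}")
```

A negative adjusted R-squared, as here, means the two coordinates explain less variance than would be expected from adding two arbitrary predictors, which is consistent with the non-significant p-value.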