Polynomial Regression
Having shown the poor results of the simple linear regression, we test more complex regression techniques to try to produce a more accurate soil moisture dataproduct.
05 Multiple Linear Regression
To this end, we use the statsmodels package in Python to fit three kinds of models: a multiple linear regression (MLR) combining all input variables, a model with interaction terms, and a polynomial regression. The MLR approach considers all variables, whereas the Interaction Terms approach is built from the two variables most correlated with the LMD soil moisture dataproduct. The polynomial regression model considers only the temperature amplitude, with the aim of finding a function that better fits the distribution of points in the scatter plot.
StatsModels
# Multiple Linear Regression
import statsmodels.formula.api as smf

model = smf.ols(formula='lmd_soilWetness ~ backscatter + emissivity_v + emissivity_h + ts_amplitude + ndvi', data=data)
MLR = model.fit()
MLR.summary()
Interaction Terms
# Interaction Terms
model = smf.ols(formula='lmd_soilWetness ~ ts_amplitude * ndvi', data=data)
terms = model.fit()
terms.summary()
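For reference, in the formula syntax used by statsmodels, `a * b` is shorthand for `a + b + a:b`, so the interaction model above should correspond to the equation below. The coefficient symbols and the error term are notation introduced here for illustration, not taken from the original analysis.

```latex
\mathrm{lmd\_soilWetness} = \beta_0
  + \beta_1\,\mathrm{ts\_amplitude}
  + \beta_2\,\mathrm{ndvi}
  + \beta_3\,(\mathrm{ts\_amplitude} \times \mathrm{ndvi})
  + \varepsilon
```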
Polynomial Regression
# Polynomial Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl

model = smf.ols(formula='lmd_soilWetness ~ 0 + I(ts_amplitude ** -0.5)', data=data)
poly = model.fit()
poly.summary()

# Plot the Model Result
x = pd.DataFrame({'ts_amplitude': np.linspace(data.ts_amplitude.min(), data.ts_amplitude.max(), 100)})
ax = data.plot.scatter('ts_amplitude', 'lmd_soilWetness', s=.5, color='k', alpha=0.2)
ax.plot(x.ts_amplitude, poly.predict(x), 'g-', label='Poly n=-0.5 $R^2$=%.2f' % poly.rsquared, alpha=0.9)
ax.legend()

# Plot Fitted vs Residual
fig, ax = pl.subplots(figsize=(4, 4))
ax.scatter(poly.fittedvalues, poly.resid, c='k', s=1, alpha=0.5)

# Model Output into Dataframe (coefficient hard-coded as 2.21, values capped at 1)
data['Poly_SM'] = (2.21 * data['ts_amplitude'] ** -0.5).clip(upper=1)
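The last step above, turning the fitted curve into a capped soil moisture estimate, can be sketched without pandas as follows. This is a minimal illustration: the function name `poly_soil_moisture` is hypothetical, and the coefficient 2.21 and the cap at 1 are copied from the code above.

```python
def poly_soil_moisture(ts_amplitude, coef=2.21, cap=1.0):
    """Evaluate coef * ts_amplitude**-0.5 and cap the result at `cap`.

    `coef` defaults to the 2.21 hard-coded in the original snippet.
    """
    if ts_amplitude <= 0:
        raise ValueError("ts_amplitude must be positive for a -0.5 exponent")
    return min(coef * ts_amplitude ** -0.5, cap)


# Amplitude below which the uncapped curve exceeds the cap of 1: coef**2
cap_threshold = 2.21 ** 2

low_amplitude = poly_soil_moisture(4.0)    # 2.21 * 0.5 = 1.105, capped to 1.0
high_amplitude = poly_soil_moisture(16.0)  # 2.21 * 0.25 = 0.5525, below the cap
```

With this coefficient, any temperature amplitude below about 4.88 (in the units of `ts_amplitude`) is mapped to the cap.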
Polynomial Regression
Both Multiple Linear Regression (MLR) and Interaction Terms show a poor correlation between input and output data, with an r-squared of around 0.5, a value that is far from ideal. However, a power function of the temperature amplitude (exponent -0.5) achieves a higher r-squared, in this case above 0.9, the highest among the linear regression techniques tested. The result is shown in this figure, where we have plotted the curve corresponding to the polynomial regression model. This model seems to capture the relationship between temperature amplitude and the LMD soil moisture dataproduct better.
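One point worth checking when comparing these r-squared values: the polynomial formula drops the intercept (`0 +`), and statsmodels documents that `rsquared` is computed against the uncentered total sum of squares when the constant is omitted. The sketch below uses synthetic data (not the LMD dataset; all names and numbers are illustrative) and plain Python to show that, for the same fit, the uncentered value is never lower than the centered one, so the two kinds of r-squared may not be directly comparable.

```python
import random

random.seed(0)

# Synthetic stand-in data: y roughly follows 2 * x**-0.5 plus noise (illustrative only)
x = [random.uniform(5.0, 40.0) for _ in range(500)]
y = [2.0 * xi ** -0.5 + random.gauss(0.0, 0.1) for xi in x]

# Least-squares fit of y = b * z through the origin, with z = x**-0.5
z = [xi ** -0.5 for xi in x]
b = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)

# Residual sum of squares of the fitted curve
ssr = sum((yi - b * zi) ** 2 for zi, yi in zip(z, y))

# Total sums of squares: around zero (uncentered) and around the mean (centered)
y_mean = sum(y) / len(y)
tss_uncentered = sum(yi ** 2 for yi in y)
tss_centered = sum((yi - y_mean) ** 2 for yi in y)

r2_uncentered = 1.0 - ssr / tss_uncentered  # convention for models without a constant
r2_centered = 1.0 - ssr / tss_centered      # convention for models with a constant

print(f"b = {b:.3f}")
print(f"uncentered R^2 = {r2_uncentered:.3f}")
print(f"centered   R^2 = {r2_centered:.3f}")
```

On the fitted model itself, the same comparison could be made by recomputing the centered value from `poly.ssr` and `poly.centered_tss`.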
Fitted vs. Residuals
Despite the significant improvement in the r-squared value with the polynomial regression model, the scatter plot of fitted values against residuals shows the same pattern as in the simple linear regression results. The pattern emerges either way because the non-linearity between the two variables is inherent to the observations, and it therefore repeats itself with very little variation.
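One way to make the residual pattern described above quantitative is to average the residuals within bins of fitted value: for a well-specified model the bin means should hover around zero, while a systematic trend indicates remaining structure. The sketch below is illustrative only; the helper `binned_mean_residuals` and the sample numbers are hypothetical, and with the fitted model one would pass `poly.fittedvalues` and `poly.resid` instead.

```python
def binned_mean_residuals(fitted, resid, n_bins=4):
    """Return (bin_center, mean_residual) pairs over equal-width bins of fitted values."""
    lo, hi = min(fitted), max(fitted)
    if hi == lo:
        raise ValueError("fitted values must span a non-zero range")
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for f, r in zip(fitted, resid):
        # Clamp so the maximum fitted value falls in the last bin
        i = min(int((f - lo) / width), n_bins - 1)
        sums[i] += r
        counts[i] += 1
    # Skip empty bins to avoid dividing by zero
    return [
        (lo + (i + 0.5) * width, sums[i] / counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Made-up values with a clear trend: residuals rise as fitted values grow
fitted = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
resid = [-0.2, -0.1, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2]
trend = binned_mean_residuals(fitted, resid, n_bins=2)
```

In this toy input the lower bin has a negative mean residual and the upper bin a positive one, which is the kind of systematic structure the scatter plot reveals visually.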
Spatial Distribution
The polynomial regression model captures the spatial distribution observed in the LMD dataproduct, although with lower values overall. High values are still found in the equatorial regions, northern Europe and North America, while the minima are located in tropical latitudes.
Difference in Dataproducts
The polynomial regression model underestimates the soil moisture index in South America, particularly in the western part of the Amazon river basin. Underestimation is also observed in sub-Saharan Africa, in North America and in pockets of the Far East in Asia. Higher values are observed only in a few regions of the Amazon basin and in eastern parts of Africa.
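The comparison above can be read as model minus reference, so that negative values mark underestimation. The following is a minimal sketch, assuming the difference is taken as `Poly_SM - lmd_soilWetness` (the sign convention is an assumption, since the text does not state it) and using made-up values rather than the real dataproducts; the helper name `difference_summary` is hypothetical.

```python
def difference_summary(model_values, reference_values):
    """Summarise model-minus-reference differences for paired values."""
    diffs = [m - r for m, r in zip(model_values, reference_values)]
    n = len(diffs)
    return {
        "mean_bias": sum(diffs) / n,
        "fraction_underestimated": sum(d < 0 for d in diffs) / n,
    }


# Made-up paired values standing in for Poly_SM and lmd_soilWetness
poly_sm = [0.30, 0.45, 0.50, 0.70]
lmd_sm = [0.40, 0.55, 0.45, 0.90]
summary = difference_summary(poly_sm, lmd_sm)
```

A negative `mean_bias` together with a large `fraction_underestimated` would be consistent with the overall underestimation described above; applied per region, the same summary would indicate where the model departs most from the LMD dataproduct.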