05
Multiple Linear
Regression


To this end, we use the statsmodels package in Python to build multiple linear regression models: one combining all input variables, one with interaction terms, and one polynomial regression.

The MLR approach uses all input variables, whereas the Interaction Terms approach uses only the two variables most correlated with the LMD soil moisture data product.
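One way to pick the two most correlated variables is to rank the predictors by the absolute value of their Pearson correlation with the target. The sketch below is illustrative only: the helper `top_correlated` and the synthetic `demo` table are assumptions, not part of the original workflow, and only the column names are borrowed from the models in this section.

```python
import numpy as np
import pandas as pd

def top_correlated(df, target, k=2):
    """Return the k numeric columns with the largest |Pearson r| against target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Synthetic stand-in data (hypothetical values, real column names)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "ts_amplitude": rng.normal(size=n),
    "ndvi": rng.normal(size=n),
    "backscatter": rng.normal(size=n),  # unrelated to the target by construction
})
demo["lmd_soilWetness"] = (
    -0.8 * demo["ts_amplitude"] + 0.5 * demo["ndvi"] + rng.normal(scale=0.3, size=n)
)

best_two = top_correlated(demo, "lmd_soilWetness")
print(best_two)
```

Ranking on the absolute value keeps strongly negative relationships, such as the one between temperature amplitude and soil moisture, in the selection.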

For the polynomial regression model, we consider only the temperature amplitude and look for a function that better fits the distribution of points in the scatter plot.

Stats
Models

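# Multiple Linear Regression
import statsmodels.formula.api as smf  # formula interface used by all models in this section

# `data` is the DataFrame holding the predictors and the LMD soil moisture column
model = smf.ols(formula='lmd_soilWetness ~ backscatter + emissivity_v + emissivity_h + ts_amplitude + ndvi', data=data)
MLR = model.fit()
MLR.summary()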

Interaction
Terms

# Interaction Terms
model = smf.ols(formula='lmd_soilWetness ~ ts_amplitude * ndvi', data=data)
terms = model.fit()
terms.summary()
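In the formula syntax used by statsmodels, `a * b` is shorthand for both main effects plus their product, `a + b + a:b`. The following sketch checks this on synthetic data; the `demo` table and its coefficients are made up for illustration and only the column names come from the model above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "ts_amplitude": rng.uniform(5, 40, 300),
    "ndvi": rng.uniform(0, 1, 300),
})
demo["lmd_soilWetness"] = (
    0.9
    - 0.02 * demo["ts_amplitude"]
    + 0.3 * demo["ndvi"]
    - 0.01 * demo["ts_amplitude"] * demo["ndvi"]
    + rng.normal(scale=0.02, size=300)
)

# Shorthand and explicit spelling of the same model
short = smf.ols("lmd_soilWetness ~ ts_amplitude * ndvi", data=demo).fit()
explicit = smf.ols(
    "lmd_soilWetness ~ ts_amplitude + ndvi + ts_amplitude:ndvi", data=demo
).fit()

term_names = list(short.params.index)
print(term_names)
```

The `ts_amplitude:ndvi` coefficient is the interaction term itself: it measures how the effect of one variable changes with the level of the other.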

Polynomial
Regression

# Polynomial Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl  # 'pl' alias as used in the plotting calls below

# Single power term with exponent -0.5; '0 +' removes the intercept
model = smf.ols(formula='lmd_soilWetness ~ 0 + I(ts_amplitude ** -0.5)', data=data)
poly = model.fit()
poly.summary()

# Plot the Model Result
x = pd.DataFrame({'ts_amplitude': np.linspace(data.ts_amplitude.min(), data.ts_amplitude.max(), 100)})
ax = data.plot.scatter('ts_amplitude', 'lmd_soilWetness', s=.5, color='k', alpha=0.2)
ax.plot(x.ts_amplitude, poly.predict(x), 'g-', label='Power n=-0.5 $R^2$=%.2f' % poly.rsquared, alpha=0.9)
ax.legend()  # the label is only drawn once the legend is requested

# Plot Fitted vs Residual
fig, ax = pl.subplots(figsize=(4, 4))
ax.scatter(poly.fittedvalues, poly.resid, c='k', s=1, alpha=0.5)

# Model Output into Dataframe
# 2.21 is presumably the fitted coefficient reported by poly.summary()
data['Poly_SM'] = 2.21 * data['ts_amplitude'] ** -0.5
# Cap the soil moisture index at 1; missing values stay missing
data['Poly_SM'] = data['Poly_SM'].clip(upper=1)


Polynomial
Regression

Both Multiple Linear Regression (MLR) and Interaction Terms fit the data poorly, with an R-squared of around 0.5, a value that is far from ideal.

However, a power function of temperature amplitude (exponent -0.5) does achieve a higher R-squared, in this case above 0.9, the highest among the linear regression techniques tested.

The result is shown in this figure, where we have plotted the curve corresponding to the polynomial regression model. This model seems to better capture the relationship between temperature amplitude and the LMD soil moisture data product.
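One caveat when comparing these R-squared values: the formula `~ 0 + ...` removes the intercept, and for models without a constant statsmodels documents `rsquared` as using the uncentered total sum of squares. That definition usually gives larger values than the centered one used for the models with an intercept. The sketch below computes both definitions by hand on synthetic data; the helper functions, the noise level and the coefficient 2.2 are illustrative assumptions, not results from the study.

```python
import numpy as np

def r2_uncentered(y, y_hat):
    """R-squared against sum(y**2), the definition for no-intercept models."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)

def r2_centered(y, y_hat):
    """R-squared against sum((y - mean(y))**2), the usual definition."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic stand-in for ts_amplitude and soil moisture (hypothetical values)
rng = np.random.default_rng(2)
x = rng.uniform(5.0, 40.0, 1000)
z = x ** -0.5
y = 2.2 * z + rng.normal(scale=0.1, size=x.size)

# Least-squares fit through the origin: y ~ beta * x**-0.5
beta = np.sum(z * y) / np.sum(z * z)
y_hat = beta * z

print("uncentered R2: %.2f" % r2_uncentered(y, y_hat))
print("centered R2:   %.2f" % r2_centered(y, y_hat))
```

Reporting the centered value alongside `poly.rsquared` would put the power model on the same footing as the MLR and Interaction Terms models.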

Fitted
vs. Residuals

Despite the significant improvement in R-squared with the polynomial regression model, the scatter plot of fitted values against residuals shows the same pattern as the one seen in the simple linear regression results.

The pattern emerges either way, as the non-linearity between the two variables is inherent to the observations, and therefore the pattern repeats itself with very little variation.
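A residual pattern of this kind can also be quantified rather than only inspected visually. A minimal sketch, assuming synthetic data and a hypothetical helper `binned_residual_means`: sort the residuals by fitted value, split them into equal-sized bins, and look at the mean residual per bin. Means that stay away from zero in a systematic way indicate structure the model has not captured.

```python
import numpy as np

def binned_residual_means(fitted, resid, bins=10):
    """Mean residual in equal-count bins ordered by fitted value."""
    order = np.argsort(fitted)
    chunks = np.array_split(np.asarray(resid)[order], bins)
    return np.array([chunk.mean() for chunk in chunks])

# Synthetic example: a straight line fitted to curved data (hypothetical values)
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 1000)
y = x ** 2 + rng.normal(scale=0.02, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
resid = y - fitted

means = binned_residual_means(fitted, resid)
print(np.round(means, 3))
```

In this synthetic case the bin means are positive at both ends and negative in the middle, the signature of an unmodelled curvature. With the fitted model from this section, `poly.fittedvalues` and `poly.resid` could be passed in the same way.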

Spatial
Distribution

The polynomial regression model captures the spatial distribution observed in the LMD data product, although with lower values overall.

High values are still found in the equatorial regions, northern Europe and North America, while the minima are located in tropical latitudes.

Difference
in Data Products

The polynomial regression model underestimates the soil moisture index in South America, particularly in the western part of the Amazon river basin. The same is observed in sub-Saharan Africa, in North America and in pockets of the Far East in Asia.

The model gives higher values than the LMD data product only in a few regions of the Amazon basin and in eastern parts of Africa.
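The comparison above can be expressed as a per-cell difference between the two columns. The sketch below assumes the difference is defined as model minus LMD, so negative values mean underestimation; that sign convention, the column name `SM_diff` and the four sample rows are illustrative assumptions.

```python
import pandas as pd

# Four hypothetical grid cells with both soil moisture estimates
demo = pd.DataFrame({
    "lmd_soilWetness": [0.80, 0.30, 0.55, 0.10],
    "Poly_SM": [0.60, 0.35, 0.40, 0.10],
})

# Model minus reference: negative where the model underestimates
demo["SM_diff"] = demo["Poly_SM"] - demo["lmd_soilWetness"]

# Share of cells where the model is below the LMD product
share_under = (demo["SM_diff"] < 0).mean()
print(demo)
print("share underestimated:", share_under)
```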