Sensitivity Analysis
A preliminary statistical analysis, performed with the statsmodels package, reveals patterns and correlations among the variables in the satellite dataset.
04 Simple Linear Regression
This section performs four basic statistical steps. First, we plot the scatter matrix of two of the most correlated variables, temperature amplitude and LMD soil moisture. Next, we place this correlation in context using the correlation matrix of all variables. We then fit a simple linear regression, and finally we plot the fitted values against the residuals to check for non-linearity.
statsmodels
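    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    # `data` is the satellite DataFrame loaded earlier

    # Plot the scatter matrix (pandas creates the grid of axes itself)
    axes = pd.plotting.scatter_matrix(data[['ts_amplitude', 'lmd_soilWetness']])

    # Correlation matrix
    data.corr(min_periods=3)

    # Simple linear regression with an explicit intercept column
    X = pd.DataFrame({'intercept': np.ones(data.shape[0]),
                      'TS_amplitude': data['ts_amplitude']})
    y = data['lmd_soilWetness']
    model = sm.OLS(y, X)
    results = model.fit()
    results.summary()
    results.params

    # Plot fitted values vs residuals on a fresh figure
    fig, ax = plt.subplots()
    ax.scatter(results.fittedvalues, results.resid, c='k', s=1, alpha=0.2)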
Scatter Matrix
The scatter matrix for temperature amplitude and the LMD soil moisture data product does not reveal a clear linear relationship between the two variables. There is, however, a negative trend: the higher the temperature amplitude, the lower the soil moisture index from the LMD data product. This observation has to be backed by statistical analysis that quantifies the relationship.
Correlation Matrix
The pandas package allows a quick estimation of the correlation coefficients between all variables in the dataset. We observe in this diagram a generally poor correlation between horizontal emissivity and the rest of the variables, and a high, yet negative, correlation between temperature amplitude and the LMD data product. Since the aim is to estimate soil moisture from these signals, we focus on the relationship between temperature amplitude and soil moisture.
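As a rough illustration of how such a matrix can be read programmatically, the sketch below ranks every variable by the strength of its correlation with the soil moisture column. The data are synthetic stand-ins and the column name `emissivity_h` is a hypothetical placeholder. Only `ts_amplitude` and `lmd_soilWetness` come from the code above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the satellite table; all values are illustrative only.
rng = np.random.default_rng(0)
n = 500
ts_amplitude = rng.uniform(5.0, 40.0, n)
demo = pd.DataFrame({
    "ts_amplitude": ts_amplitude,
    # Built to fall as the amplitude rises, plus noise.
    "lmd_soilWetness": 0.84 - 0.01 * ts_amplitude + rng.normal(0.0, 0.05, n),
    # Hypothetical column, unrelated to the other two by construction.
    "emissivity_h": rng.normal(0.9, 0.02, n),
})

# Pearson correlation between every pair of columns.
corr = demo.corr(min_periods=3)

# Correlation of each variable with the target, strongest magnitude first.
target = "lmd_soilWetness"
with_target = corr[target].drop(target)
ranked = with_target.reindex(with_target.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations, such as the one between amplitude and soil moisture, at the top of the list instead of at the bottom.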
Simple Linear Regression
Using the statsmodels package in Python, we can fit a model that assumes a linear relationship between an input and an output variable. In this case, we obtain an intercept of 0.84 and a slope of -0.01, with an R-squared of 0.471. This R-squared value is rather low: the model explains less than half of the variance in soil moisture, so it is not very representative of the relationship between input and output.
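To make the meaning of the R-squared value concrete, the following sketch fits a straight line by least squares and computes R-squared by hand as one minus the ratio of residual to total sum of squares. For a model with an intercept, this is the same quantity that statsmodels reports as `results.rsquared`. The data are synthetic: the intercept and slope are borrowed from the fit above, and the noise level is an arbitrary assumption, so the printed numbers will not match the real dataset.

```python
import numpy as np

# Synthetic data: a line with intercept 0.84 and slope -0.01 plus noise.
# The noise level (0.1) is an arbitrary assumption for illustration.
rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(5.0, 40.0, n)
y = 0.84 - 0.01 * x + rng.normal(0.0, 0.1, n)

# Design matrix with an explicit intercept column, as in the OLS call.
X = np.column_stack([np.ones(n), x])

# Least-squares estimate of [intercept, slope].
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

fitted = X @ beta
resid = y - fitted

# R-squared: share of the variance in y explained by the fitted line.
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"intercept={intercept:.3f} slope={slope:.4f} R^2={r_squared:.3f}")
```

Even though the synthetic data follow an exactly linear law, the noise alone keeps R-squared well below 1. A modest R-squared can therefore come from scatter, from a misspecified functional form, or from both, which is why the residual plot is inspected next.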
Fitted vs. Residuals
Plotting the fitted values against the residuals for each data point reveals the degree of non-linearity between the input and output variables. Ideally, this plot shows a random distribution of points without any particular pattern, which would indicate that the relationship is approximately linear and that simple linear regression (SLR) is adequate. However, we observe a clear trend in the distribution of the data points, with sharp boundaries and a downward pattern in the overall plot. This suggests applying transformations to the input variables in the model, as described in the next sections.
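A numerical companion to the visual check is to average the residuals within bins of the fitted values. For a well-specified linear model the bin means stay near zero, while a systematic pattern shows up as means that drift away from zero. The sketch below is a minimal example under stated assumptions: the data are synthetic with a deliberately curved (quadratic) ground truth, and the choice of three bins is arbitrary. It is not the diagnostic used on the satellite data.

```python
import numpy as np

# Synthetic data with a curved ground truth, so a straight line is misspecified.
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = x ** 2 + rng.normal(0.0, 0.02, n)

# Fit a straight line with an intercept by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Split the points into three equal-sized groups by fitted value
# and average the residuals within each group.
order = np.argsort(fitted)
low, mid, high = np.array_split(resid[order], 3)
bin_means = [low.mean(), mid.mean(), high.mean()]
print(bin_means)
```

In this example the outer bins have positive mean residuals and the middle bin a negative one, the signature of curvature that the straight line cannot capture. The overall mean of the residuals is still zero, which is why the pattern is only visible when the residuals are examined against the fitted values.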