06
K-Means
Clustering


K-Means is one of the most popular clustering algorithms. It iterates between assigning each datapoint to its nearest cluster centre and recomputing the centres, looking to minimise the distance between the datapoints and the centre of their cluster for a pre-defined number of groups or clusters.
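The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration on toy data, not the implementation used in this work (which relies on scikit-learn); the function name `lloyd_kmeans`, the random initialisation from datapoints and the two-blob toy data are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means iterations: assign to nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct datapoints picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centre, shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New centre = mean of its points; keep the old centre if a cluster is empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centres stopped moving: converged
        centroids = new_centroids
    # Final assignment and within-cluster sum of squared distances
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Toy data: two well-separated blobs in 2-D (50 points each)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels, centroids, inertia = lloyd_kmeans(X, k=2)
```

The quantity returned as `inertia` is the sum of squared distances that each iteration tries to reduce.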

The algorithm converges consistently for different initializations. In this case, it is applied in a multi-dimensional feature space made up of backscatter, emissivities, temperature amplitude and NDVI.
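One possible way to check the sensitivity to initialization is to run K-Means with several seeds and compare the resulting partitions. The sketch below does this on synthetic data with the adjusted Rand index; the data, the number of clusters and the seeds are assumptions for illustration, and the agreement on the real dataset may differ.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data: three well-separated groups in five dimensions
centers = [[0.0] * 5, [10.0] * 5, [-10.0] * 5]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Reference partition from one seed; n_init=1 so each run uses a single initialization
reference = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X)

# Agreement of other seeds with the reference (1.0 means identical partitions,
# regardless of how the cluster labels are numbered)
scores = [
    adjusted_rand_score(
        reference,
        KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X),
    )
    for seed in range(1, 5)
]
```

The adjusted Rand index is used here because cluster numbering is arbitrary, so comparing raw labels between runs would be misleading.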

K-Means
Build

The input data for the K-Means algorithm is the combination of all input variables.

Once the gridpoints are assigned to clusters, the average value of the LMD soil moisture dataproduct over the gridpoints of each cluster is associated with that cluster.

Sklearn.cluster
K-Means

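from sklearn.cluster import KMeans

# Input data: the five variables used as clustering features
# (data is the pandas DataFrame holding all gridpoints)
features = data[['ts_amplitude', 'ndvi', 'backscatter', 'emissivity_v', 'emissivity_h']]

# K-Means parameters
n_clusters = 10
clusters = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit_predict(features)
data['clusters'] = clusters

# Assign values to clusters: mean LMD soil wetness of the gridpoints in each cluster
cluster_values = []
for i in range(n_clusters):
    value = data['lmd_soilWetness'].loc[data['clusters'] == i].mean()
    cluster_values.append(value)
data['Kmeans_SM'] = data['clusters'].apply(lambda x: cluster_values[x])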
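
The per-cluster averaging step can also be written with a pandas `groupby`, which avoids the explicit loop. The sketch below uses a small toy table as a stand-in for the real `data` frame; the values are invented for illustration, and only the column names follow the snippet above.

```python
import pandas as pd

# Toy stand-in for the real table: cluster labels and reference soil moisture
data = pd.DataFrame({
    "clusters": [0, 0, 1, 1, 2],
    "lmd_soilWetness": [0.10, 0.30, 0.50, 0.70, 0.90],
})

# Mean reference value per cluster, one row per cluster
cluster_means = data.groupby("clusters")["lmd_soilWetness"].mean()

# Same means broadcast back to every gridpoint of the cluster
data["Kmeans_SM"] = data.groupby("clusters")["lmd_soilWetness"].transform("mean")
```

Both forms give the same `Kmeans_SM` column; the `groupby` version also works unchanged if the number of clusters is modified.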

Clusters
Distribution

The ten pre-defined clusters are distributed in size as shown here. There are five main clusters with more than 4000 datapoints each, while the smallest contains around 1400 datapoints.
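
Cluster sizes such as those quoted above can be obtained by counting the labels. The sketch below uses a short invented label series rather than the real cluster assignments.

```python
import pandas as pd

# Invented labels standing in for the real cluster assignments
labels = pd.Series([0, 0, 0, 1, 1, 2], name="clusters")

# Number of datapoints per cluster, ordered by cluster index
sizes = labels.value_counts().sort_index()

# Fraction of all datapoints that falls in each cluster
share = sizes / sizes.sum()
```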

Overall, the datapoints are uniformly distributed across the full range of values provided by the LMD soil moisture dataproduct.

Spatial
Distribution

The K-Means clustering method captures the main patterns and regions of interest that the original LMD soil moisture dataproduct provides.

This shows that the input variables considered contain the information required to construct a similar dataproduct. While their relative weight in this construction remains unclear with this method, it paves the way for further analysis based on this dataset.
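
One way to put a number on how closely the cluster-based values follow the reference dataproduct is to compute a correlation and a root-mean-square error between the two columns. These metrics are an assumption of this sketch and are not reported in the original analysis; the toy values below are invented, and only the column names follow the build snippet.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table: reference values and cluster-based values
data = pd.DataFrame({
    "lmd_soilWetness": [0.10, 0.30, 0.50, 0.70, 0.90],
    "Kmeans_SM":       [0.20, 0.20, 0.60, 0.60, 0.90],
})

# Root-mean-square difference between the cluster-based and reference values
diff = data["Kmeans_SM"] - data["lmd_soilWetness"]
rmse = float(np.sqrt((diff ** 2).mean()))

# Pearson correlation between the two columns
corr = float(data["Kmeans_SM"].corr(data["lmd_soilWetness"]))
```

Because every gridpoint in a cluster receives the same value, such scores reflect how well the clusters separate the soil moisture range rather than gridpoint-level accuracy.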