10:15 AM - 10:30 AM
[ACG38-06] Wasserstein Distance as a Tool for Analyzing Large-Ensemble Datasets
Keywords: Large-Ensemble Simulation, Variability, Information Theory, Wasserstein Distance, Optimal Transport Theory
Consider all ensemble members at a given location, where X represents a physical quantity and Xave denotes its ensemble mean (see the top panels of the figure). In data analysis, X comprises the time series of all ensemble members, and Xave is the time series of their ensemble mean. Following Sane et al. [4], gSane is defined by Eq. (1) in the figure, where I(X:Xave) is the mutual information, which measures the degree of nonlinear correlation between X and Xave, and H(X) is the Shannon entropy, which quantifies the uncertainty in X. The indicator gSane ranges from 0 to 1, with higher values indicating greater inter-ensemble variability (i.e., weaker correlation between X and Xave).
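To make the role of the bin width concrete, the following minimal Python sketch estimates the mutual information and the Shannon entropy from a two-dimensional histogram. Eq. (1) itself appears only in the figure, so the normalization used here, 1 - I(X:Xave)/H(X), is an assumption chosen to match the description above (a value between 0 and 1 that grows as the correlation weakens). The function names `entropy_bits` and `g_info_sketch` and the default `bins=16` are hypothetical and are not taken from Sane et al. [4].

```python
import numpy as np


def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete probability array."""
    p = p[p > 0]  # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())


def g_info_sketch(ens, bins=16):
    """Histogram-based indicator for one location.

    ens  : array of shape (n_members, n_times)
    bins : number of histogram bins per axis (a free parameter)

    ASSUMPTION: returns 1 - I(X:Xave) / H(X); the exact form of
    Eq. (1) is not reproduced in the text.
    """
    ens = np.asarray(ens, dtype=float)
    n_members, _ = ens.shape

    # Pair every member sample X_i(t) with the ensemble mean Xave(t).
    x = ens.ravel()
    x_ave = np.tile(ens.mean(axis=0), n_members)

    # Plug-in probability estimates from a joint histogram.
    joint, _, _ = np.histogram2d(x, x_ave, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_ave = p_xy.sum(axis=0)

    h_x = entropy_bits(p_x)
    # I(X:Xave) = H(X) + H(Xave) - H(X, Xave)
    mi = h_x + entropy_bits(p_ave) - entropy_bits(p_xy.ravel())
    if h_x == 0.0:
        return 0.0  # arbitrary convention for a constant X
    return 1.0 - mi / h_x
```

Calling `g_info_sketch(ens, bins=8)` and `g_info_sketch(ens, bins=32)` on the same array generally gives different values, which illustrates why an estimator of this kind needs a binning choice.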
We propose a new indicator, gW, defined by Eq. (2) in the figure, where Xmed denotes the ensemble median of X. Here, X(q) denotes the q-th quantile of X (and similarly for Xmed(q)), and the integral represents the Wasserstein distance between the frequency distributions of X and Xmed [5]. The normalization constant MAE is the mean absolute error of X relative to its median. Like gSane, gW ranges from 0 to 1 and is applicable to non-Gaussian distributions, with larger values indicating greater inter-ensemble variability (i.e., the system is more chaotic than deterministic). Unlike gSane, however, gW requires no additional parameters, such as bin widths, for its computation.
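The quantile-based computation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: Eq. (2) appears only in the figure, so the sketch assumes gW is the one-dimensional Wasserstein-1 distance, i.e. the integral of |X(q) - Xmed(q)| over q from 0 to 1, divided by MAE. The function name `g_w_sketch` is hypothetical. Repeating Xmed once per member gives both samples the same size, so the quantile integral reduces to the mean absolute difference of the sorted samples.

```python
import numpy as np


def g_w_sketch(ens):
    """Wasserstein-based indicator for one location.

    ens : array of shape (n_members, n_times)

    ASSUMPTION: returns W1(X, Xmed) / MAE; the exact form of
    Eq. (2) is not reproduced in the text.
    """
    ens = np.asarray(ens, dtype=float)
    n_members, _ = ens.shape

    # Ensemble median time series Xmed(t).
    x_med = np.median(ens, axis=0)

    # MAE: mean absolute deviation of every member from the median.
    mae = np.mean(np.abs(ens - x_med))
    if mae == 0.0:
        return 0.0  # arbitrary convention when all members coincide

    # Empirical quantile functions: sort both samples. Xmed is repeated
    # n_members times so that the two samples have equal size.
    x_sorted = np.sort(ens.ravel())
    med_sorted = np.sort(np.tile(x_med, n_members))

    # 1-D Wasserstein-1 distance = mean |X(q) - Xmed(q)| over quantiles.
    w1 = np.mean(np.abs(x_sorted - med_sorted))
    return float(w1 / mae)
```

Under this assumed form, the bound gW ≤ 1 follows because pairing sorted values is the optimal coupling in one dimension, whereas MAE corresponds to the particular coupling that pairs X and Xmed at the same time step. No bin width or other tuning parameter enters the computation.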
We evaluated both indicators by applying them to a simple toy model, following Sane et al. [4], in which the magnitude of internal variability (i.e., the degree of stochasticity) was prescribed. As internal variability decreased, the dispersion among the random variables decreased, and both indicators showed correspondingly lower values. We then applied both indicators to near-surface temperature data from the Community Earth System Model Large Ensemble (CESM LENS) 20th-century historical experiment with 40 ensemble members [2].
The bottom panels of the figure show the spatial distributions of gSane and gW for near-surface temperature around Japan. Both indicators are lower over land and higher over the ocean, which likely reflects the more deterministic nature of temperature over land and its more chaotic nature over the ocean. While gSane exhibits a patchy pattern that is somewhat difficult to interpret, gW shows smooth transitions between land and ocean regions. This suggests that gW may be more suitable for analyzing large-ensemble simulation results. In our presentation, we will also discuss the pros and cons of the new indicator and compare it in detail with other existing metrics.
[1] Hawkins and Sutton (2009), BAMS.
[2] Kay et al. (2015), BAMS.
[3] Franzke et al. (2020), Rev. Geophys.
[4] Sane et al. (2024), JGR Oceans.
[5] Peyré and Cuturi (2020), arXiv:1803.00567.