Wasserstein Distance as a Tool for Analyzing Large-Ensemble Datasets

Yuki Yasuda; Shoichiro Kido

10:15 AM - 10:30 AM

[ACG38-06] Wasserstein Distance as a Tool for Analyzing Large-Ensemble Datasets

*Yuki Yasuda¹, Shoichiro Kido² (1.Institute of Science Tokyo, 2.Japan Agency for Marine-Earth Science and Technology)

Keywords: Large-Ensemble Simulation, Variability, Information Theory, Wasserstein Distance, Optimal Transport Theory

The atmosphere-ocean system exhibits two types of variability: forced variability in response to external forcing (e.g., changes in radiative forcing), and internal variability arising from chaotic behavior due to nonlinearity. The relative dominance of these components depends on the spatiotemporal scale of interest [1]. To quantitatively assess the relative contributions of forced and internal variability, large-ensemble simulations (LE-simulations) using atmosphere-ocean general circulation models have been conducted extensively in recent decades [2]. However, most studies using LE-simulation data still rely on conventional analysis methods that assume Gaussian distributions for variability, while the underlying distributions could be non-Gaussian [3]. In this regard, Sane et al. [4] applied information theory to propose a measure of internal variability strength, g_Sane. While this metric is applicable to non-Gaussian distributions, it is not clear whether it is also applicable to heavy-tailed distributions. Here, we present a new indicator g_W based on the Wasserstein distance, which quantifies the distance between any two frequency distributions. We demonstrate that this new indicator may be better suited than g_Sane to the analysis of LE-simulation data.

Consider all ensemble members at a given location, where X represents a physical quantity and X_ave denotes its ensemble mean (see the top panels in the figure). In data analysis, X comprises the time series of all ensemble members, and X_ave is their ensemble-mean time series. Following Sane et al. [4], g_Sane is defined by Eq. (1) in the figure, where I(X:X_ave) is the mutual information representing the degree of nonlinear correlation between X and X_ave, and H(X) is the Shannon entropy quantifying the uncertainty in X. The indicator g_Sane ranges from 0 to 1, with higher values indicating greater inter-ensemble variability (i.e., weaker correlation between X and X_ave).
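As a rough illustration of how a histogram-based estimate of these quantities could look, the following Python sketch estimates the mutual information and Shannon entropy from binned data. Eq. (1) is not reproduced in this text, so the normalized form 1 - I(X:X_ave)/H(X) used here is an assumption chosen only to be consistent with the stated range and interpretation. The function names, the bin count n_bins, and the synthetic data are hypothetical and are not taken from Sane et al. [4].

```python
import numpy as np


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))


def g_sane_sketch(x, n_bins=20):
    """Histogram-based sketch of a g_Sane-like indicator.

    x has shape (n_members, n_time). ASSUMPTION: the indicator is taken as
    1 - I(X:X_ave) / H(X); the abstract gives Eq. (1) only in its figure.
    """
    x = np.asarray(x, dtype=float)
    x_ave = x.mean(axis=0)  # ensemble-mean time series
    # Pair every member value with the ensemble mean at the same time step.
    x_flat = x.ravel()
    x_ave_flat = np.broadcast_to(x_ave, x.shape).ravel()
    joint, _, _ = np.histogram2d(x_flat, x_ave_flat, bins=n_bins)
    h_x = shannon_entropy(joint.sum(axis=1))    # marginal entropy of X
    h_ave = shannon_entropy(joint.sum(axis=0))  # marginal entropy of X_ave
    h_joint = shannon_entropy(joint.ravel())
    mutual_info = h_x + h_ave - h_joint
    if h_x == 0.0:
        return 0.0  # X falls in a single bin: no uncertainty to explain
    return float(np.clip(1.0 - mutual_info / h_x, 0.0, 1.0))


# Hypothetical synthetic ensemble: a sinusoidal forced signal plus noise.
rng = np.random.default_rng(0)
t = np.arange(500)
x_demo = np.sin(2.0 * np.pi * t / 100.0) + 0.5 * rng.standard_normal((40, t.size))
for n_bins in (10, 20, 40):
    # The estimate changes with the bin count, which must be chosen by the user.
    print(n_bins, g_sane_sketch(x_demo, n_bins=n_bins))
```

The loop at the end evaluates the same data with three bin counts to show that the estimate depends on this choice.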

We propose a new indicator g_W defined by Eq. (2) in the figure, where X_med denotes the ensemble median of X. Here, X(q) denotes the q-th quantile of X (similarly for X_med(q)), and the integral represents the Wasserstein distance between the frequency distributions of X and X_med [5]. The normalization constant MAE represents the mean absolute error of X from its median. Like g_Sane, g_W ranges from 0 to 1 and is applicable to non-Gaussian distributions, with larger values indicating greater inter-ensemble variability (i.e., the system is more chaotic than deterministic). However, unlike g_Sane, g_W does not require additional parameters, such as bin widths, for its computation.
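A minimal sketch of how g_W could be evaluated from ensemble data is given below. Eq. (2) is not reproduced in this text, so the form g_W = (1/MAE) ∫_0^1 |X(q) - X_med(q)| dq, with MAE taken as the mean of |X - X_med| over all members and time steps, is an assumed reading of the description above. The function name g_w_sketch and the synthetic data are hypothetical.

```python
import numpy as np


def g_w_sketch(x):
    """Sketch of a g_W-like indicator for x of shape (n_members, n_time).

    ASSUMPTIONS: X is the pooled sample over members and time, X_med is the
    ensemble-median time series, and MAE is the mean of |X - X_med|.
    """
    x = np.asarray(x, dtype=float)
    n_members = x.shape[0]
    x_med = np.median(x, axis=0)      # ensemble median at each time step
    mae = np.mean(np.abs(x - x_med))  # normalization constant
    if mae == 0.0:
        return 0.0  # all members identical: fully deterministic
    # In one dimension the 1-Wasserstein distance is the integral of the
    # absolute difference between the two quantile functions. X has
    # n_members times as many samples as X_med, so repeating each sorted
    # median value n_members times aligns the two empirical quantile
    # functions exactly.
    x_sorted = np.sort(x.ravel())
    x_med_sorted = np.repeat(np.sort(x_med), n_members)
    w1 = np.mean(np.abs(x_sorted - x_med_sorted))
    return float(w1 / mae)


# Hypothetical synthetic ensemble: a sinusoidal forced signal plus noise.
rng = np.random.default_rng(0)
t = np.arange(600)
forced = np.sin(2.0 * np.pi * t / 120.0)
for noise_amp in (0.1, 1.0, 5.0):
    x_demo = forced + noise_amp * rng.standard_normal((40, t.size))
    print(noise_amp, g_w_sketch(x_demo))
```

Under these assumptions, pairing each member with the ensemble median at the same time step is itself a valid transport plan, so the Wasserstein distance cannot exceed the MAE, which is consistent with the stated range of 0 to 1. No bin width enters the computation.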

We evaluated both indicators by applying them to a simple toy model, following Sane et al. [4], in which the magnitude of internal variability (i.e., the degree of stochasticity) was prescribed. As internal variability decreased, the spread among the random variables decreased, and both indicators showed correspondingly lower values. We then applied both indicators to near-surface temperature data from the Community Earth System Model Large Ensemble (CESM LENS) 20th-century historical experiment with 40 ensemble members [2].
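The toy model of Sane et al. [4] is not specified in this abstract. The sketch below is a hypothetical stand-in that only illustrates the idea of prescribing the magnitude of internal variability: every member shares one forced signal, and independent Gaussian noise with amplitude noise_amp is added to each member. The sinusoidal forcing, the parameter values, and the function names are assumptions.

```python
import numpy as np


def toy_ensemble(noise_amp, n_members=40, n_time=500, seed=0):
    """Hypothetical toy ensemble of shape (n_members, n_time).

    ASSUMPTION: forced variability is a sinusoid shared by all members, and
    internal variability is independent Gaussian noise scaled by noise_amp.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_time)
    forced = np.sin(2.0 * np.pi * t / 100.0)
    internal = noise_amp * rng.standard_normal((n_members, n_time))
    return forced + internal


def mean_ensemble_spread(x):
    """Time mean of the ensemble standard deviation across members."""
    return float(np.mean(np.std(x, axis=0, ddof=1)))


# The spread across members shrinks as the prescribed noise amplitude decreases.
for noise_amp in (2.0, 1.0, 0.5, 0.0):
    print(noise_amp, mean_ensemble_spread(toy_ensemble(noise_amp)))
```

An indicator of internal variability strength would then be evaluated on each of these ensembles to check that it decreases together with the prescribed amplitude.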

The bottom panels in the figure show the spatial distributions of g_Sane and g_W for near-surface temperature around Japan. Both indicators take lower values over land and higher values over the ocean, which likely reflects the more deterministic nature of temperature over land and its more chaotic nature over the ocean. While g_Sane exhibits a patchy pattern that is somewhat difficult to interpret, g_W shows smooth transitions between land and ocean regions. This suggests that g_W may be more suitable for analyzing large-ensemble simulation results. In our presentation, we will also discuss the pros and cons of the new indicator and compare it in detail with other existing metrics.

[1] Hawkins and Sutton (2009), BAMS.
[2] Kay et al. (2015), BAMS.
[3] Franzke et al. (2020), Rev. Geophys.
[4] Sane et al. (2024), JGR Oceans.
[5] Peyré and Cuturi (2020), arXiv:1803.00567.

Presentation information

[A-CG38] Climate Variability and Predictability on Subseasonal to Centennial Timescales
