Area of Applicability for Deep Learning: Exploring Latent Space Geometry of Earth Observation Models

Darius A. Görgen
University of Münster

2026-04-09

Problem setting

Problem setting

In the real world, models might be used for inference on samples that lie outside their domain of application.

Figure 1: Example of overconfident predictions for OOD samples (Hou 2023).

Neural network-based classifiers may silently fail when the test data distribution differs from the training data. For critical tasks such as medical diagnosis or autonomous driving, it is thus essential to detect incorrect predictions based on an indication of whether the classifier is likely to fail.
Jaeger et al. (2023)

Problem setting

We learn a model that maps from input space to output space, \(f: \mathcal{X} \to \mathcal{Y}\), and combine it with a confidence score function \(s: \mathbb{R}^d \to \mathbb{R}\).

Combined with a threshold \(\lambda\), we induce a decision function \(g\):

\[ g_{\lambda}(x|s) = \mathbb{1}[s(x) > \lambda]. \tag{1}\]

We can now either evaluate whether \(g\) successfully detects OOD samples: \[ x \; \text{is} \; \begin{cases} \text{OOD}, \; \text{if} \; g(x) = 1, \\ \text{ID}, \; \text{if} \; g(x) = 0, \end{cases} \tag{2}\]

or evaluate whether the combined system \((f,g)\) reduces the model’s risk on the non-rejected samples:

\[ (f,g)(x) = \begin{cases} reject, \; \text{if} \; g(x) = 1, \\ f(x), \; \text{if} \; g(x) = 0. \end{cases} \tag{3}\]
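Equations 1–3 can be sketched in a few lines of numpy, with a KNN-based dissimilarity score standing in for \(s\) (all names, the toy data, and the threshold value are illustrative, not taken from the study):

```python
import numpy as np

def s(x, train_feats, k=5):
    # Dissimilarity score: mean Euclidean distance from x to its k
    # nearest neighbours in the training features (larger = less familiar).
    d = np.linalg.norm(train_feats - x, axis=1)
    return np.sort(d)[:k].mean()

def g(x, train_feats, lam, k=5):
    # Decision function g_lambda(x|s) = 1[s(x) > lambda]; 1 = reject/OOD.
    return int(s(x, train_feats, k) > lam)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))   # toy latent features
x_id = rng.normal(0.0, 1.0, size=8)           # sample from the same distribution
x_ood = rng.normal(8.0, 1.0, size=8)          # sample far from the training data
lam = 3.0                                     # threshold chosen for illustration
print(g(x_id, train, lam), g(x_ood, train, lam))
```

In the selective-prediction view of Equation 3, \(g(x) = 1\) triggers rejection and \(g(x) = 0\) falls through to \(f(x)\).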

Area of Applicability

  • Area of Applicability (AOA) defines the domain in which a model can be trusted to operate within its estimated performance measures

  • Prior art in shallow machine learning: dissimilarity-index-based AOA estimation for spatial prediction models (Meyer and Pebesma 2021; Schumacher et al. 2025)

  • How to translate this to deep learning models?

    • inputs are often high-dimensional and complex (images, time series, etc.)
    • input space distances ignore the model’s internal representation of learned (spatial) features
    • CNNs learn compact and task-relevant latent representations of the input data
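How the compact latent vector is extracted is not spelled out on this slide; a common choice (an assumption here, not necessarily what the study does) is global average pooling of the final convolutional feature map:

```python
import numpy as np

def global_avg_pool(feature_map):
    """Collapse an (H, W, C) convolutional feature map into a
    C-dimensional latent vector by averaging over the spatial axes."""
    return feature_map.mean(axis=(0, 1))

# Toy feature map shaped like a typical ResNet final activation.
fmap = np.random.default_rng(0).normal(size=(7, 7, 512))
z = global_avg_pool(fmap)
print(z.shape)  # (512,)
```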

Working hypothesis

Latent representations learned by deep models provide a computationally efficient and reliable basis for defining an Area of Applicability for Earth‑observation models.

  • Do KNN-based confidence scores in the latent space of a deep model correlate with model performance?
  • How do the chosen distance function and its parameters affect the results?
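To make the second question concrete, here is a small numpy sketch contrasting the two distance choices (function and variable names are my own; the actual implementation may differ, e.g. by using approximate nearest-neighbour search such as HNSW, Malkov and Yashunin 2020):

```python
import numpy as np

def knn_dissimilarity(x, feats, k=10, metric="euclidean"):
    """Mean distance from x to its k nearest neighbours in the latent
    feature matrix feats (n_samples x d)."""
    if metric == "cosine":
        a = x / np.linalg.norm(x)
        B = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        d = 1.0 - B @ a                      # cosine distance, range [0, 2]
    elif metric == "euclidean":
        d = np.linalg.norm(feats - x, axis=1)
    else:
        raise ValueError(metric)
    return np.sort(d)[:k].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
x = rng.normal(size=32)
print(knn_dissimilarity(x, feats, metric="euclidean"),
      knn_dissimilarity(x, feats, metric="cosine"))
```

Note that cosine distance is invariant to the scale of the latent vectors while Euclidean distance is not, which is one plausible reason the two metrics behave differently across tasks.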

Dataset

Dataset

  • We use the refined BigEarthNet (reBEN) dataset (Clasen et al. 2025), which is a large-scale benchmark for remote sensing image analysis
  • It consists of 480,038 Sentinel-2 image patches from 10 European countries, covering 125 tiles
  • Each image patch comes with a pixel-wise land cover classification based on the CORINE Land Cover dataset
  • This allows us to construct and analyze three different tasks:
    • C: image-level multi-label classification
    • R: image-level multi-target regression
    • S: pixel-level segmentation

Dataset

Figure 2: European countries covered by the refined BigEarthNet dataset (blue) (A) and training/validation/test split for an idealized Sentinel-2 tile (B).

Dataset

Number of images and tiles in the refined BigEarthNet (v2) dataset.
Country Train Validation Test Total Tiles
Austria 24,972 10,649 8,176 43,797 16
Belgium 7,837 3,217 142 11,196 6
Finland 74,293 40,132 40,802 155,227 44
Ireland 22,482 12,687 13,157 48,326 9
Kosovo 849 487 280 1,616 1
Lithuania 23,170 12,128 13,067 48,365 15
Luxembourg 1,888 1,131 441 3,460 4
Portugal 43,721 22,862 23,209 89,792 12
Serbia 36,967 17,906 18,512 73,385 17
Switzerland 1,692 1,143 2,039 4,874 1
Total 237,871 122,342 119,825 480,038 125

Methodology

Methodology

Figure 3: Workflow figure illustrating the study’s methodology.

Results

Results

Figure 4: Average change in error for leave-one-country-out evaluation (intercept) and effects of selection and cosine distance.
  • the intercepts for all three tasks are positive and significant, indicating that the error increases under leave-one-country-out evaluation (33% to 56%)
  • applying selection reduces the error (30-35%); the effect is significant for all tasks
  • for the regression task, using cosine distance reduces the error (3%)
  • for the classification and segmentation tasks, using cosine distance increases the error (13% and 6%)
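The coefficient tables in the backup slides look like the output of a linear mixed-effects model with a random intercept per country. A hedged sketch of how such a model could be fit with statsmodels on synthetic data (only the column and term names mirror the tables; everything else is made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; no real effects are encoded.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "error": rng.normal(40, 5, n),
    "split": rng.choice(["loco", "selected"], n),
    "method": rng.choice(["euclidean", "cosine"], n),
    "variance": rng.choice(["none", "0.95", "0.85", "0.75"], n),
    "k": rng.choice([1, 5, 10, 50, 100, 200], n),
    "country": rng.choice(list("ABCDEFGHIJ"), n),
})
# Fixed effects as in the coefficient tables, random intercept per country.
model = smf.mixedlm(
    "error ~ C(split) + C(method) + C(variance, Treatment('none')) + C(k)",
    df, groups=df["country"])
result = model.fit()
print(result.params.round(2))
```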

Results

Figure: results for (A) the classification task, (B) the regression task, and (C) the segmentation task.
  • for classification and segmentation, increasing K reduces the error, though the effect is not significant for every value of K
  • for regression, no effect of increasing K is observed
  • applying PCA decreases the error for the regression task, but increases it for the classification and segmentation tasks
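The variance levels in the backup tables (0.95, 0.85, 0.75) suggest that PCA is applied to the latent features before the KNN search, keeping the components that explain a given fraction of the total variance. A plain-numpy sketch of that step (my own implementation, not the study’s code):

```python
import numpy as np

def pca_reduce(feats, var_keep=0.95):
    """Project latent features onto the principal components that
    retain `var_keep` of the total variance."""
    mu = feats.mean(axis=0)
    X = feats - mu
    # SVD of the centred feature matrix; squared singular values are
    # proportional to the variance explained by each component.
    U, sv, Vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = np.cumsum(sv**2) / np.sum(sv**2)
    n_comp = int(np.searchsorted(var_ratio, var_keep) + 1)
    return X @ Vt[:n_comp].T, Vt[:n_comp], mu

rng = np.random.default_rng(1)
feats = rng.normal(size=(1000, 64))
Z, components, mu = pca_reduce(feats, var_keep=0.75)
print(Z.shape)
```

The KNN search would then run on the reduced matrix `Z` instead of the full latent features.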

Discussion

Discussion

  • Investigating the Area of Applicability for deep learning models in Earth observation is crucial for ensuring reliable and trustworthy model performance in real-world applications
  • Compared to the classical OOD detection setting, the AOA framework emphasizes the importance of defining a domain of application for a model based on estimated performance measures, rather than just detecting OOD samples
  • KNN-based confidence scores in the latent space of a deep model correlate with model performance, with a considerable effect size that depends on the task and the choice of distance function
  • the choice of distance function and its parameters moderately affects the results, but there is no clear pattern across tasks
  • further research is needed to explore if these findings translate to other datasets and tasks

References

Clasen, Kai Norman, Leonard Hackel, Tom Burgert, Gencer Sumbul, Begüm Demir, and Volker Markl. 2025. reBEN: Refined BigEarthNet Dataset for Remote Sensing Image Analysis. arXiv. https://doi.org/10.48550/arXiv.2407.03653.
Feranec, Jan, Tomas Soukup, Gerard Hazeu, and Gabriel Jaffrain, eds. 2016. European Landscape Dynamics: CORINE Land Cover Data. CRC Press. https://doi.org/10.1201/9781315372860.
Gawlikowski, Jakob, Sudipan Saha, Anna Kruspe, and Xiao Xiang Zhu. 2021. Out-of-Distribution Detection in Satellite Image Classification. arXiv:2104.05442. arXiv. https://doi.org/10.48550/arXiv.2104.05442.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. arXiv. https://doi.org/10.48550/arXiv.1706.04599.
Hendrycks, Dan, and Kevin Gimpel. 2018. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. arXiv. https://doi.org/10.48550/arXiv.1610.02136.
Jaeger, Paul F, Carsten Tim Lüth, Lukas Klein, and Till J. Bungert. 2023. “A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification.” International Conference on Learning Representations. https://openreview.net/forum?id=YnkGMIh0gvX.
Malkov, Yu A., and D. A. Yashunin. 2020. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE Trans. Pattern Anal. Mach. Intell. 42 (4): 824–36. https://doi.org/10.1109/TPAMI.2018.2889473.
Meyer, Hanna, and Edzer Pebesma. 2021. “Predicting into Unknown Space? Estimating the Area of Applicability of Spatial Prediction Models.” Methods in Ecology and Evolution 12 (9): 1620–33. https://doi.org/10.1111/2041-210X.13650.
Schumacher, Fabian Lukas, Christian Knoth, Marvin Ludwig, and Hanna Meyer. 2025. “Estimation of Local Training Data Point Densities to Support the Assessment of Spatial Prediction Uncertainty.” Geoscientific Model Development 18 (December): 10185–202. https://doi.org/10.5194/gmd-18-10185-2025.
Stewart, Adam J., Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. 2025. “TorchGeo: Deep Learning With Geospatial Data.” ACM Trans. Spatial Algorithms Syst. 11 (4): 15:1–28. https://doi.org/10.1145/3707459.
Sun, Yiyou, Yifei Ming, Xiaojin Zhu, and Yixuan Li. 2022. “Out-of-Distribution Detection with Deep Nearest Neighbors.” Proceedings of the 39th International Conference on Machine Learning, June 28, 20827–40. https://proceedings.mlr.press/v162/sun22d.html.

Backup slides

Classification results

term coef se z pvalue
Intercept 36.7726 3.07693 11.9511 6.40848e-33
C(split)[T.selected] -33.4603 0.556324 -60.1453 0
C(method)[T.cosine] 13.2992 0.556324 23.9054 2.69112e-126
C(variance)[T.0.95] 0.551059 0.786761 0.700414 0.483669
C(variance)[T.0.85] 1.36354 0.786761 1.73311 0.0830762
C(variance)[T.0.75] 1.80017 0.786761 2.28808 0.0221329
C(k)[T.5] -1.15813 0.963582 -1.2019 0.229404
C(k)[T.10] -1.8663 0.963582 -1.93684 0.0527655
C(k)[T.50] -3.14925 0.963582 -3.26827 0.00108205
C(k)[T.100] -3.55072 0.963582 -3.68492 0.000228775
C(k)[T.200] -4.18318 0.963582 -4.34128 1.41658e-05
Group Var 1.16 0.554292 2.09276 0.036371

Classification results

In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).

Regression results

term coef se z pvalue
Intercept 49.1432 6.49398 7.5675 3.80473e-14
C(split)[T.selected] -35.3053 0.338039 -104.441 0
C(method)[T.cosine] -3.34534 0.338039 -9.8963 4.31969e-23
C(variance)[T.0.95] 0.128365 0.47806 0.268511 0.788306
C(variance)[T.0.85] -2.26831 0.47806 -4.74481 2.08698e-06
C(variance)[T.0.75] -3.10488 0.47806 -6.49475 8.31706e-11
C(k)[T.5] -0.0912834 0.585501 -0.155906 0.876107
C(k)[T.10] -0.0374601 0.585501 -0.0639794 0.948987
C(k)[T.50] 0.0228955 0.585501 0.0391041 0.968807
C(k)[T.100] 0.00714034 0.585501 0.0121953 0.99027
C(k)[T.200] -0.0759929 0.585501 -0.129791 0.896732
Group Var 15.2626 7.22386 2.1128 0.0346178

Regression results

In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).

Segmentation results

term coef se z pvalue
Intercept 33.2899 3.65721 9.10254 8.82415e-20
C(split)[T.selected] -29.4211 0.391941 -75.0652 0
C(method)[T.cosine] 5.86584 0.391941 14.9661 1.22217e-50
C(variance)[T.0.95] 3.0211 0.554288 5.45041 5.02529e-08
C(variance)[T.0.85] 3.0675 0.554288 5.53412 3.12789e-08
C(variance)[T.0.75] 2.98585 0.554288 5.38682 7.17153e-08
C(k)[T.5] -0.132517 0.678862 -0.195204 0.845233
C(k)[T.10] -0.203769 0.678862 -0.300162 0.764053
C(k)[T.50] -0.864168 0.678862 -1.27297 0.20303
C(k)[T.100] -1.33516 0.678862 -1.96677 0.0492097
C(k)[T.200] -1.90233 0.678862 -2.80223 0.00507501
Group Var 3.51325 1.6661 2.10866 0.0349737

Segmentation results

In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).