Area of Applicability for Deep Learning: Exploring Latent Space Geometry of Earth Observation Models

Darius A. Görgen
University of Münster

2026-04-09

Problem setting

Problem setting

In the real world, models might be used for inference on samples that lie outside their domain of application.

Figure 1: Example of overconfident predictions for OOD samples (Hou 2023).

Neural network-based classifiers may silently fail when the test data distribution differs from the training data. For critical tasks such as medical diagnosis or autonomous driving, it is thus essential to detect incorrect predictions based on an indication of whether the classifier is likely to fail.
Jaeger et al. (2023)

Problem setting

We learn a model that maps from input space to output space, \(f: \mathcal{X} \to \mathcal{Y}\), and combine it with a confidence score function \(s: \mathbb{R}^d \to \mathbb{R}\).

Combined with a threshold \(\lambda\), we induce a decision function \(g\):

\[ g_{\lambda}(x|s) = \mathbb{1}[s(x) > \lambda]. \tag{1}\]

We can now either evaluate whether \(g\) successfully detects OOD samples: \[ x \; \text{is} \; \begin{cases} \text{OOD}, \; \text{if} \; g(x) = 1, \\ \text{ID}, \; \text{if} \; g(x) = 0, \end{cases} \tag{2}\]

or evaluate whether the combined system \((f,g)\) reduces the model’s risk on the non-rejected samples:

\[ (f,g)(x) = \begin{cases} reject, \; \text{if} \; g(x) = 1, \\ f(x), \; \text{if} \; g(x) = 0. \end{cases} \tag{3}\]
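Equations 1–3 can be sketched in a few lines of numpy, with a KNN-based dissimilarity score standing in for \(s\) (all names, the toy data, and the threshold value are illustrative, not taken from the study):

```python
import numpy as np

def s(x, train_feats, k=5):
    # Dissimilarity score: mean Euclidean distance from x to its k
    # nearest neighbours in the training features (larger = less familiar).
    d = np.linalg.norm(train_feats - x, axis=1)
    return np.sort(d)[:k].mean()

def g(x, train_feats, lam, k=5):
    # Decision function g_lambda(x|s) = 1[s(x) > lambda]; 1 = reject/OOD.
    return int(s(x, train_feats, k) > lam)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))   # toy latent features
x_id = rng.normal(0.0, 1.0, size=8)           # sample from the same distribution
x_ood = rng.normal(8.0, 1.0, size=8)          # sample far from the training data
lam = 3.0                                     # threshold chosen for illustration
print(g(x_id, train, lam), g(x_ood, train, lam))
```

In the selective-prediction view of Equation 3, \(g(x) = 1\) triggers rejection and \(g(x) = 0\) falls through to \(f(x)\).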

Area of Applicability

  • Area of Applicability (AOA) defines the domain in which a model can be trusted to operate within its estimated performance measures

  • Prior art in shallow machine learning: dissimilarity-index-based AOA estimation for spatial prediction models (Meyer and Pebesma 2021; Schumacher et al. 2025)

  • How to translate this to deep learning models?

    • inputs are often high-dimensional and complex (images, time series, etc.)
    • input space distances ignore the model’s internal representation of learned (spatial) features
    • CNNs learn compact and task-relevant latent representations of the input data
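How the compact latent vector is extracted is not spelled out on this slide; a common choice (an assumption here, not necessarily what the study does) is global average pooling of the final convolutional feature map:

```python
import numpy as np

def global_avg_pool(feature_map):
    """Collapse an (H, W, C) convolutional feature map into a
    C-dimensional latent vector by averaging over the spatial axes."""
    return feature_map.mean(axis=(0, 1))

# Toy feature map shaped like a typical ResNet final activation.
fmap = np.random.default_rng(0).normal(size=(7, 7, 512))
z = global_avg_pool(fmap)
print(z.shape)  # (512,)
```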

Working hypothesis

Latent representations learned by deep models provide a computationally efficient and reliable basis for defining an Area of Applicability for Earth‑observation models.

  • Do KNN-based confidence scores in the latent space of a deep model correlate with model performance?
  • How do the chosen distance function and its parameters affect the results?
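To make the second question concrete, here is a small numpy sketch contrasting the two distance choices (function and variable names are my own; the actual implementation may differ, e.g. by using approximate nearest-neighbour search such as HNSW, Malkov and Yashunin 2020):

```python
import numpy as np

def knn_dissimilarity(x, feats, k=10, metric="euclidean"):
    """Mean distance from x to its k nearest neighbours in the latent
    feature matrix feats (n_samples x d)."""
    if metric == "cosine":
        a = x / np.linalg.norm(x)
        B = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        d = 1.0 - B @ a                      # cosine distance, range [0, 2]
    elif metric == "euclidean":
        d = np.linalg.norm(feats - x, axis=1)
    else:
        raise ValueError(metric)
    return np.sort(d)[:k].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
x = rng.normal(size=32)
print(knn_dissimilarity(x, feats, metric="euclidean"),
      knn_dissimilarity(x, feats, metric="cosine"))
```

Note that cosine distance is invariant to the scale of the latent vectors while Euclidean distance is not, which is one plausible reason the two metrics behave differently across tasks.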

Dataset

Dataset

  • We use the refined BigEarthNet (reBEN) dataset (Clasen et al. 2025), which is a large-scale benchmark for remote sensing image analysis
  • It consists of 480,038 Sentinel-2 image patches from 10 European countries, covering 125 tiles
  • Each image patch comes with a pixel-wise land cover classification based on the CORINE Land Cover dataset
  • This allows us to construct and analyze three different tasks:
    • C: image-level multi-label classification
    • R: image-level multi-target regression
    • S: pixel-level segmentation

Dataset

Figure 2: European countries covered by the refined BigEarthNet dataset (blue) (A) and training/validation/test split for an idealized Sentinel-2 tile (B).

Dataset

Number of images and tiles in the refined BigEarthNet (v2) dataset.
Country Train Validation Test Total Tiles
Austria 24,972 10,649 8,176 43,797 16
Belgium 7,837 3,217 142 11,196 6
Finland 74,293 40,132 40,802 155,227 44
Ireland 22,482 12,687 13,157 48,326 9
Kosovo 849 487 280 1,616 1
Lithuania 23,170 12,128 13,067 48,365 15
Luxembourg 1,888 1,131 441 3,460 4
Portugal 43,721 22,862 23,209 89,792 12
Serbia 36,967 17,906 18,512 73,385 17
Switzerland 1,692 1,143 2,039 4,874 1
Total 237,871 122,342 119,825 480,038 125

Methodology

Methodology

Figure 3: Workflow figure illustrating the study’s methodology.

Results

Results

Figure 4: Average change in error for leave-one-country-out evaluation (intercept) and effects of selection and cosine distance.
  • the intercepts for all three tasks are positive and significant, indicating that the error increases under leave-one-country-out evaluation (33% to 56%)
  • applying selection reduces the error (30-35%); the effect is significant for all tasks
  • for the regression task, using cosine distance reduces the error (3%)
  • for the classification and segmentation tasks, using cosine distance increases the error (13% and 6%)
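The coefficient tables in the backup slides look like the output of a linear mixed-effects model with a random intercept per country. A hedged sketch of how such a model could be fit with statsmodels on synthetic data (only the column and term names mirror the tables; everything else is made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; no real effects are encoded.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "error": rng.normal(40, 5, n),
    "split": rng.choice(["loco", "selected"], n),
    "method": rng.choice(["euclidean", "cosine"], n),
    "variance": rng.choice(["none", "0.95", "0.85", "0.75"], n),
    "k": rng.choice([1, 5, 10, 50, 100, 200], n),
    "country": rng.choice(list("ABCDEFGHIJ"), n),
})
# Fixed effects as in the coefficient tables, random intercept per country.
model = smf.mixedlm(
    "error ~ C(split) + C(method) + C(variance, Treatment('none')) + C(k)",
    df, groups=df["country"])
result = model.fit()
print(result.params.round(2))
```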

Results

Figure: results for (A) the classification task, (B) the regression task, and (C) the segmentation task.
  • for classification and segmentation, increasing K reduces the error, though the effect is not significant for every value of K
  • for regression, no effect of increasing K is observed
  • applying PCA decreases the error for the regression task, but increases it for the classification and segmentation tasks
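The variance levels in the backup tables (0.95, 0.85, 0.75) suggest that PCA is applied to the latent features before the KNN search, keeping the components that explain a given fraction of the total variance. A plain-numpy sketch of that step (my own implementation, not the study’s code):

```python
import numpy as np

def pca_reduce(feats, var_keep=0.95):
    """Project latent features onto the principal components that
    retain `var_keep` of the total variance."""
    mu = feats.mean(axis=0)
    X = feats - mu
    # SVD of the centred feature matrix; squared singular values are
    # proportional to the variance explained by each component.
    U, sv, Vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = np.cumsum(sv**2) / np.sum(sv**2)
    n_comp = int(np.searchsorted(var_ratio, var_keep) + 1)
    return X @ Vt[:n_comp].T, Vt[:n_comp], mu

rng = np.random.default_rng(1)
feats = rng.normal(size=(1000, 64))
Z, components, mu = pca_reduce(feats, var_keep=0.75)
print(Z.shape)
```

The KNN search would then run on the reduced matrix `Z` instead of the full latent features.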

Discussion

Discussion

  • Investigating the Area of Applicability for deep learning models in Earth observation is crucial for ensuring reliable and trustworthy model performance in real-world applications
  • Compared to the classical OOD detection setting, the AOA framework emphasizes the importance of defining a domain of application for a model based on estimated performance measures, rather than just detecting OOD samples
  • KNN-based confidence scores in the latent space of a deep model correlate with model performance, with a considerable effect size that depends on the task and the choice of distance function
  • the choice of distance function and its parameters moderately affects the results, but there is no clear pattern across tasks
  • further research is needed to explore if these findings translate to other datasets and tasks

References

Clasen, Kai Norman, Leonard Hackel, Tom Burgert, Gencer Sumbul, Begüm Demir, and Volker Markl. 2025. reBEN: Refined BigEarthNet Dataset for Remote Sensing Image Analysis. arXiv. https://doi.org/10.48550/arXiv.2407.03653.
Feranec, Jan, Tomas Soukup, Gerard Hazeu, and Gabriel Jaffrain, eds. 2016. European Landscape Dynamics: CORINE Land Cover Data. CRC Press. https://doi.org/10.1201/9781315372860.
Gawlikowski, Jakob, Sudipan Saha, Anna Kruspe, and Xiao Xiang Zhu. 2021. Out-of-Distribution Detection in Satellite Image Classification. arXiv:2104.05442. arXiv. https://doi.org/10.48550/arXiv.2104.05442.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. arXiv. https://doi.org/10.48550/arXiv.1706.04599.
Hendrycks, Dan, and Kevin Gimpel. 2018. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. arXiv. https://doi.org/10.48550/arXiv.1610.02136.
Jaeger, Paul F, Carsten Tim Lüth, Lukas Klein, and Till J. Bungert. 2023. “A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification.” International Conference on Learning Representations. https://openreview.net/forum?id=YnkGMIh0gvX.
Malkov, Yu A., and D. A. Yashunin. 2020. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE Trans. Pattern Anal. Mach. Intell. 42 (4): 824–36. https://doi.org/10.1109/TPAMI.2018.2889473.
Meyer, Hanna, and Edzer Pebesma. 2021. “Predicting into Unknown Space? Estimating the Area of Applicability of Spatial Prediction Models.” Methods in Ecology and Evolution 12 (9): 1620–33. https://doi.org/10.1111/2041-210X.13650.
Schumacher, Fabian Lukas, Christian Knoth, Marvin Ludwig, and Hanna Meyer. 2025. “Estimation of Local Training Data Point Densities to Support the Assessment of Spatial Prediction Uncertainty.” Geoscientific Model Development 18 (December): 10185–202. https://doi.org/10.5194/gmd-18-10185-2025.
Stewart, Adam J., Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. 2025. “TorchGeo: Deep Learning With Geospatial Data.” ACM Trans. Spatial Algorithms Syst. 11 (4): 15:1–28. https://doi.org/10.1145/3707459.
Sun, Yiyou, Yifei Ming, Xiaojin Zhu, and Yixuan Li. 2022. “Out-of-Distribution Detection with Deep Nearest Neighbors.” Proceedings of the 39th International Conference on Machine Learning, June 28, 20827–40. https://proceedings.mlr.press/v162/sun22d.html.

Backup slides

Classification results

term coef se z pvalue
Intercept 36.7726 3.07693 11.9511 6.40848e-33
C(split)[T.selected] -33.4603 0.556324 -60.1453 0
C(method)[T.cosine] 13.2992 0.556324 23.9054 2.69112e-126
C(variance)[T.0.95] 0.551059 0.786761 0.700414 0.483669
C(variance)[T.0.85] 1.36354 0.786761 1.73311 0.0830762
C(variance)[T.0.75] 1.80017 0.786761 2.28808 0.0221329
C(k)[T.5] -1.15813 0.963582 -1.2019 0.229404
C(k)[T.10] -1.8663 0.963582 -1.93684 0.0527655
C(k)[T.50] -3.14925 0.963582 -3.26827 0.00108205
C(k)[T.100] -3.55072 0.963582 -3.68492 0.000228775
C(k)[T.200] -4.18318 0.963582 -4.34128 1.41658e-05
Group Var 1.16 0.554292 2.09276 0.036371

Classification results

In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).

Regression results

term coef se z pvalue
Intercept 49.1432 6.49398 7.5675 3.80473e-14
C(split)[T.selected] -35.3053 0.338039 -104.441 0
C(method)[T.cosine] -3.34534 0.338039 -9.8963 4.31969e-23
C(variance)[T.0.95] 0.128365 0.47806 0.268511 0.788306
C(variance)[T.0.85] -2.26831 0.47806 -4.74481 2.08698e-06
C(variance)[T.0.75] -3.10488 0.47806 -6.49475 8.31706e-11
C(k)[T.5] -0.0912834 0.585501 -0.155906 0.876107
C(k)[T.10] -0.0374601 0.585501 -0.0639794 0.948987
C(k)[T.50] 0.0228955 0.585501 0.0391041 0.968807
C(k)[T.100] 0.00714034 0.585501 0.0121953 0.99027
C(k)[T.200] -0.0759929 0.585501 -0.129791 0.896732
Group Var 15.2626 7.22386 2.1128 0.0346178

Regression results

In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).

Segmentation results

term coef se z pvalue
Intercept 33.2899 3.65721 9.10254 8.82415e-20
C(split)[T.selected] -29.4211 0.391941 -75.0652 0
C(method)[T.cosine] 5.86584 0.391941 14.9661 1.22217e-50
C(variance)[T.0.95] 3.0211 0.554288 5.45041 5.02529e-08
C(variance)[T.0.85] 3.0675 0.554288 5.53412 3.12789e-08
C(variance)[T.0.75] 2.98585 0.554288 5.38682 7.17153e-08
C(k)[T.5] -0.132517 0.678862 -0.195204 0.845233
C(k)[T.10] -0.203769 0.678862 -0.300162 0.764053
C(k)[T.50] -0.864168 0.678862 -1.27297 0.20303
C(k)[T.100] -1.33516 0.678862 -1.96677 0.0492097
C(k)[T.200] -1.90233 0.678862 -2.80223 0.00507501
Group Var 3.51325 1.6661 2.10866 0.0349737

Segmentation results

In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).