2026-04-09
In practice, models may be applied to samples that lie outside their domain of application.
Figure 1: Example of overconfident predictions for out-of-distribution (OOD) samples (Hou 2023).
Neural network-based classifiers may silently fail when the test data distribution differs from the training data. For critical tasks such as medical diagnosis or autonomous driving, it is thus essential to detect incorrect predictions based on an indication of whether the classifier is likely to fail.
Jaeger et al. (2023)
We learn a model \(f: \mathcal{X} \to \mathcal{Y}\) that maps from input space to output space and combine it with a confidence score function \(s: \mathbb{R}^d \to \mathbb{R}\).
Combined with a threshold \(\lambda\), we induce a decision function \(g\):
\[ g_{\lambda}(x|s) = \mathbb{1}{[s(x)>\lambda]}. \tag{1}\]
We can now evaluate either whether \(g\) succeeds in detecting OOD samples, i.e. whether the label it assigns to \(x\),
\[ \begin{cases} \text{OOD}, & \text{if } g_{\lambda}(x|s) = 1, \\ \text{ID}, & \text{if } g_{\lambda}(x|s) = 0, \end{cases} \tag{2}\]
matches the true origin of the sample, or whether it reduces the model’s risk on the non-rejected samples:
\[ (f,g)(x) = \begin{cases} \text{reject}, & \text{if } g_{\lambda}(x|s) = 1, \\ f(x), & \text{if } g_{\lambda}(x|s) = 0. \end{cases} \tag{3}\]
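A minimal sketch of Eq. (1) to (3) on toy data may help make the two evaluation views concrete. It takes the equations as written, so samples whose score exceeds \(\lambda\) are flagged as OOD or rejected. The function names, the toy arrays, and the threshold value are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def decision(scores, lam):
    """Eq. (1): g = 1 where the score exceeds the threshold, else 0."""
    return (np.asarray(scores) > lam).astype(int)

def selective_risk(errors, scores, lam):
    """Eq. (3): mean error on the non-rejected samples (g = 0) and their share."""
    keep = decision(scores, lam) == 0
    coverage = float(keep.mean())
    risk = float(np.asarray(errors)[keep].mean()) if keep.any() else float("nan")
    return risk, coverage

# Hypothetical toy data: five samples with a score, a 0/1 model error,
# and a ground-truth OOD flag.
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
errors = np.array([0, 0, 1, 1, 1])
is_ood = np.array([0, 0, 0, 1, 1])

g = decision(scores, lam=0.5)

# View 1, Eq. (2): how often does g agree with the true ID/OOD origin?
detection_accuracy = float((g == is_ood).mean())

# View 2, Eq. (3): risk of f on the samples that were not rejected.
risk, coverage = selective_risk(errors, scores, lam=0.5)
print(detection_accuracy, risk, coverage)
```

Sweeping `lam` over a range of values traces out the trade-off between coverage and risk on the retained samples.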
The Area of Applicability (AOA) is the domain in which a model can be trusted to operate within its estimated performance measures.
Prior art in shallow machine learning:
How to translate to deep learning models?
Latent representations learned by deep models provide a computationally efficient and reliable basis for defining an Area of Applicability for Earth‑observation models.
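One way such a latent-space score could be computed is sketched below: reduce the training embeddings with PCA to a chosen share of explained variance, then score each query by its mean distance to the k nearest training embeddings. The parameters `variance`, `k`, and `method` mirror the factor names in the regression tables, but the pipeline itself is an assumption for illustration, not the study's implementation. The embeddings are random stand-ins.

```python
import numpy as np

def fit_pca(train_latents, variance=0.95):
    """Return the mean and the leading components that retain `variance`."""
    mean = train_latents.mean(axis=0)
    _, sing, vt = np.linalg.svd(train_latents - mean, full_matrices=False)
    explained = np.cumsum(sing**2) / np.sum(sing**2)
    n_comp = min(int(np.searchsorted(explained, variance)) + 1, len(sing))
    return mean, vt[:n_comp]

def project(latents, mean, components):
    """Project embeddings onto the retained principal components."""
    return (latents - mean) @ components.T

def knn_score(train_z, query_z, k=5, method="euclidean"):
    """Mean distance from each query to its k nearest training embeddings."""
    if method == "cosine":
        eps = 1e-12  # guards against division by zero for all-zero vectors
        a = train_z / (np.linalg.norm(train_z, axis=1, keepdims=True) + eps)
        b = query_z / (np.linalg.norm(query_z, axis=1, keepdims=True) + eps)
        dist = 1.0 - b @ a.T
    else:
        diff = query_z[:, None, :] - train_z[None, :, :]
        dist = np.linalg.norm(diff, axis=2)
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)

# Hypothetical embeddings: "near" queries follow the training distribution,
# "far" queries are shifted away from it.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 16))
near = rng.normal(size=(20, 16))
far = rng.normal(loc=6.0, size=(20, 16))

mean, comps = fit_pca(train, variance=0.95)
train_z = project(train, mean, comps)
near_scores = knn_score(train_z, project(near, mean, comps), k=5)
far_scores = knn_score(train_z, project(far, mean, comps), k=5)
print(near_scores.mean(), far_scores.mean())
```

The resulting score can serve as \(s(x)\) in Eq. (1), with larger values indicating samples farther from the training data in latent space.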
| Country | Train | Validation | Test | Total | Tiles |
|---|---|---|---|---|---|
| Austria | 24,972 | 10,649 | 8,176 | 43,797 | 16 |
| Belgium | 7,837 | 3,217 | 142 | 11,196 | 6 |
| Finland | 74,293 | 40,132 | 40,802 | 155,227 | 44 |
| Ireland | 22,482 | 12,687 | 13,157 | 48,326 | 9 |
| Kosovo | 849 | 487 | 280 | 1,616 | 1 |
| Lithuania | 23,170 | 12,128 | 13,067 | 48,365 | 15 |
| Luxembourg | 1,888 | 1,131 | 441 | 3,460 | 4 |
| Portugal | 43,721 | 22,862 | 23,209 | 89,792 | 12 |
| Serbia | 36,967 | 17,906 | 18,512 | 73,385 | 17 |
| Switzerland | 1,692 | 1,143 | 2,039 | 4,874 | 1 |
| Total | 237,871 | 122,342 | 119,825 | 480,038 | 125 |
Figure 3: Workflow of the study's methodology.
| Term | Coefficient | Std. error | z | p-value |
|---|---|---|---|---|
| Intercept | 36.7726 | 3.07693 | 11.9511 | 6.40848e-33 |
| C(split)[T.selected] | -33.4603 | 0.556324 | -60.1453 | 0 |
| C(method)[T.cosine] | 13.2992 | 0.556324 | 23.9054 | 2.69112e-126 |
| C(variance)[T.0.95] | 0.551059 | 0.786761 | 0.700414 | 0.483669 |
| C(variance)[T.0.85] | 1.36354 | 0.786761 | 1.73311 | 0.0830762 |
| C(variance)[T.0.75] | 1.80017 | 0.786761 | 2.28808 | 0.0221329 |
| C(k)[T.5] | -1.15813 | 0.963582 | -1.2019 | 0.229404 |
| C(k)[T.10] | -1.8663 | 0.963582 | -1.93684 | 0.0527655 |
| C(k)[T.50] | -3.14925 | 0.963582 | -3.26827 | 0.00108205 |
| C(k)[T.100] | -3.55072 | 0.963582 | -3.68492 | 0.000228775 |
| C(k)[T.200] | -4.18318 | 0.963582 | -4.34128 | 1.41658e-05 |
| Group Var | 1.16 | 0.554292 | 2.09276 | 0.036371 |
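The z and p-value columns follow from the coefficient and standard error columns, assuming the usual two-sided Wald test against a standard normal distribution. The sketch below recomputes them for one row using only the Python standard library. The helper name `wald_test` is hypothetical.

```python
import math

def wald_test(coef, se):
    """Two-sided Wald test: z = coef / se, p = 2 * (1 - Phi(|z|))."""
    z = coef / se
    # erfc(|z| / sqrt(2)) equals twice the upper tail of the standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Row C(variance)[T.0.75] from the table above.
z, p = wald_test(1.80017, 0.786761)
print(round(z, 4), round(p, 4))

# Row C(split)[T.selected]: |z| is about 60, so p underflows to 0.0
# in double precision.
z_split, p_split = wald_test(-33.4603, 0.556324)
print(p_split)
```

Under this reading, a reported p-value of 0 means a value smaller than double-precision floats can represent, not an exact zero.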
In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).
| Term | Coefficient | Std. error | z | p-value |
|---|---|---|---|---|
| Intercept | 49.1432 | 6.49398 | 7.5675 | 3.80473e-14 |
| C(split)[T.selected] | -35.3053 | 0.338039 | -104.441 | 0 |
| C(method)[T.cosine] | -3.34534 | 0.338039 | -9.8963 | 4.31969e-23 |
| C(variance)[T.0.95] | 0.128365 | 0.47806 | 0.268511 | 0.788306 |
| C(variance)[T.0.85] | -2.26831 | 0.47806 | -4.74481 | 2.08698e-06 |
| C(variance)[T.0.75] | -3.10488 | 0.47806 | -6.49475 | 8.31706e-11 |
| C(k)[T.5] | -0.0912834 | 0.585501 | -0.155906 | 0.876107 |
| C(k)[T.10] | -0.0374601 | 0.585501 | -0.0639794 | 0.948987 |
| C(k)[T.50] | 0.0228955 | 0.585501 | 0.0391041 | 0.968807 |
| C(k)[T.100] | 0.00714034 | 0.585501 | 0.0121953 | 0.99027 |
| C(k)[T.200] | -0.0759929 | 0.585501 | -0.129791 | 0.896732 |
| Group Var | 15.2626 | 7.22386 | 2.1128 | 0.0346178 |
In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).
| Term | Coefficient | Std. error | z | p-value |
|---|---|---|---|---|
| Intercept | 33.2899 | 3.65721 | 9.10254 | 8.82415e-20 |
| C(split)[T.selected] | -29.4211 | 0.391941 | -75.0652 | 0 |
| C(method)[T.cosine] | 5.86584 | 0.391941 | 14.9661 | 1.22217e-50 |
| C(variance)[T.0.95] | 3.0211 | 0.554288 | 5.45041 | 5.02529e-08 |
| C(variance)[T.0.85] | 3.0675 | 0.554288 | 5.53412 | 3.12789e-08 |
| C(variance)[T.0.75] | 2.98585 | 0.554288 | 5.38682 | 7.17153e-08 |
| C(k)[T.5] | -0.132517 | 0.678862 | -0.195204 | 0.845233 |
| C(k)[T.10] | -0.203769 | 0.678862 | -0.300162 | 0.764053 |
| C(k)[T.50] | -0.864168 | 0.678862 | -1.27297 | 0.20303 |
| C(k)[T.100] | -1.33516 | 0.678862 | -1.96677 | 0.0492097 |
| C(k)[T.200] | -1.90233 | 0.678862 | -2.80223 | 0.00507501 |
| Group Var | 3.51325 | 1.6661 | 2.10866 | 0.0349737 |
In-distribution error versus out-of-distribution error (each dot represents the average over a country dataset).
SYNFONY Workshop - April 9th, 2026, Offenbach