Scaling Image Geo-Localization to Continent Level
Abstract
Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach localizes more than 68% of queries to within 200m on a dataset covering a large part of Europe.
Localization process. The prototypes are extracted from the model weights and upsampled to the target resolution using the S2Cell hierarchy. Aerial tiles roughly covering the cell are encoded using the aerial encoder and concatenated. Both databases are combined per-cell using a calibration factor, resulting in the final database of cell codes. During inference (right), we extract the embedding of a query image with the spatial encoder and compute its similarity to all cell codes. The estimated location is the cell with the highest similarity.
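The process described in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The caption does not say how the two databases are combined, so the sketch assumes a weighted sum of normalized vectors scaled by a scalar `alpha` as a stand-in for the calibration factor. Upsampling through the S2Cell hierarchy is approximated by a hypothetical `parent_index` array that maps each fine cell to its coarse parent. All function and parameter names are illustrative.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_cell_codes(prototypes, parent_index, aerial_embeddings, alpha=1.0):
    """Combine prototypes and aerial embeddings into one code per fine cell.

    prototypes:        (P, D) one learned prototype per coarse cell.
    parent_index:      (N,)   coarse-cell index for each fine cell (assumed
                              to come from the cell hierarchy).
    aerial_embeddings: (N, D) one aerial embedding per fine cell.
    alpha:             assumed scalar calibration factor.
    """
    # Upsample: every fine cell inherits the prototype of its parent.
    upsampled = prototypes[parent_index]
    # Assumed combination rule: weighted sum of unit vectors.
    codes = l2_normalize(upsampled) + alpha * l2_normalize(aerial_embeddings)
    return l2_normalize(codes)


def localize(query_embedding, cell_codes):
    """Return the best cell index and the similarity to every cell."""
    scores = cell_codes @ l2_normalize(query_embedding)
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prototypes = rng.normal(size=(4, 16))
    parent_index = np.repeat(np.arange(4), 2)  # 8 fine cells, 2 per parent
    aerial = rng.normal(size=(8, 16))
    codes = build_cell_codes(prototypes, parent_index, aerial, alpha=0.5)
    best, _ = localize(rng.normal(size=16), codes)
    print("predicted cell:", best)
```

Because the cell codes are unit-normalized, the dot product is a cosine similarity and the whole database can be scored with a single matrix-vector product. At the database sizes discussed here, an approximate nearest-neighbor index would likely replace the dense product.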
Left: PCA visualization of the learned prototypes, which take on different colors for, e.g., urban, forested, or coastal areas. The high-frequency noise suggests that they also encode local distinctive information. Right: Test queries that are successfully localized (🟢) are uniformly distributed over the map, while failures (🔴) are prevalent in rural areas, where training data is sparser.
Left: Cell code PCA visualizations over BEDENL show smooth and informative structures that correlate with geographic patterns. Right: Self-similarities between a prototype and its top 50k neighbors. Red and blue correspond to high and low similarities, respectively. The prototypes are almost fully orthogonal, yet locally smooth.
Localization examples of easy, medium, and difficult cases, along with their rank (the position of the first cell within 200m in the sorted database list according to descriptor similarity).
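The rank defined in this caption can be computed as in the sketch below. It assumes that each cell is represented by the latitude and longitude of its center, that distances are great-circle (haversine) distances, and that ranks are 1-based. None of these details are stated in the caption, and the function names are illustrative.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def first_hit_rank(scores, cell_latlon, query_latlon, radius_m=200.0):
    """1-based position of the first cell within `radius_m` of the query.

    scores:       (N,)   similarity of the query to each cell.
    cell_latlon:  (N, 2) cell centers as (lat, lon) in degrees (assumed).
    query_latlon: (2,)   ground-truth query location in degrees.
    Returns None if no cell lies within the radius.
    """
    order = np.argsort(-scores)  # most similar cell first
    dist = haversine_m(cell_latlon[order, 0], cell_latlon[order, 1],
                       query_latlon[0], query_latlon[1])
    hits = np.flatnonzero(dist <= radius_m)
    return int(hits[0]) + 1 if hits.size else None
```

Under these assumptions, a query with rank 1 is one whose most similar cell lies within 200m of the true location, so the fraction of rank-1 queries gives the recall at 200m.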
BibTeX
@inproceedings{lindenberger2025scaling,
title={Scaling Image Geo-Localization to Continent Level},
author={Lindenberger, Philipp and Sarlin, Paul-Edouard and Hosang, Jan and Balice, Matteo and Pollefeys, Marc and Lynen, Simon and Trulls, Eduard},
booktitle={NeurIPS},
year={2025},
}