Emerging transformer-based vision models for geospatial data—also known as geospatial foundation models (GeoFMs)—offer a new and powerful technology for mapping the earth's surface at a continental scale, providing stakeholders with the tooling to detect and monitor surface-level ecosystem conditions such as forest degradation, natural disaster impact, crop yield, and many others.
GeoFMs represent an emerging research field and are a type of pre-trained vision transformer (ViT) specifically adapted to geospatial data sources. GeoFMs offer immediate value without training. The models excel as embedding models for geospatial similarity search and ecosystem change detection. With minimal labeled data, GeoFMs can be fine-tuned for custom tasks such as land surface classification, semantic segmentation, or pixel-level regression. Many leading models are available under very permissive licenses, making them accessible to a wide audience. Examples include SatVision-Base, Prithvi-100M, SatMAE, and Clay (used in this solution).
In this post, we explore how Clay Foundation's Clay foundation model, available on Hugging Face, can be deployed for large-scale inference and fine-tuning on Amazon SageMaker. For illustrative purposes, we focus on a deforestation use case from the Amazon rainforest, one of the most biodiverse ecosystems in the world. Given the strong evidence that the Amazon forest system could soon be reaching a tipping point, it presents an important area of study and a high-impact application area for GeoFMs, for example, through early detection of forest degradation. However, the solution presented here generalizes to a wide range of geospatial use cases. It also comes with ready-to-deploy code samples to help you get started quickly with deploying GeoFMs in your own applications on AWS.
Let’s dive in!
Solution overview
At the core of our solution is a GeoFM. Architecturally, GeoFMs build on the ViT architecture first introduced in the seminal 2021 research paper An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. To account for the specific properties of geospatial data (multiple channels ranging from ultraviolet to infrared, varying electromagnetic spectrum coverage, and the spatio-temporal nature of the data), GeoFMs incorporate several architectural innovations such as variable input size (to capture multiple channels) or the addition of positional embeddings that capture spatio-temporal aspects such as seasonality and location on earth. The pre-training of these models is performed on unlabeled geospatial data sampled from across the globe using masked autoencoders (MAE) as self-supervised learners. Sampling from global-scale data helps ensure that diverse ecosystems and surface types are represented appropriately in the training set. The result is general-purpose models that can be used for three core use cases:
- Geospatial similarity search: Quickly map diverse surface types with semantic geospatial search, using the embeddings to find similar items (such as deforested areas).
- Embedding-based change detection: Analyze a time series of geospatial embeddings to identify surface disruptions over time for a specific area.
- Custom geospatial machine learning: Fine-tune a specialized regression, classification, or segmentation model for geospatial machine learning (ML) tasks. While this requires a certain amount of labeled data, overall data requirements are typically much lower compared to training a dedicated model from the ground up.
The general solution flow is shown in the following diagram. Note that this flow diagram is highly abstracted and omits certain architectural details for clarity. For a full architecture diagram demonstrating how the flow can be implemented on AWS, see the accompanying GitHub repository. The repository also contains detailed deployment instructions to get you started quickly with applying GeoFMs to your own use cases.
- Retrieve and process satellite imagery for GeoFM inference or training: The first step is to get the raw geospatial data into a format that is consumable by the GeoFM. This entails breaking down the large raw satellite imagery into equally sized 256×256 pixel chips (the size that the model expects) and normalizing pixel values, among other data preparation steps required by the GeoFM that you choose. This routine can be performed at scale using an Amazon SageMaker AI processing job.
- Retrieve model weights and deploy the GeoFM: Next, retrieve the open weights of the GeoFM from a model registry of your choice (Hugging Face in this example) and deploy the model for inference. The best deployment option ultimately depends on how the model is consumed. If you need to generate embeddings asynchronously, use a SageMaker AI processing or transform step. For real-time inference, consider deploying to a SageMaker AI real-time endpoint, which can be configured to auto-scale with demand, allowing for large-scale inference. In this example, we use a SageMaker AI processing job with a custom Docker image for generating embeddings in batch.
- Generate geospatial embeddings: The GeoFM is an encoder-only model, meaning that it outputs an embedding vector. During inference, you perform a forward pass of the pre-processed satellite image chip through the GeoFM. This produces the corresponding embedding vector, which can be thought of as a compressed representation of the information contained in the image. This process is analogous to using text embedding models for RAG use cases (a minimal code sketch of this flow follows this list).
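The sketch below mimics that chip-normalize-embed sequence in Python. The stand-in encoder, band statistics, and random chip are placeholders rather than the actual Clay v1 code, which the accompanying repository loads from Hugging Face.

```python
# Minimal sketch of the chip -> normalize -> embed flow (stand-in encoder, not Clay v1).
import numpy as np
import torch
import torch.nn as nn

CHIP_SIZE = 256   # pixels, as expected by Clay v1
BANDS = 4         # e.g., RGB + NIR
EMBED_DIM = 768   # Clay v1 embedding dimension

class StandInEncoder(nn.Module):
    """Placeholder for the GeoFM encoder (encoder-only, outputs one vector per chip)."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(BANDS, EMBED_DIM)

    def forward(self, x):                    # x: (batch, bands, 256, 256)
        pooled = self.pool(x).flatten(1)     # (batch, bands)
        return self.proj(pooled)             # (batch, 768)

def normalize_chip(chip, means, stds):
    """Per-band normalization; the real band statistics come from the model's metadata."""
    return (chip - means[:, None, None]) / stds[:, None, None]

# Fake chip standing in for a pre-processed 256x256 satellite crop
chip = np.random.rand(BANDS, CHIP_SIZE, CHIP_SIZE).astype("float32")
chip = normalize_chip(chip, means=np.full(BANDS, 0.5, dtype="float32"),
                      stds=np.full(BANDS, 0.25, dtype="float32"))

encoder = StandInEncoder().eval()
with torch.no_grad():
    embedding = encoder(torch.from_numpy(chip).unsqueeze(0))
print(embedding.shape)  # torch.Size([1, 768])
```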
The generated geospatial embeddings can be used largely as-is for two key use cases: geospatial similarity search and ecosystem change detection.
- Run similarity search on the embeddings to identify semantically similar images: The GeoFM embeddings reside in the same vector space. This allows us to find similar items by locating vectors that are very close to a given query point. A common high-performance search algorithm for this is approximate nearest neighbor (ANN). For scalability and search performance, we index the embedding vectors in a vector database.
- Analyze time series of embeddings for break points that indicate change: Instead of looking for similarity between embedding vectors, you can also look at their distance. Doing this for a specific region across time lets you pinpoint specific moments where change occurs. This allows you to use embeddings for surface change detection over time, a very common use case in geospatial analytics.
Optionally, you can also fine-tune a model on top of the GeoFM.
- Train a custom head and run inference: To fine-tune a model, you add a custom (and typically lightweight) head on top of the GeoFM and fine-tune it on an (often small) labeled dataset. The GeoFM weights remain frozen and are not retrained. The custom head takes the GeoFM-generated embedding vectors as input and produces classification masks, pixel-level regression results, or simply a class per image, depending on the use case.
We explore the key steps of this workflow in the following sections. For additional details on the implementation—including how to build a high-quality user interface with Solara—see the accompanying GitHub repository.
Geospatial data processing and embedding generation
Our four-stage data processing pipeline transforms raw satellite imagery into analysis-ready vector embeddings that power advanced geospatial analytics. This orchestrated workflow uses Amazon SageMaker AI Pipelines to create a robust, reproducible, and scalable processing architecture. The end-to-end solution can process Earth observation data for a particular area of interest, with built-in flexibility to adapt to different use cases. In this example, we use Sentinel-2 imagery from the Amazon Registry of Open Data to monitor deforestation in the Brazilian rainforest. However, the pipeline architecture is designed to work seamlessly with other satellite image providers and resolutions (such as NAIP at 1 m/pixel, or Maxar and Planet Labs down to below 1 m/pixel).
Pipeline architecture overview
The SageMaker pipeline consists of four processing steps, shown in the preceding figure. Each step builds on the outputs of the previous steps, with intermediate results stored in Amazon Simple Storage Service (Amazon S3). A hedged sketch of how these steps can be assembled with the SageMaker Python SDK follows the list.
- Pre-process satellite tiles: Divides the satellite imagery into chips. We chose a chip size of 256×256 pixels as expected by Clay v1. For Sentinel-2 images this corresponds to an area of 2.56 × 2.56 km².
- Generate embeddings: Creates 768-dimensional vector representations of the chips using the Clay v1 model.
- Process embeddings: Performs dimensionality reduction and computes similarity metrics (for downstream analyses).
- Consolidate and index: Consolidates the outputs and loads the embedding vectors into a vector store.
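Here is one way the four steps could be wired together. The image URI, script names, and instance types are placeholders; the accompanying GitHub repository contains the actual pipeline definition.

```python
# Hedged sketch of the four-step SageMaker pipeline (placeholder image URI and scripts).
import sagemaker
from sagemaker.processing import ProcessingOutput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

role = sagemaker.get_execution_role()

def processing_step(name, script, instance_type="ml.m5.xlarge", depends_on=None):
    """Wrap one pipeline stage as a SageMaker processing step."""
    processor = ScriptProcessor(
        image_uri="<custom-geospatial-image-uri>",  # placeholder for the custom Docker image
        command=["python3"],
        role=role,
        instance_count=1,
        instance_type=instance_type,
    )
    return ProcessingStep(
        name=name,
        processor=processor,
        code=script,  # placeholder script names
        outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
        depends_on=depends_on,
    )

preprocess = processing_step("PreprocessSatelliteTiles", "preprocess_tiles.py")
embed = processing_step("GenerateEmbeddings", "generate_embeddings.py",
                        instance_type="ml.g5.xlarge", depends_on=[preprocess])
process = processing_step("ProcessEmbeddings", "process_embeddings.py", depends_on=[embed])
index = processing_step("ConsolidateAndIndex", "consolidate_index.py", depends_on=[process])

pipeline = Pipeline(name="geofm-embedding-pipeline",
                    steps=[preprocess, embed, process, index])
# pipeline.upsert(role_arn=role)
# pipeline.start()
```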
Step 1: Satellite data acquisition and chipping
The pipeline begins by accessing Sentinel-2 multispectral satellite imagery through the AWS Open Data program from S3 buckets. This imagery provides 10-meter resolution across multiple spectral bands, including RGB (visible light) and NIR (near-infrared), which are essential for environmental monitoring.
This step filters out chips with excessive cloud cover and divides large satellite scenes into manageable 256×256 pixel chips, which enables efficient parallel processing and creates uniform inputs for the foundation model. This step also runs as a SageMaker AI Processing job with a custom Docker image optimized for geospatial operations. A minimal sketch of the chipping and cloud-filtering logic follows the output list below.
For each chip, this step generates:
- NetCDF datacubes (.netcdf) containing the full multispectral information
- RGB thumbnails (.png) for visualization
- Rich metadata (.parquet) with geolocation, timestamps, and other attributes
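Here is that sketch. It assumes rasterio is available; the scene path, cloud-mask band index, and 20% cloud threshold are illustrative placeholders rather than the pipeline's exact values.

```python
# Minimal chipping and cloud-filtering sketch (placeholder thresholds and paths).
import rasterio
from rasterio.windows import Window

CHIP = 256                  # chip size in pixels, as expected by Clay v1
MAX_CLOUD_FRACTION = 0.2    # assumption: drop chips with >20% cloud cover

def iter_chips(scene_path, cloud_band_index=None):
    """Yield (window, chip) pairs of size 256x256 from a satellite scene."""
    with rasterio.open(scene_path) as src:
        for row in range(0, src.height - CHIP + 1, CHIP):
            for col in range(0, src.width - CHIP + 1, CHIP):
                window = Window(col, row, CHIP, CHIP)
                chip = src.read(window=window)          # shape: (bands, 256, 256)
                if cloud_band_index is not None:
                    cloud_fraction = (chip[cloud_band_index] > 0).mean()
                    if cloud_fraction > MAX_CLOUD_FRACTION:
                        continue                         # skip excessively cloudy chips
                yield window, chip

# Downstream, each chip would be written out as a NetCDF datacube, a PNG
# thumbnail, and a Parquet metadata record, as described above.
# for window, chip in iter_chips("sentinel2_scene.tif"):
#     ...
```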
Step 2: Embedding generation using the Clay foundation model
The second step transforms the preprocessed image chips into vector embeddings using the Clay v1 foundation model. This is the most computationally intensive part of the pipeline, using multiple GPU instances (ml.g5.xlarge) to efficiently process the satellite imagery.
For each chip, this step:
- Accesses the NetCDF datacube from Amazon S3
- Normalizes the spectral bands according to the Clay v1 model's input requirements
- Generates both patch-level and class token (CLS) embeddings
- Stores the embeddings as NumPy arrays (.npy) alongside the original data on S3 as an intermediate store
While Clay can use all Sentinel-2 spectral bands, our implementation uses RGB and NIR as input bands to generate a 768-dimensional embedding, which provides excellent results in our examples. Customers can easily adapt the input bands based on their specific use cases. These embeddings encapsulate high-level features such as vegetation patterns, urban structures, water bodies, and land use characteristics—without requiring explicit feature engineering.
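To illustrate how patch-level and CLS embeddings can be separated from a ViT-style encoder output, the sketch below uses a generic timm ViT as a stand-in for Clay v1; the real pipeline uses the Clay model code and its own band normalization.

```python
# Separating CLS and patch embeddings from a ViT-style encoder (timm ViT as stand-in).
import timm
import torch

vit = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
x = torch.rand(1, 3, 224, 224)   # stand-in for a normalized chip tensor

with torch.no_grad():
    # With recent timm versions, forward_features returns the full token
    # sequence: (batch, 1 + num_patches, embed_dim)
    tokens = vit.forward_features(x)

cls_embedding = tokens[:, 0]       # (1, 768): one chip-level embedding
patch_embeddings = tokens[:, 1:]   # (1, 196, 768): one embedding per image patch
print(cls_embedding.shape, patch_embeddings.shape)
```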
Step 3: Embedding processing and analysis
The third step analyzes the embeddings to extract meaningful insights, particularly for time-series analysis. Running on high-memory instances, this step:
- Performs dimensionality reduction on the embeddings using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) (used later for change detection)
- Computes cosine similarity between embeddings over time (an alternative approach for change detection)
- Identifies significant changes in the embeddings that might indicate surface changes
- Saves processed embeddings in Parquet format for efficient querying
The output includes processed embedding files that contain both the original high-dimensional vectors and their reduced representations, together with computed similarity metrics.
For change detection purposes, this step establishes a baseline for each geographic location and calculates deviations from this baseline over time. These deviations, captured as vector distances, provide a robust indicator of surface changes like deforestation, urban development, or natural disasters.
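A simplified sketch of this processing step might look as follows; the Parquet schema, column names, and baseline definition (mean embedding of the earliest observed year per chip) are assumptions for illustration.

```python
# Sketch of the embedding post-processing step: PCA plus baseline cosine similarity.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# One row per (chip_id, date) with a 768-dimensional "embedding" column (assumed schema).
df = pd.read_parquet("embeddings.parquet")
vectors = np.stack(df["embedding"].to_numpy())            # (n_observations, 768)

# Reduced representation used later for visualization and change detection
df["pca_1"] = PCA(n_components=1).fit_transform(vectors)[:, 0]

def similarity_to_baseline(group):
    """Baseline = mean embedding of the earliest observed year for this chip."""
    years = pd.to_datetime(group["date"]).dt.year
    base_vecs = np.stack(group.loc[years == years.min(), "embedding"].to_numpy())
    baseline = base_vecs.mean(axis=0, keepdims=True)       # (1, 768)
    sims = cosine_similarity(np.stack(group["embedding"].to_numpy()), baseline)[:, 0]
    return group.assign(similarity_to_baseline=sims)

df = df.groupby("chip_id", group_keys=False).apply(similarity_to_baseline)
df.to_parquet("processed_embeddings.parquet")
```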
Step 4: Consolidation and vector database integration
The final pipeline step consolidates the processed embeddings into a unified dataset and loads them into vector databases optimized for similarity search. The outputs include consolidated embedding files, GeoJSON grid files for visualization, and configuration files for frontend applications.
The solution supports two vector database options; see the accompanying GitHub repository for the specific engines and setup instructions. Both provide efficient ANN search capabilities, enabling sub-second query performance. The choice between them depends on the scale of the deployment, integration requirements, and operational preferences.
With this robust data processing and embedding generation foundation in place, let's explore the real-world applications enabled by the pipeline, starting with geospatial similarity search.
Geospatial similarity search
Organizations working with Earth observation data have traditionally struggled to efficiently identify specific landscape patterns across large geographic regions. Traditional Earth observation analysis requires specialized models trained on labeled datasets for each target feature. This approach forces organizations into a lengthy process of data collection, annotation, and model training before obtaining results.
In contrast, GeoFM-powered similarity search converts satellite imagery into 768-dimensional vector embeddings that capture the semantic essence of landscape features, eliminating the need for manual feature engineering and the computation of specialized indices like NDVI or NDWI.
This capability uses the Clay foundation model's pre-training on diverse global landscapes to understand complex relationships between features without explicit programming. The result is an intuitive image-to-image search capability where users can select a reference area—such as early-stage deforestation or wildfire damage—and instantly find similar patterns across vast territories in seconds rather than weeks.
Similarity search implementation
Our implementation provides a streamlined workflow for finding similar geographic regions using the embeddings generated by the data processing pipeline (a brute-force version of this search is sketched after the list). The search process involves:
- Reference area selection: Users select a reference chip representing a search term (for example, a deforested patch, urban development, or an agricultural area)
- Search parameters: Users specify the number of results and a similarity threshold
- Vector search execution: The system retrieves similar chips using cosine similarity between embeddings
- Result visualization: Matching chips are highlighted on the map
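Here is that brute-force sketch, using plain NumPy cosine similarity in place of the vector database's ANN query; the Parquet schema, column names, and default threshold are assumptions.

```python
# Brute-force similarity search over stored chip embeddings (vector DB does this via ANN).
import numpy as np
import pandas as pd

df = pd.read_parquet("processed_embeddings.parquet")        # assumed schema
matrix = np.stack(df["embedding"].to_numpy())                # (n_chips, 768)
matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def find_similar(reference_chip_id, top_k=25, threshold=0.85):
    """Return the top_k chips most similar to the reference chip."""
    ref_pos = np.flatnonzero(df["chip_id"].to_numpy() == reference_chip_id)[0]
    scores = matrix @ matrix[ref_pos]                         # cosine similarity
    order = np.argsort(scores)[::-1][: top_k + 1]             # +1 to account for the query itself
    hits = df.iloc[order].assign(similarity=scores[order])
    return hits[(hits["chip_id"] != reference_chip_id) & (hits["similarity"] >= threshold)]

# Example usage: matches = find_similar("chip_000123", top_k=50, threshold=0.8)
```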
Let's dive deeper into a real-world application, taking our running example of detecting deforestation in the Mato Grosso region of the Brazilian Amazon. Traditional monitoring approaches often detect forest loss too late—after significant damage has already occurred. The Clay-powered similarity search capability offers a new approach by enabling early detection of emerging deforestation patterns before they expand into large-scale clearing operations.
Using a single reference chip showing the initial signs of forest degradation—such as selective logging, small clearings, or new access roads—analysts can instantly identify similar patterns across vast areas of the Amazon rainforest. As demonstrated in the following example images, the system effectively recognizes the subtle signatures of early-stage deforestation based on a single reference image. This capability allows environmental protection agencies and conservation organizations to deploy resources precisely, improving anti-deforestation efforts by addressing threats before they lead to major forest loss. While a single reference chip led to good results in our examples, alternative approaches exist, such as an average-vector technique, which leverages embeddings from multiple reference images to enhance the similarity search results.
Ecosystem change detection
Unlike vector-based similarity search, change detection focuses on measuring the distance between embedding vectors over time, the core assumption being that the more distant embedding vectors are from each other, the more dissimilar the underlying satellite imagery is. Applied to a single region over time, this lets you pinpoint so-called change points—periods where significant and long-lasting change in surface conditions occurred.
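As a minimal sketch of this idea, the snippet below computes the cosine distance between consecutive embeddings of a single chip and flags unusually large jumps; the random data and the 3-sigma rule are illustrative assumptions, not the method used later in this post.

```python
# Consecutive cosine distances for one chip's embedding time series.
import numpy as np

def consecutive_cosine_distances(embeddings):
    """embeddings: (n_observations, 768) array, ordered by observation date."""
    a, b = embeddings[:-1], embeddings[1:]
    cosine = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return 1.0 - cosine                      # distance between consecutive observations

series = np.random.rand(40, 768)             # stand-in for one chip's embedding time series
distances = consecutive_cosine_distances(series)
# Flag unusually large jumps as candidate change points (3-sigma rule as an example)
candidates = np.flatnonzero(distances > distances.mean() + 3 * distances.std())
```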
Our solution implements a timeline view of Sentinel-2 satellite observations from 2018 to the present. Each observation point corresponds to a unique satellite image, allowing for detailed temporal analysis. While the embedding vectors are high-dimensional, we use the previously computed PCA (and optionally t-SNE) to reduce dimensionality to a single dimension for visualization purposes.
Let's review a compelling example from our analysis of deforestation in the Amazon. The following image is a time-series plot of geospatial embeddings (first principal component) for a single 256×256 pixel chip. Cloudy images and major outliers have been removed.
Points clustered closely on the y-axis indicate similar surface conditions; sudden and persistent discontinuities in the embedding values signal significant change. Here's what the analysis shows:
- Stable forest conditions from 2018 through 2020
- A significant discontinuity in embedding values during 2021. Closer review of the underlying satellite imagery shows clear evidence of forest clearing and conversion to agricultural fields
- Further transformation visible in 2024 imagery
Naturally, we need a way to automate the process of change detection so that it can be applied at scale. Given that we don't typically have extensive changepoint training datasets, we need an unsupervised approach that works without labeled data. The intuition behind unsupervised change detection is the following: identify what normal looks like, then highlight large enough deviations from normal and flag them as change points; after a change point has occurred, characterize the new normal and repeat the process.
The following function performs harmonic regression analysis on the embeddings time-series data, specifically designed to model yearly seasonality patterns. The function fits a harmonic regression with a specified frequency (default 12 months for annual patterns) to the embedding data of a baseline period (the year 2018 in this example). It then generates predictions and calculates error metrics (absolute and percentage deviations). Large deviations from the normal seasonal pattern indicate change and can be automatically flagged using thresholding.
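The repository contains the full implementation; the following is a hedged sketch of what such a function can look like, with the value column (first principal component), the date column, and the 2018 baseline period as assumptions.

```python
# Sketch of a harmonic-regression baseline for embedding-derived time series.
import numpy as np
import pandas as pd

def harmonic_baseline_deviation(ts, value_col="pca_1", baseline_year=2018):
    """Fit an annual harmonic to the baseline year, then score deviations over time.

    ts: DataFrame with a datetime 'date' column and one embedding-derived value column.
    """
    date = pd.to_datetime(ts["date"])
    t = date.dt.year + date.dt.dayofyear / 365.25             # fractional year
    X = np.column_stack([np.ones(len(ts)),
                         np.sin(2 * np.pi * t),
                         np.cos(2 * np.pi * t)])               # one cycle per year
    y = ts[value_col].to_numpy()

    base = (date.dt.year == baseline_year).to_numpy()
    coef, *_ = np.linalg.lstsq(X[base], y[base], rcond=None)   # fit on the baseline period only

    pred = X @ coef
    abs_dev = np.abs(y - pred)
    pct_dev = abs_dev / (np.abs(pred) + 1e-9)
    return ts.assign(prediction=pred, abs_deviation=abs_dev, pct_deviation=pct_dev)

# Change is flagged where the deviation exceeds a tuned threshold:
# flagged = result[result["abs_deviation"] > threshold]
```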
When applied to the chips across an area of observation, with a threshold defined on the maximum deviation from the fitted harmonic regression, we can automatically map change intensity, allowing analysts to quickly zoom in on problematic areas.
While this method performs well in our analyses, it is also quite rigid in that it requires careful tuning of error thresholds and the definition of a baseline period. More sophisticated approaches are available, ranging from general-purpose time-series analyses that automate the baseline definition and change point detection using recursive methods (for example, Gaussian Processes) to specialized algorithms for geospatial change detection (for example, LandTrendr and Continuous Change Detection and Classification (CCDC)).
In sum, our approach to change detection demonstrates the power of geospatial embedding vectors in monitoring environmental changes over time, providing valuable insights for land use monitoring, environmental protection, and urban planning applications.
GeoFM fine-tuning for your custom use case
Fine-tuning is a specific implementation of transfer learning, in which a pre-trained foundation model is adapted to specific tasks through targeted additional training on specialized labeled datasets. For GeoFMs, these specific tasks can target agriculture, disaster monitoring, or urban analysis. The model retains its broad spatial understanding while developing expertise for particular regions, ecosystems, or analytical tasks. This approach significantly reduces computational and data requirements compared to building specialized models from scratch, without sacrificing accuracy. Fine-tuning typically involves preserving the pre-trained Clay encoder—which has already learned rich representations of spectral patterns, spatial relationships, and temporal dynamics from vast amounts of satellite imagery—while attaching and training a specialized task-specific head.
For pixel-wise prediction tasks—such as land use segmentation—the specialized head is typically a decoder architecture, whereas for class-level outputs (classification tasks) the head can be as basic as a multilayer perceptron network. Training focuses exclusively on the new decoder, which takes the feature representations from the model's frozen encoder and gradually transforms them back into full-resolution images where each pixel is classified according to its land use type.
The segmentation framework combines the powerful pre-trained Clay encoder with an efficient convolutional decoder, taking Clay's rich understanding of satellite imagery and converting it into detailed land use maps. The lightweight decoder features convolutional layers and pixel-shuffle upsampling techniques that take the feature representations from Clay's frozen encoder and gradually transform them back into full-resolution images where each pixel is classified according to its land use type. By freezing the encoder (with its 24 transformer blocks and 16 attention heads) and only training the compact decoder, the model achieves a good balance between computational efficiency and segmentation accuracy.
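The sketch below illustrates a decoder head of this kind in PyTorch: convolutional layers followed by pixel-shuffle upsampling that maps a 16×16 grid of patch tokens back to a 256×256 class map. The channel sizes, number of classes, and patch geometry are assumptions, not the exact repository implementation.

```python
# Lightweight segmentation head on top of a frozen encoder (assumed 16x16 patch grid).
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Convolutional decoder with pixel-shuffle upsampling for patch tokens."""

    def __init__(self, embed_dim=768, num_classes=9, patch_grid=16):
        super().__init__()
        self.patch_grid = patch_grid
        self.decoder = nn.Sequential(
            nn.Conv2d(embed_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes * patch_grid * patch_grid, kernel_size=1),
            nn.PixelShuffle(patch_grid),       # 16x16 token grid -> 256x256 pixel map
        )

    def forward(self, patch_tokens):            # patch_tokens: (batch, num_patches, embed_dim)
        b, n, d = patch_tokens.shape
        grid = patch_tokens.transpose(1, 2).reshape(b, d, self.patch_grid, self.patch_grid)
        return self.decoder(grid)               # (batch, num_classes, 256, 256)

# The encoder stays frozen; only the decoder head receives gradients:
# for p in encoder.parameters():
#     p.requires_grad = False
```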
We applied this segmentation architecture to a labeled land use/land cover (LULC) dataset from Impact Observatory, hosted on the Amazon Registry of Open Data. For illustrative purposes, we again focused on our running example from Brazil's Mato Grosso region. We trained the decoder head for 10 epochs, which took 17 minutes in total, and tracked intersection over union (IoU) and F1 score as segmentation accuracy metrics. After only one training epoch, the model already achieved 85.7% validation IoU. With the full 10 epochs completed, performance increased to an impressive 92.4% IoU and 95.6% F1 score. In the following image, we show ground truth satellite imagery (upper) and the model's predictions (lower). The visual comparison highlights how accurately this approach can classify different land use categories.
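A training loop for this setup can be as simple as the sketch below, which assumes the SegmentationHead from the previous snippet, patch tokens from a frozen encoder, and torchmetrics for IoU and F1 tracking; data loading and the number of LULC classes are placeholders.

```python
# Fine-tuning step with a frozen encoder, trainable head, and IoU/F1 tracking.
import torch
import torchmetrics

NUM_CLASSES = 9                                     # placeholder for the number of LULC classes
head = SegmentationHead(num_classes=NUM_CLASSES)    # from the previous sketch
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
iou = torchmetrics.classification.MulticlassJaccardIndex(num_classes=NUM_CLASSES)
f1 = torchmetrics.classification.MulticlassF1Score(num_classes=NUM_CLASSES)

def train_step(patch_tokens, labels):
    """patch_tokens: frozen-encoder output (B, N, 768); labels: (B, 256, 256) class indices."""
    logits = head(patch_tokens)                     # (B, NUM_CLASSES, 256, 256)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    preds = logits.argmax(dim=1)
    iou.update(preds, labels)
    f1.update(preds, labels)
    return loss.item()

# After each epoch: log iou.compute() and f1.compute(), then reset both metrics.
```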
Conclusion
Novel GeoFMs provide an encouraging new approach to geospatial analytics. Through their extensive pre-training, these models have incorporated a deep implicit understanding of geospatial data and can be used out-of-the-box for high-impact use cases such as similarity search or change detection. They can also serve as the basis for specialized models through a fine-tuning process that is significantly less data-hungry (fewer labeled data points needed) and has lower compute requirements than training a model from scratch.
In this post, we have shown how you can deploy a state-of-the-art GeoFM (Clay) on AWS and have explored one specific use case – monitoring deforestation in the Amazon rainforest – in greater detail. The same approach is applicable to a large variety of industry use cases. For example, insurance companies can use a similar approach to ours to assess damage after natural disasters, including hurricanes, floods, or fires, and keep track of their insured assets. Agricultural organizations can use GeoFMs for crop type identification, crop yield prediction, or other use cases. We also envision high-impact use cases in industries like urban planning, emergency and disaster response, supply chain and global trade, sustainability and environmental modeling, and many others. To get started applying GeoFMs to your own earth observation use case, check out the accompanying GitHub repository, which has the prerequisites and a step-by-step walkthrough to run it on your own area of interest.
About the Authors
Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.
Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and Geospatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers' journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.
Dr. Iza Moise is a Senior Machine Learning (ML) Prototyping Architect at AWS, with expertise in both traditional ML and advanced techniques like foundation models and vision transformers. She focuses on applied ML across diverse scientific fields, publishing and reviewing at Amazon's internal ML conferences. Her strength lies in translating theoretical advances into practical solutions that deliver measurable impact through thoughtful implementation.