Feature Engineering with Geospatial Data: Binned Statistics in GEE

In quantitative research, generating novel features from alternative datasets is often a primary source of identifying variation and predictive performance. Geospatial data, such as weather patterns and agricultural metrics, provides a rich source for these signals. A common feature engineering task is to aggregate one spatial dataset by the discrete bins of another—for example, calculating the total crop area (a stock) exposed to different temperature regimes (a flow). This process creates a structured feature set from unstructured spatial data.

A direct iteration over each bin is computationally prohibitive for large datasets (it can take hours for one set of a few hundred images of 0.25 degree resolution). The efficient solution in Google Earth Engine is to use a grouped reducer (ee.Reducer.sum().group()). This method effectively vectorizes the computation, performing a single-pass analysis over the entire set of images.

The workflow is straightforward:

Value Image: An image where each pixel represents the continuous variable to be aggregated (e.g., soybean area in km²).
Group Image: A categorical image where each pixel value is an integer corresponding to a bin (e.g., 0 for temp < 22°C, 1 for temp 22-28°C).
Execution: A single reduceRegion call calculates the sum of the value image for each category in the group image.

Implementation Example: Crop Exposure Feature

The following MWE provides a template for this feature engineering task. It calculates the area of soybeans exposed to different temperature bins using USDA CDL (crop data) and ERA5 (weather data). The output can be used to construct time-series features for causal analysis or predictive models.


import ee

# Authenticate and initialize Earth Engine.
# ee.Authenticate()
ee.Initialize(project="your-project-id")

# --- Define Common Parameters ---
# Define a region of interest (e.g., a part of the US Corn Belt).
geometry = ee.Geometry.Rectangle([-96, 41, -93, 43]) # Iowa, USA

# Define the target projection using the ERA5 weather dataset.
# We will align our crop data to this grid.
ERA5_PROJECTION = ee.ImageCollection('ECMWF/ERA5/DAILY').first().projection()


# --- 1. Create the "Value" Image (Crop Area in km²) ---
# Load the Cropland Data Layer (CDL) and create a binary mask for soybeans (code 5).
cdl = ee.Image('USDA/NASS/CDL/2022').select('cropland')
soybean_mask = cdl.eq(5)

# Resample the 30m crop mask to the coarse ERA5 grid, calculating the
# soybean fraction within each ERA5 pixel.
soybean_fraction = soybean_mask.reduceResolution(
    reducer=ee.Reducer.mean(),
    maxPixels=1024
).reproject(crs=ERA5_PROJECTION)

# Convert fraction to area in km² to create the final value image.
crop_area_image = soybean_fraction.multiply(ee.Image.pixelArea().divide(1e6)).rename('crop_area')


# --- 2. Create the "Group" Image (Temperature Bins) ---
# Load historical temperature data for a specific day.
era5_image = ee.ImageCollection('ECMWF/ERA5/DAILY').filterDate('2022-07-20').first()

# Convert from Kelvin to Celsius.
temp_celsius = era5_image.select('mean_2m_air_temperature').subtract(273.15).rename('temperature')

# Classify the temperature image into integer bins.
bin_image = ee.Image(-1).rename('temp_bin') # Default value
bin_image = bin_image.where(temp_celsius.lt(22), 0) # Bin 0
bin_image = bin_image.where(temp_celsius.gte(22).And(temp_celsius.lt(28)), 1) # Bin 1
bin_image = bin_image.where(temp_celsius.gte(28), 2) # Bin 2


# --- 3. Combine and Reduce ---
# Combine the two bands and execute the grouped reducer.
combined_image = crop_area_image.addBands(bin_image)
stats = combined_image.reduceRegion(
    reducer=ee.Reducer.sum().group(
        groupField=1, # Group by the second band ('temp_bin').
        groupName='bin_id',
    ),
    geometry=geometry,
    scale=ERA5_PROJECTION.nominalScale(),
    maxPixels=1e10
)

# --- 4. Print the Result ---
# The output is a dictionary, ready for export.
print(stats.getInfo())
# Example output: {'groups': [{'bin_id': 1, 'sum': 13589.73}, {'bin_id': 2, 'sum': 24781.51}]}

Note: Instead of pre-calculating the crop area and running the reduce on that, it is common practice to run the reducer group directly on the categorical crop raster and it will sum up the number of pixels with 1’s for each temperature bin, weighted by pixel area. However, there are two reasons I pre-calculate the crop area directly. First, if we leave the crop image at it’s original resolution, we need to repeat the same expensive operation over each image: calculating the fraction of crop area in each pixel of the lower resolution temperature image, then multiplying that fraction by each pixel’s area (which is different due to the data’s projection). But the fraction of crop area in each pixel of the temperature image does not change with different temperature images, so we can save many expensive operations by pre-calculating a lower-resolution crop area image. In fact, precomputing this crop area image outside the map loop converted a many-hour google earth engine task into a 10-second task for a set of a few hundred images! This a good example of when the earth engine code interpreter is not as smart as we want it to be — we need to be vigilant and test alternative orders of our operations.

Second, the CDL dataset is at a high resolution (30m) compared to the ERA5 weather data (~31km), and the reducer cannot directly process the number of high-resolution categorical crop pixels within each temperature pixel easily. In fact, in the example above, the soybean_fraction actually needs to be calculated over multiple reduceResolution calls. You can read more about the issue of aggregating from high resolution to low resolution here.

Integration into a Research Project

The utility of this method is in its scalability. For a full analysis, this calculation would be mapped over a time-series of daily weather data. This procedure generates a panel dataset where each row represents a date, and the columns represent the total crop area within each pre-defined temperature bin. These columns then serve as robust predictor variables in a model designed to forecast characteristics of agricultural commodities or other related outcomes.

Have a complicated geospatial task that you want to implement in Google Earth Engine but haven’t been able to figure out how to do it efficiently? I might have done it before, so feel free to email me with a description or check out my Earth Engine Guide (work in progress, so apologies for the formatting). There are a few niche tricks to speed up GEE that are not well documented other places on the internet.

Feature Engineering with Geospatial Data: Binned Statistics in GEE

Implementation Example: Crop Exposure Feature

Integration into a Research Project

Published by acwatt

Leave a comment Cancel reply

Implementation Example: Crop Exposure Feature

Integration into a Research Project

Share this:

Related

Published by acwatt

Leave a comment Cancel reply