
Query Pack Data with DuckDB

Spatial Packs store vector layers as GeoParquet files — a columnar format that DuckDB reads natively. No import step, no database server, no data copying. Point DuckDB at a .parquet file and start querying with SQL.

This recipe shows how to query, filter, aggregate, and export data from Spatial Pack layers using DuckDB.

Prerequisites

  • DuckDB installed (CLI or Python library)
  • A built Spatial Pack with GeoParquet layers (.parquet files in the layers/ directory)

1. Open DuckDB and load the spatial extension


DuckDB ships with a spatial extension for geometry operations. Install and load it before running spatial queries.

-- Start the DuckDB CLI (running `duckdb` with no arguments opens an in-memory database),
-- then install and load the spatial extension
INSTALL spatial;
LOAD spatial;

What just happened: You installed the spatial extension, which adds geometry types and spatial functions (like ST_Area, ST_Contains, and ST_Distance) to DuckDB. The extension only needs to be installed once — after that, LOAD spatial activates it in each session.
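You can confirm the extension's status at any time with DuckDB's built-in duckdb_extensions() table function, shown here filtered to the spatial extension:

-- Check that spatial is installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'spatial';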

2. Query a layer

Query a GeoParquet layer directly using read_parquet(). No import or table creation needed.

-- Preview the first 5 rows
SELECT * FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;

Expected output:

┌────────────┬─────────────┬──────────┬────────────────────────────┐
│ lot_number │ plan_number │ area_m2  │          geometry          │
│  varchar   │   varchar   │  double  │          geometry          │
├────────────┼─────────────┼──────────┼────────────────────────────┤
│ 101        │ P-23456     │   845.20 │ POLYGON ((115.86 -31.95... │
│ 102        │ P-23456     │   912.50 │ POLYGON ((115.86 -31.95... │
│ 201        │ P-34567     │  1205.00 │ POLYGON ((115.87 -31.94... │
│ 202        │ P-34567     │   678.30 │ POLYGON ((115.87 -31.94... │
│ 301        │ P-45678     │  2340.10 │ POLYGON ((115.88 -31.93... │
└────────────┴─────────────┴──────────┴────────────────────────────┘

What just happened: DuckDB read the Parquet file’s metadata to discover columns and types, then fetched only the first 5 rows. The geometry column contains WKB-encoded geometries that the spatial extension renders as WKT in the output.
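If you plan to run several queries against the same layer, it can be convenient to wrap the file in a view so the path appears only once (a small sketch; the view name cadastre is arbitrary):

-- Create a view over the GeoParquet file
CREATE VIEW cadastre AS
SELECT * FROM read_parquet('./my-solar-pack/layers/cadastre.parquet');

-- Subsequent queries can reference the view
SELECT count(*) FROM cadastre;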

3. Filter by attributes

Use standard SQL to filter features by attribute values.

-- Find large lots (over 2000 square meters)
SELECT lot_number, area_m2
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
WHERE area_m2 > 2000
ORDER BY area_m2 DESC;

What just happened: DuckDB pushed the filter and projection into the Parquet reader, reading only the lot_number and area_m2 columns and skipping row groups whose min/max statistics show no values above the threshold. Columnar storage makes this efficient — columns you do not select are never read from disk.
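You can verify the pushdown yourself with EXPLAIN, which prints DuckDB's query plan; look for the filter and the projected column list attached to the Parquet scan operator (exact plan output varies by DuckDB version):

-- Inspect the query plan
EXPLAIN
SELECT lot_number, area_m2
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
WHERE area_m2 > 2000;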

4. Aggregate and summarize

Run aggregate functions to summarize a layer.

-- Summary statistics for the cadastre layer
SELECT
  count(*) AS total_lots,
  round(avg(area_m2), 1) AS avg_area,
  round(min(area_m2), 1) AS min_area,
  round(max(area_m2), 1) AS max_area,
  round(sum(area_m2) / 10000, 1) AS total_hectares
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet');

Expected output:

┌────────────┬──────────┬──────────┬──────────┬────────────────┐
│ total_lots │ avg_area │ min_area │ max_area │ total_hectares │
│   int64    │  double  │  double  │  double  │     double     │
├────────────┼──────────┼──────────┼──────────┼────────────────┤
│      12847 │   1052.3 │     45.2 │ 125840.0 │        13519.2 │
└────────────┴──────────┴──────────┴──────────┴────────────────┘

What just happened: DuckDB computed summary statistics across all 12,847 features without loading the full dataset into memory. The columnar layout means only the area_m2 column was read for these aggregations.
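Aggregates can also be grouped. For example, summarizing lots per plan (a sketch reusing the columns shown above):

-- Lot counts and total area per plan
SELECT
  plan_number,
  count(*) AS lots,
  round(sum(area_m2) / 10000, 1) AS hectares
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
GROUP BY plan_number
ORDER BY hectares DESC
LIMIT 10;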

5. Analyze geometry with spatial functions

Use spatial functions to analyze geometry properties.

-- Compute area from geometry (in square degrees, approximate)
SELECT
  lot_number,
  area_m2,
  round(ST_Area(geometry), 8) AS geom_area,
  ST_GeometryType(geometry) AS geom_type
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;

What just happened: The ST_Area function computed the area of each geometry in the coordinate system’s units (square degrees for WGS84). For accurate area in square meters, use a projected CRS or the pre-computed area_m2 column that the Pipeline’s metrics.compute stage adds.
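If you do need metric areas straight from the geometry, the spatial extension's ST_Transform can reproject on the fly. The sketch below assumes the layer is WGS84 (EPSG:4326) and reprojects to EPSG:28350 (GDA94 / MGA zone 50, which covers the Perth-area coordinates shown above) — substitute the projected CRS appropriate for your data:

-- Reproject, then measure in square meters
SELECT
  lot_number,
  round(ST_Area(ST_Transform(geometry, 'EPSG:4326', 'EPSG:28350')), 1) AS area_m2_projected
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;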

6. Join layers

Query across two pack layers by joining on spatial relationships or shared attributes.

-- Count lots intersecting each road segment
SELECT
  r.road_name,
  count(c.lot_number) AS nearby_lots
FROM read_parquet('./my-solar-pack/layers/roads.parquet') r
JOIN read_parquet('./my-solar-pack/layers/cadastre.parquet') c
  ON ST_Intersects(r.geometry, c.geometry)
GROUP BY r.road_name
ORDER BY nearby_lots DESC
LIMIT 10;

What just happened: DuckDB joined two GeoParquet files using a spatial intersection predicate. This finds which cadastre lots intersect each road segment — useful for access analysis in site selection workflows.
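A strict intersection only counts lots that actually touch a road. To count lots within a buffer distance instead, ST_DWithin is a common variant — note the distance is in the layer's CRS units, so for WGS84 data it is degrees (roughly 0.001° ≈ 100 m at this latitude; an approximation, not an exact metric buffer):

-- Lots within roughly 100 m of each road
SELECT
  r.road_name,
  count(c.lot_number) AS nearby_lots
FROM read_parquet('./my-solar-pack/layers/roads.parquet') r
JOIN read_parquet('./my-solar-pack/layers/cadastre.parquet') c
  ON ST_DWithin(r.geometry, c.geometry, 0.001)
GROUP BY r.road_name
ORDER BY nearby_lots DESC
LIMIT 10;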

7. Export results

Save query results to CSV or JSON for use in other tools.

-- Export to CSV
COPY (
  SELECT lot_number, area_m2
  FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
  WHERE area_m2 > 5000
) TO './large-lots.csv' (HEADER, DELIMITER ',');

-- Export to JSON
COPY (
  SELECT lot_number, area_m2
  FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
  WHERE area_m2 > 5000
) TO './large-lots.json' (FORMAT JSON);

What just happened: DuckDB wrote the filtered query results to a file. The COPY command streams results directly to disk without loading the full result set into memory, making it efficient for large exports.
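CSV and JSON drop the geometry column's spatial meaning. To keep geometries for GIS tools, the spatial extension can also write through GDAL — for example to GeoJSON (a sketch; available drivers depend on your DuckDB build):

-- Export features with geometry to GeoJSON
COPY (
  SELECT lot_number, area_m2, geometry
  FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
  WHERE area_m2 > 5000
) TO './large-lots.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON');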