Query Pack Data with DuckDB
Spatial Packs store vector layers as GeoParquet files — a columnar format that DuckDB reads natively. No import step, no database server, no data copying. Point DuckDB at a .parquet file and start querying with SQL.
This recipe shows how to query, filter, aggregate, and export data from Spatial Pack layers using DuckDB.
Prerequisites
- DuckDB installed (CLI or Python library)
- A built Spatial Pack with GeoParquet layers (`.parquet` files in the `layers/` directory)
1. Open DuckDB and load the spatial extension
DuckDB ships with a spatial extension for geometry operations. Install and load it before running spatial queries.
```sql
-- Start DuckDB (in-memory), then install and load the spatial extension
INSTALL spatial;
LOAD spatial;
```

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL spatial")
conn.execute("LOAD spatial")
```

What just happened: You installed the spatial extension, which adds geometry types and spatial functions (like `ST_Area`, `ST_Contains`, and `ST_Distance`) to DuckDB. The extension only needs to be installed once — after that, `LOAD spatial` activates it in each session.
2. Read a GeoParquet file
Query a GeoParquet layer directly using `read_parquet()`. No import or table creation needed.
```sql
-- Preview the first 5 rows
SELECT *
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;
```

```python
result = conn.execute("""
    SELECT *
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    LIMIT 5
""").fetchdf()
print(result)
```

Expected output:
```
┌────────────┬─────────────┬──────────┬────────────────────────────┐
│ lot_number │ plan_number │ area_m2  │ geometry                   │
│  varchar   │   varchar   │  double  │          geometry          │
├────────────┼─────────────┼──────────┼────────────────────────────┤
│ 101        │ P-23456     │   845.20 │ POLYGON ((115.86 -31.95... │
│ 102        │ P-23456     │   912.50 │ POLYGON ((115.86 -31.95... │
│ 201        │ P-34567     │  1205.00 │ POLYGON ((115.87 -31.94... │
│ 202        │ P-34567     │   678.30 │ POLYGON ((115.87 -31.94... │
│ 301        │ P-45678     │  2340.10 │ POLYGON ((115.88 -31.93... │
└────────────┴─────────────┴──────────┴────────────────────────────┘
```

What just happened: DuckDB read the Parquet file's metadata to discover columns and types, then fetched only the first 5 rows. The `geometry` column contains WKB-encoded geometries that the spatial extension renders as WKT in the output.
3. Filter with SQL WHERE clauses
Use standard SQL to filter features by attribute values.
```sql
-- Find large lots (over 2000 square meters)
SELECT lot_number, area_m2
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
WHERE area_m2 > 2000
ORDER BY area_m2 DESC;
```

```python
large_lots = conn.execute("""
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 2000
    ORDER BY area_m2 DESC
""").fetchdf()
print(f"Found {len(large_lots)} large lots")
```

What just happened: DuckDB pushed the filter into the Parquet reader, scanning only the `lot_number` and `area_m2` columns and skipping row groups where `area_m2` is below the threshold. Columnar storage makes this efficient — columns you do not select are never read from disk.
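The pushdown can be verified from the query plan. A quick check (the exact plan layout and operator names vary across DuckDB versions):

```sql
-- Show the plan; look for the filter condition and the projected
-- columns appearing inside the Parquet scan operator
EXPLAIN
SELECT lot_number, area_m2
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
WHERE area_m2 > 2000;
```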
4. Aggregate data
Run aggregate functions to summarize a layer.
```sql
-- Summary statistics for the cadastre layer
SELECT
    count(*) AS total_lots,
    round(avg(area_m2), 1) AS avg_area,
    round(min(area_m2), 1) AS min_area,
    round(max(area_m2), 1) AS max_area,
    round(sum(area_m2) / 10000, 1) AS total_hectares
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet');
```

```python
stats = conn.execute("""
    SELECT
        count(*) AS total_lots,
        round(avg(area_m2), 1) AS avg_area,
        round(sum(area_m2) / 10000, 1) AS total_hectares
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
""").fetchone()
print(f"Total lots: {stats[0]}, Avg area: {stats[1]} m2, Total: {stats[2]} ha")
```

Expected output:
```
┌────────────┬──────────┬──────────┬──────────┬────────────────┐
│ total_lots │ avg_area │ min_area │ max_area │ total_hectares │
│   int64    │  double  │  double  │  double  │     double     │
├────────────┼──────────┼──────────┼──────────┼────────────────┤
│      12847 │   1052.3 │     45.2 │ 125840.0 │        13519.2 │
└────────────┴──────────┴──────────┴──────────┴────────────────┘
```

What just happened: DuckDB computed summary statistics across all 12,847 features without loading the full dataset into memory. The columnar layout means only the `area_m2` column was read for these aggregations.
5. Run spatial queries
Use spatial functions to analyze geometry properties.
```sql
-- Compute area from geometry (in square degrees, approximate)
SELECT
    lot_number,
    area_m2,
    round(ST_Area(geometry), 8) AS geom_area,
    ST_GeometryType(geometry) AS geom_type
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;
```

```python
spatial = conn.execute("""
    SELECT lot_number,
           ST_Area(geometry) AS geom_area,
           ST_GeometryType(geometry) AS geom_type
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    LIMIT 5
""").fetchdf()
print(spatial)
```

What just happened: The `ST_Area` function computed the area of each geometry in the coordinate system's units (square degrees for WGS84). For accurate area in square meters, use a projected CRS or the pre-computed `area_m2` column that the Pipeline's `metrics.compute` stage adds.
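Reprojecting before measuring is one way to get metric areas directly from geometry. A sketch assuming the spatial extension is loaded; EPSG:28350 (a metric CRS covering the Perth-area coordinates in the example output) is an assumption, so substitute the projected CRS that matches your data:

```sql
-- Reproject to a metric CRS, then measure in square meters
-- (EPSG:28350 is an assumption; pick the CRS that fits your data)
SELECT lot_number,
       round(ST_Area(ST_Transform(geometry, 'EPSG:4326', 'EPSG:28350')), 1)
           AS area_m2_projected
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;
```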
6. Join multiple layers
Query across two pack layers by joining on spatial relationships or shared attributes.
```sql
-- Count lots intersecting each road segment
SELECT r.road_name, count(c.lot_number) AS nearby_lots
FROM read_parquet('./my-solar-pack/layers/roads.parquet') r
JOIN read_parquet('./my-solar-pack/layers/cadastre.parquet') c
  ON ST_Intersects(r.geometry, c.geometry)
GROUP BY r.road_name
ORDER BY nearby_lots DESC
LIMIT 10;
```

```python
joined = conn.execute("""
    SELECT r.road_name, count(c.lot_number) AS nearby_lots
    FROM read_parquet('./my-solar-pack/layers/roads.parquet') r
    JOIN read_parquet('./my-solar-pack/layers/cadastre.parquet') c
      ON ST_Intersects(r.geometry, c.geometry)
    GROUP BY r.road_name
    ORDER BY nearby_lots DESC
    LIMIT 10
""").fetchdf()
print(joined)
```

What just happened: DuckDB joined two GeoParquet files using a spatial intersection predicate. This finds which cadastre lots intersect each road segment — useful for access analysis in site selection workflows.
7. Export results
Save query results to CSV or JSON for use in other tools.
```sql
-- Export to CSV
COPY (
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 5000
) TO './large-lots.csv' (HEADER, DELIMITER ',');

-- Export to JSON
COPY (
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 5000
) TO './large-lots.json' (FORMAT JSON);
```

```python
# Export to CSV
conn.execute("""
    COPY (
        SELECT lot_number, area_m2
        FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
        WHERE area_m2 > 5000
    ) TO './large-lots.csv' (HEADER, DELIMITER ',')
""")

# Or use pandas
df = conn.execute("""
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 5000
""").fetchdf()
df.to_csv("./large-lots.csv", index=False)
```

What just happened: DuckDB wrote the filtered query results to a file. The `COPY` command streams results directly to disk without loading the full result set into memory, making it efficient for large exports.
Next steps
- Data Formats — Why GeoParquet, PMTiles, and H3 indexing
- Convert Shapefiles to GeoParquet — Create GeoParquet files from your existing data
- Build a Pipeline from YAML — Automate pack builds with metric computation and tile generation