Query Pack Data with DuckDB
Spatial Packs store vector layers as GeoParquet files — a columnar format that DuckDB reads natively. No import step, no database server, no data copying. Point DuckDB at a .parquet file and start querying with SQL.
This recipe shows how to query, filter, aggregate, and export data from Spatial Pack layers using DuckDB.
Prerequisites
- DuckDB installed (CLI or Python library)
- A built Spatial Pack with GeoParquet layers (`.parquet` files in the `layers/` directory)
1. Open DuckDB and load the spatial extension
DuckDB ships with a spatial extension for geometry operations. Install and load it before running spatial queries.
```sql
-- Start DuckDB (in-memory), then install and load the spatial extension
INSTALL spatial;
LOAD spatial;
```

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL spatial")
conn.execute("LOAD spatial")
```

What just happened: You installed the spatial extension, which adds geometry types and spatial functions (like `ST_Area`, `ST_Contains`, and `ST_Distance`) to DuckDB. The extension only needs to be installed once — after that, `LOAD spatial` activates it in each session.
2. Read a GeoParquet file
Query a GeoParquet layer directly using `read_parquet()`. No import or table creation needed.
```sql
-- Preview the first 5 rows
SELECT *
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;
```

```python
result = conn.execute("""
    SELECT *
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    LIMIT 5
""").fetchdf()
print(result)
```

Expected output:
```
┌────────────┬─────────────┬──────────┬────────────────────────────┐
│ lot_number │ plan_number │ area_m2  │ geometry                   │
│  varchar   │   varchar   │  double  │          geometry          │
├────────────┼─────────────┼──────────┼────────────────────────────┤
│ 101        │ P-23456     │   845.20 │ POLYGON ((115.86 -31.95... │
│ 102        │ P-23456     │   912.50 │ POLYGON ((115.86 -31.95... │
│ 201        │ P-34567     │  1205.00 │ POLYGON ((115.87 -31.94... │
│ 202        │ P-34567     │   678.30 │ POLYGON ((115.87 -31.94... │
│ 301        │ P-45678     │  2340.10 │ POLYGON ((115.88 -31.93... │
└────────────┴─────────────┴──────────┴────────────────────────────┘
```

What just happened: DuckDB read the Parquet file's metadata to discover columns and types, then fetched only the first 5 rows. The `geometry` column contains WKB-encoded geometries that the spatial extension renders as WKT in the output.
3. Filter with SQL WHERE clauses
Use standard SQL to filter features by attribute values.
```sql
-- Find large lots (over 2000 square meters)
SELECT lot_number, area_m2
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
WHERE area_m2 > 2000
ORDER BY area_m2 DESC;
```

```python
large_lots = conn.execute("""
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 2000
    ORDER BY area_m2 DESC
""").fetchdf()
print(f"Found {len(large_lots)} large lots")
```

What just happened: DuckDB pushed the filter into the Parquet reader, scanning only the `lot_number` and `area_m2` columns and skipping row groups where `area_m2` is below the threshold. Columnar storage makes this efficient — columns you do not select are never read from disk.
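The pushdown can be verified from the query plan. A quick check (the exact plan layout and operator names vary across DuckDB versions):

```sql
-- Show the plan; look for the filter condition and the projected
-- columns appearing inside the Parquet scan operator
EXPLAIN
SELECT lot_number, area_m2
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
WHERE area_m2 > 2000;
```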
4. Aggregate data
Run aggregate functions to summarize a layer.
```sql
-- Summary statistics for the cadastre layer
SELECT
    count(*) AS total_lots,
    round(avg(area_m2), 1) AS avg_area,
    round(min(area_m2), 1) AS min_area,
    round(max(area_m2), 1) AS max_area,
    round(sum(area_m2) / 10000, 1) AS total_hectares
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet');
```

```python
stats = conn.execute("""
    SELECT
        count(*) AS total_lots,
        round(avg(area_m2), 1) AS avg_area,
        round(sum(area_m2) / 10000, 1) AS total_hectares
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
""").fetchone()
print(f"Total lots: {stats[0]}, Avg area: {stats[1]} m2, Total: {stats[2]} ha")
```

Expected output:
```
┌────────────┬──────────┬──────────┬──────────┬────────────────┐
│ total_lots │ avg_area │ min_area │ max_area │ total_hectares │
│   int64    │  double  │  double  │  double  │     double     │
├────────────┼──────────┼──────────┼──────────┼────────────────┤
│      12847 │   1052.3 │     45.2 │ 125840.0 │        13519.2 │
└────────────┴──────────┴──────────┴──────────┴────────────────┘
```

What just happened: DuckDB computed summary statistics across all 12,847 features without loading the full dataset into memory. The columnar layout means only the `area_m2` column was read for these aggregations.
5. Run spatial queries
Use spatial functions to analyze geometry properties.
```sql
-- Compute area from geometry (in square degrees, approximate)
SELECT
    lot_number,
    area_m2,
    round(ST_Area(geometry), 8) AS geom_area,
    ST_GeometryType(geometry) AS geom_type
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;
```

```python
spatial = conn.execute("""
    SELECT lot_number,
           ST_Area(geometry) AS geom_area,
           ST_GeometryType(geometry) AS geom_type
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    LIMIT 5
""").fetchdf()
print(spatial)
```

What just happened: The `ST_Area` function computed the area of each geometry in the coordinate system's units (square degrees for WGS84). For accurate area in square meters, use a projected CRS or the pre-computed `area_m2` column that the Pipeline's `metrics.compute` stage adds.
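Reprojecting before measuring is one way to get metric areas directly from geometry. A sketch assuming the spatial extension is loaded; EPSG:28350 (a metric CRS covering the Perth-area coordinates in the example output) is an assumption, so substitute the projected CRS that matches your data:

```sql
-- Reproject to a metric CRS, then measure in square meters
-- (EPSG:28350 is an assumption; pick the CRS that fits your data)
SELECT lot_number,
       round(ST_Area(ST_Transform(geometry, 'EPSG:4326', 'EPSG:28350')), 1)
           AS area_m2_projected
FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
LIMIT 5;
```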
6. Join multiple layers
Query across two pack layers by joining on spatial relationships or shared attributes.
```sql
-- Count lots intersecting each road segment
SELECT r.road_name, count(c.lot_number) AS nearby_lots
FROM read_parquet('./my-solar-pack/layers/roads.parquet') r
JOIN read_parquet('./my-solar-pack/layers/cadastre.parquet') c
  ON ST_Intersects(r.geometry, c.geometry)
GROUP BY r.road_name
ORDER BY nearby_lots DESC
LIMIT 10;
```

```python
joined = conn.execute("""
    SELECT r.road_name, count(c.lot_number) AS nearby_lots
    FROM read_parquet('./my-solar-pack/layers/roads.parquet') r
    JOIN read_parquet('./my-solar-pack/layers/cadastre.parquet') c
      ON ST_Intersects(r.geometry, c.geometry)
    GROUP BY r.road_name
    ORDER BY nearby_lots DESC
    LIMIT 10
""").fetchdf()
print(joined)
```

What just happened: DuckDB joined two GeoParquet files using a spatial intersection predicate. This finds which cadastre lots intersect each road segment — useful for access analysis in site selection workflows.
7. Export results
Save query results to CSV or JSON for use in other tools.
```sql
-- Export to CSV
COPY (
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 5000
) TO './large-lots.csv' (HEADER, DELIMITER ',');

-- Export to JSON
COPY (
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 5000
) TO './large-lots.json' (FORMAT JSON);
```

```python
# Export to CSV
conn.execute("""
    COPY (
        SELECT lot_number, area_m2
        FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
        WHERE area_m2 > 5000
    ) TO './large-lots.csv' (HEADER, DELIMITER ',')
""")

# Or use pandas
df = conn.execute("""
    SELECT lot_number, area_m2
    FROM read_parquet('./my-solar-pack/layers/cadastre.parquet')
    WHERE area_m2 > 5000
""").fetchdf()
df.to_csv("./large-lots.csv", index=False)
```

What just happened: DuckDB wrote the filtered query results to a file. The `COPY` command streams results directly to disk without loading the full result set into memory, making it efficient for large exports.
Next steps
- Data Formats — Why GeoParquet, PMTiles, and H3 indexing
- Convert Shapefiles to GeoParquet — Create GeoParquet files from your existing data
- Build a Pipeline from YAML — Automate pack builds with metric computation and tile generation