2D Tiling¶

This guide covers the Rust-accelerated 2D vector tiling pipeline in mudm-tools. You feed it GeoJSON (or read points/polygons straight out of Parquet), it builds a quadtree from min_zoom to max_zoom, clips every feature into tiles, and writes the result either as PBF (Mapbox Vector Tiles) for web viewers or as tiled Parquet for ML pipelines.

The engine is the compiled class StreamingTileGenerator2D from mudm_tools._rs. Thin Python helpers in mudm_tools.tiling2d wrap its output methods and provide readers and a few maintenance utilities.

Where things live

Engine class: from mudm_tools._rs import StreamingTileGenerator2D (canonical import).
Python helpers/readers: from mudm_tools.tiling2d import generate_pbf, read_pbf, generate_parquet, read_parquet, ...
Runnable end-to-end example: python -m mudm_tools.examples.tiling_rust (source: src/mudm_tools/examples/tiling_rust.py).

For the legacy pure-Python tiling modules (mudm2vt, tilewriter, tilereader), see Legacy pipeline. For autodoc API listings, see the Python API reference and the CLI reference.

Quick start¶

The fastest way to produce tiles is to construct a generator, add GeoJSON, and then call a generate_* helper.

PBF (web viewers) | Parquet (ML pipelines) | Command line

from mudm_tools._rs import StreamingTileGenerator2D
from mudm_tools.tiling2d import generate_pbf, read_pbf

# max_zoom and buffer here are EXAMPLE values, not the API defaults
# (defaults are max_zoom=4, buffer=0.0). buffer is in normalized [0,1] space:
# 64 px at MVT extent 4096 => 64/4096.
gen = StreamingTileGenerator2D(min_zoom=0, max_zoom=7, buffer=64 / 4096)

geojson_str = open("data.json").read()
bounds = (0.0, 0.0, 10000.0, 10000.0)   # world bbox: (xmin, ymin, xmax, ymax)
gen.add_geojson(geojson_str, bounds)

# Write the {z}/{x}/{y}.pbf tree + metadata.json under tiles/
n_tiles = generate_pbf(gen, "tiles/", bounds, simplify=True)
print(f"Wrote {n_tiles} PBF tiles")

# Read tiles back (decoded to world coordinates)
features = read_pbf("tiles/", bounds, zoom=0)

from mudm_tools._rs import StreamingTileGenerator2D
from mudm_tools.tiling2d import generate_parquet, read_parquet

gen = StreamingTileGenerator2D(min_zoom=0, max_zoom=7, buffer=64 / 4096)

geojson_str = open("data.json").read()
bounds = (0.0, 0.0, 10000.0, 10000.0)
gen.add_geojson(geojson_str, bounds)

# Default (partitioned=False) => a SINGLE .parquet file with a `zoom` column
n_rows = generate_parquet(gen, "output.parquet", bounds, simplify=True)
print(f"Wrote {n_rows} rows")

# Read with optional zoom/tile filtering (predicate pushdown)
rows = read_parquet("output.parquet", zoom=0)

# Random polygons -> single-file tiled Parquet (tiles_2d.parquet)
uv run python -m mudm_tools.examples.tiling_rust

# Your own GeoJSON, lower max zoom (the script default is 7), Hive-partitioned output
uv run python -m mudm_tools.examples.tiling_rust my_data.json \
    --max-zoom 6 --partitioned

# PBF vector tiles instead of Parquet
uv run python -m mudm_tools.examples.tiling_rust my_data.json --pbf

How the pipeline works

For each feature the generator projects world coordinates into [0,1]² against world_bounds, clips the geometry through the quadtree for every zoom in min_zoom..=max_zoom, and writes Fragment2D records to per-process temp shard files (shard_NNN.mf2d). The generate_* methods then read those fragments back, transform positions to world coordinates as f32, and encode them as MVT and/or Parquet. Because storage is f32 (a 24-bit significand, roughly 7 significant decimal digits), output coordinates are precision-limited: for world values near 10,000, adjacent representable values are about 0.001 apart.
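As a rough illustration of the quadtree addressing described above, the sketch below lists the tile a single normalized point falls into at each zoom. It assumes the usual scheme in which zoom z splits [0,1]² into 2^z × 2^z tiles with no y-axis flip; this guide does not show the engine's actual indexing, and buffers and clipping are ignored. `tiles_for_point` is a hypothetical helper, not part of mudm_tools.

```python
def tiles_for_point(nx, ny, min_zoom, max_zoom):
    """Return (z, tile_x, tile_y) for each zoom in min_zoom..=max_zoom.

    Assumption: zoom z divides [0,1]^2 into 2**z by 2**z tiles.
    Hypothetical helper; buffers and clipping are ignored.
    """
    tiles = []
    for z in range(min_zoom, max_zoom + 1):
        n = 1 << z  # tiles per axis at this zoom
        # Clamp so a coordinate of exactly 1.0 lands in the last tile.
        tx = min(int(nx * n), n - 1)
        ty = min(int(ny * n), n - 1)
        tiles.append((z, tx, ty))
    return tiles


# A point a quarter of the way across and halfway down the world bbox.
print(tiles_for_point(0.25, 0.5, 0, 2))
```

Under these assumptions every feature appears once per zoom level it survives at, which is why the fragment count grows with max_zoom.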

`StreamingTileGenerator2D`¶

The streaming quadtree generator. Construct it directly from mudm_tools._rs.

from mudm_tools._rs import StreamingTileGenerator2D

gen = StreamingTileGenerator2D(min_zoom=0, max_zoom=4, buffer=0.0, temp_dir=None)

These are the real defaults

The constructor defaults are min_zoom=0, max_zoom=4, buffer=0.0, temp_dir=None. The max_zoom=7 and buffer=64/4096 used throughout the examples are example values, not defaults; the source default for max_zoom is 4, not 7.

Constructor parameters¶

Parameter	Type	Default	Description
`min_zoom`	int	`0`	Minimum quadtree zoom level.
`max_zoom`	int	`4`	Maximum / leaf zoom level.
`buffer`	float	`0.0`	Tile buffer in normalized `[0,1]` space (a fraction of the full extent, not per-tile units). For PBF, examples pass `buffer=64/4096`.
`temp_dir`	str \| None	`None`	Base directory for fragment shard files; falls back to the OS temp dir. The actual fragment directory is `<temp_dir>/microjson_frags2d_<pid>_<genid>`.
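The pixel-to-normalized conversion used in the examples is plain division by the MVT extent. A minimal sketch (`buffer_from_pixels` is a hypothetical helper, not part of mudm_tools):

```python
def buffer_from_pixels(buffer_px, extent=4096):
    """Convert a buffer given in MVT pixels to the normalized value
    the constructor expects (hypothetical helper)."""
    return buffer_px / extent


buffer = buffer_from_pixels(64)  # the 64 px example used in this guide
print(buffer)
```

The result is what the examples pass as buffer= when constructing the generator.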

Add features before you generate

Calling any generate_* (or the internal _collect_parquet_data / _init_parquet_stream) method consumes the single-feature writer. After that, add_feature / add_geojson raise RuntimeError: Cannot add features after generate. Add everything first, then generate once.

Geometry-type codes¶

The geom_type field is an integer throughout the subsystem:

Code	Geometry
`1`	Point
`2`	LineString
`3`	Polygon

A GeoJSON MultiPolygon is flattened into a single POLYGON feature carrying all rings; a GeometryCollection is processed recursively, member by member.

Ingestion methods¶

There are several ways to add features. All ingestion must happen before any generate_* call.

`add_geojson`¶

Parse a GeoJSON string (a Feature, FeatureCollection, or bare Geometry), project each geometry, clip through the quadtree, and return the list of assigned feature ids.

def add_geojson(self, json_str: str, bounds: tuple[float, float, float, float]) -> list[int]

Parameter	Type	Description
`json_str`	str	GeoJSON text. Supports `Point`, `MultiPoint`, `LineString`, `MultiLineString`, `Polygon`, `MultiPolygon`, `GeometryCollection`.
`bounds`	tuple	World bbox `(xmin, ymin, xmax, ymax)` used for normalization. A degenerate axis (`max == min`) uses span `1.0`.

Returns: list[int] — the assigned feature ids.

gen = StreamingTileGenerator2D(min_zoom=0, max_zoom=7)
fids = gen.add_geojson(open("data.json").read(), (0.0, 0.0, 10000.0, 10000.0))
print(f"Added {len(fids)} features")

Properties become tags with these type mappings: String → Str, integer Number → Int, float Number → Float, Bool → Bool. Arrays, objects, and null are skipped.
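The mapping can be pictured with a small Python sketch. This is an illustration of the rule stated above, not the engine's Rust code; `tags_from_properties` is a hypothetical name, and the way Python's json module classifies numbers (for example `3.0` as a float) is assumed, not confirmed, to match the Rust parser.

```python
import json


def tags_from_properties(properties):
    """Map GeoJSON property values to (type_name, value) tags.

    Illustrative only: arrays, objects, and null are skipped.
    """
    tags = {}
    for key, value in properties.items():
        # bool must be tested before int: in Python, True is also an int.
        if isinstance(value, bool):
            tags[key] = ("Bool", value)
        elif isinstance(value, int):
            tags[key] = ("Int", value)
        elif isinstance(value, float):
            tags[key] = ("Float", value)
        elif isinstance(value, str):
            tags[key] = ("Str", value)
        # list, dict, and None fall through and are dropped
    return tags


props = json.loads(
    '{"name": "cell", "count": 3, "area": 1.5, "ok": true,'
    ' "extra": null, "list": [1, 2], "nested": {"a": 1}}'
)
print(tags_from_properties(props))
```

If a property you rely on is an array or object, flatten it to a string or number before ingestion, since it would otherwise be skipped.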

Geometry minimums: LineStrings need ≥ 4 coordinate values (≥ 2 vertices); polygon rings need ≥ 3 vertices.

Point decimation

Point features are thinned at coarse zooms. At a zoom tz < max_zoom, a point is kept only if fid % (1 << (max_zoom - tz)) == 0. So at max_zoom every point survives; one level coarser keeps every other point, and so on. This applies to add_geojson and add_parquet_points, but not to add_geojson_files. It changes point counts at coarse zooms — expect fewer point rows there.
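The rule can be checked with a few lines of plain Python. This sketch re-implements the predicate quoted above for illustration only (`keeps_point` is a hypothetical name) and counts how many of 16 consecutive feature ids survive at each zoom when max_zoom is 4.

```python
def keeps_point(fid, tz, max_zoom):
    """True if a point with feature id `fid` is kept at zoom `tz`."""
    if tz >= max_zoom:
        return True  # nothing is dropped at the leaf zoom
    return fid % (1 << (max_zoom - tz)) == 0


max_zoom = 4
fids = range(16)
survivors = {
    tz: sum(keeps_point(fid, tz, max_zoom) for fid in fids)
    for tz in range(max_zoom + 1)
}
print(survivors)
```

Each step toward zoom 0 halves the number of surviving points, so point counts at coarse zooms depend on the feature ids and not on spatial density.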

`add_geojson_files`¶

Parallel (rayon) bulk ingest of many GeoJSON files, with the GIL released. Reads, parses, projects, clips, and writes each file across threads using per-thread shard writers.

def add_geojson_files(self, paths: list[str], bounds: tuple[float, float, float, float]) -> list[int]

Parameter	Type	Description
`paths`	list[str]	GeoJSON file paths. Unreadable or invalid-JSON files are silently skipped (yield no features).
`bounds`	tuple	World bbox `(xmin, ymin, xmax, ymax)` for normalization.

Returns: list[int] — assigned feature ids, sorted ascending.

fids = gen.add_geojson_files(
    ["roi_001.geojson", "roi_002.geojson", "roi_003.geojson"],
    (0.0, 0.0, 10000.0, 10000.0),
)

No point decimation here

Unlike add_geojson, this bulk path does not apply point decimation — every point is kept at every zoom. Feature ids are assigned from an atomic counter, so id ordering is not deterministic across runs, but the returned list is sorted. Internally this closes the single-feature writer and opens rayon_threads + 1 per-thread shard files.

`add_feature`¶

Add a single feature that is already projected to [0,1]². No projection is applied — you must pre-normalize coordinates yourself (see CartesianProjector2D).

def add_feature(self, feat: dict) -> int

The feat dict must contain:

Key	Type	Description
`xy`	list[float]	Flat `[x, y, x, y, ...]` in `[0,1]`.
`geom_type`	int	`1` point, `2` linestring, `3` polygon.
`min_x`, `min_y`, `max_x`, `max_y`	float	Geometry bbox in `[0,1]`.
`ring_lengths`	list[int]	Optional — for polygons; `None`/absent ⇒ `[]`.
`tags`	dict	Optional — extracted into feature tags.

Returns: int — the assigned feature id (sequential, starting at 0).

from mudm_tools._rs import StreamingTileGenerator2D, CartesianProjector2D

bounds = (0.0, 0.0, 10000.0, 10000.0)
proj = CartesianProjector2D(bounds)
gen = StreamingTileGenerator2D(min_zoom=0, max_zoom=4)

# A single normalized point at world (2500, 5000)
nx, ny = proj.project(2500.0, 5000.0)
fid = gen.add_feature({
    "xy": [nx, ny],
    "geom_type": 1,
    "min_x": nx, "min_y": ny, "max_x": nx, "max_y": ny,
    "tags": {"layer_type": "markers"},
})
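For geometries with more than one vertex, the same dict shape applies with a flat xy list and a bbox over all vertices. The sketch below builds a linestring dict (geom_type 2) from world coordinates. `linestring_feature` is a hypothetical helper, and its `project` argument is assumed to behave like `CartesianProjector2D.project`; a plain function stands in for it here so the snippet runs without the compiled module.

```python
def linestring_feature(world_points, project, tags=None):
    """Build an add_feature-style dict for a linestring (hypothetical helper).

    `project` maps world (x, y) to normalized (nx, ny).
    """
    norm = [project(x, y) for x, y in world_points]
    xs = [p[0] for p in norm]
    ys = [p[1] for p in norm]
    feat = {
        "xy": [c for p in norm for c in p],  # flat [x, y, x, y, ...]
        "geom_type": 2,
        "min_x": min(xs), "min_y": min(ys),
        "max_x": max(xs), "max_y": max(ys),
    }
    if tags:
        feat["tags"] = dict(tags)
    return feat


def project(x, y):
    # Stand-in for proj.project with bounds (0, 0, 10000, 10000).
    return (x / 10000.0, y / 10000.0)


feat = linestring_feature(
    [(0.0, 0.0), (2500.0, 5000.0), (10000.0, 2500.0)],
    project,
    {"layer_type": "tracks"},
)
print(feat)
```

The resulting dict would then be passed to gen.add_feature(feat); a linestring needs at least two vertices, that is, four values in xy.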

`add_parquet_points`¶

Read point features directly from a Parquet file via arrow-rs (no JSON intermediary). Each row becomes a POINT; coordinates are multiplied by coord_scale, normalized against bounds, clipped, and written with the GIL released.

def add_parquet_points(
    self, path, x_col, y_col, prop_col, prop_name, layer_type, bounds, coord_scale=1.0
) -> int

Parameter	Type	Default	Description
`path`	str	—	Parquet file path.
`x_col`, `y_col`	str	—	Coordinate columns (`Float32` or `Float64`).
`prop_col`	str	—	Source column for a string property; if missing/non-string it is ignored. Supports `StringArray`, `BinaryArray` (utf8-decoded), `LargeStringArray`.
`prop_name`	str	—	Tag key under which the `prop_col` value is stored.
`layer_type`	str	—	Value stored under the fixed tag key `'layer_type'`.
`bounds`	tuple	—	World bbox `(xmin, ymin, xmax, ymax)` after scaling.
`coord_scale`	float	`1.0`	Multiply raw coordinates by this before normalization (e.g. `1 / um_per_px`).

Returns: int — number of point features added.

count = gen.add_parquet_points(
    "transcripts.parquet",
    "x_location", "y_location",   # coordinate columns
    "feature_name", "gene_name",  # prop_col -> output tag key "gene_name"
    "transcripts",                # stored under tag key "layer_type"
    (0.0, 0.0, 8192.0, 8192.0),   # world bounds (after scaling)
    coord_scale=1.0 / 0.2125,     # microns -> pixels
)

Each feature gets the tags [('layer_type', layer_type)] plus ('<prop_name>', value) when prop_col is present. Point decimation applies (same rule as add_geojson).
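Because bounds are interpreted after scaling, it helps to derive them with the same coord_scale you pass to the method. The sketch below shows the arithmetic for the microns-to-pixels case; the 1740.8 µm extent is a made-up value chosen so that the scaled bbox comes out at 8192 px, and `scaled_bounds` is a hypothetical helper.

```python
um_per_px = 0.2125             # pixel size used in the example call
coord_scale = 1.0 / um_per_px  # microns -> pixels


def scaled_bounds(raw_bounds, scale):
    """Scale a raw-unit bbox into the units the engine normalizes against."""
    return tuple(v * scale for v in raw_bounds)


# Hypothetical raw extent in microns.
bounds_px = scaled_bounds((0.0, 0.0, 1740.8, 1740.8), coord_scale)

# A transcript at x = 870.4 um, scaled and then normalized against the bbox.
x_px = 870.4 * coord_scale
x_norm = (x_px - bounds_px[0]) / (bounds_px[2] - bounds_px[0])
print(bounds_px, x_px, x_norm)
```

Passing raw-unit bounds together with a non-unit coord_scale would normalize against the wrong extent, so both should be in the same (scaled) units.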

`add_parquet_polygons`¶

Read polygon features from a Parquet file with a vertex-per-row layout. Rows are grouped by id_col (preserving first-seen order) into rings, scaled, normalized, clipped, and written.

def add_parquet_polygons(
    self, path, id_col, x_col, y_col, layer_type, bounds, coord_scale=1.0
) -> int

Parameter	Type	Default	Description
`path`	str	—	Parquet file path.
`id_col`	str	—	Polygon identifier column. Accepts `Int32`, `Int64`, `String`, `Binary` (utf8), `LargeString` — stringified.
`x_col`, `y_col`	str	—	Vertex coordinate columns (`Float32`/`Float64`).
`layer_type`	str	—	Value for the `'layer_type'` tag.
`bounds`	tuple	—	World bbox `(xmin, ymin, xmax, ymax)` after scaling.
`coord_scale`	float	`1.0`	Multiply raw coords before normalization.

Returns: int — number of polygons added (groups with ≥ 3 vertices).

count = gen.add_parquet_polygons(
    "cell_boundaries.parquet",
    "cell_id", "vertex_x", "vertex_y",  # id + coordinate columns
    "cells",                            # layer_type tag value
    (0.0, 0.0, 8192.0, 8192.0),
    coord_scale=1.0 / 0.2125,
)

Hardcoded tag keys

Tags are [('layer_type', layer_type), ('cell_id', <id string>)]. The polygon id is stored under the hardcoded tag key 'cell_id' regardless of what you pass for id_col. (add_parquet_points likewise stores layer_type under the fixed key 'layer_type'.)

Keep each polygon within one row group

Grouping happens per Arrow record batch. A polygon whose vertex rows straddle two batches is split into separate features. Make sure all rows for a given polygon live within a single row group. Groups with fewer than 3 vertices are dropped. No point decimation applies.
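The effect of a batch boundary can be reproduced with a small pure-Python model of the grouping. This is a sketch of the behavior described above, not the arrow-rs implementation; `group_polygons` is a hypothetical name and each call stands for one record batch.

```python
def group_polygons(batch):
    """Group (id, x, y) vertex rows of ONE batch by id, in first-seen order."""
    groups = {}  # dicts preserve insertion order
    for pid, x, y in batch:
        groups.setdefault(str(pid), []).append((x, y))
    # Groups with fewer than 3 vertices are dropped.
    return [(pid, verts) for pid, verts in groups.items() if len(verts) >= 3]


rows = [
    ("a", 0, 0), ("a", 1, 0), ("a", 1, 1),
    ("b", 5, 5), ("b", 6, 5), ("b", 7, 6),
    ("b", 6, 7), ("b", 5, 7), ("b", 4, 6),
]

# All rows in one batch: two polygons, as intended.
one_batch = group_polygons(rows)

# Same rows with a batch boundary in the middle of polygon "b".
split = group_polygons(rows[:6]) + group_polygons(rows[6:])

print(len(one_batch), [pid for pid, _ in split])
```

In the split case polygon "b" shows up as two separate three-vertex features; had either half contained fewer than three vertices, that half would have been dropped instead.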

`feature_count_val`¶

def feature_count_val(self) -> int

Returns the number of features added so far (the current feature-id counter). Reflects the atomic counter after bulk methods complete.

Output methods¶

After ingesting, call exactly one of these to emit tiles. Each consumes the writer.

`generate_pbf`¶

Flush fragments, group by tile, encode each tile as an MVT in parallel, write the {z}/{x}/{y}.pbf tree, and write metadata.json (TileJSON 3.0.0).

def generate_pbf(
    self, output_dir, world_bounds, extent=4096, simplify=True, layer_name="geojsonLayer"
) -> int

Parameter	Type	Default	Description
`output_dir`	str	—	Root directory for the `{z}/{x}/{y}.pbf` tree.
`world_bounds`	tuple	—	`(xmin, ymin, xmax, ymax)`.
`extent`	int	`4096`	MVT tile extent.
`simplify`	bool	`True`	Apply Douglas-Peucker (polygons via `simplify_polygon_rings`, linestrings via `douglas_peucker`) at zooms `tz < max_zoom`.
`layer_name`	str	`"geojsonLayer"`	MVT layer id (also the `vector_layers` id in `metadata.json`).

Returns: int — number of .pbf tiles written. Empty tiles are skipped.

n_tiles = gen.generate_pbf("tiles/", bounds, simplify=True, layer_name="features")
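For readers unfamiliar with the simplify option, here is a textbook Douglas-Peucker sketch in Python. It is not the engine's Rust implementation, and the tolerance the engine uses at each zoom is not documented here, so the `tolerance` value below is purely illustrative.

```python
import math


def _perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)


def douglas_peucker(points, tolerance):
    """Simplify a polyline, keeping vertices farther than `tolerance`
    from the chord between the current endpoints."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _perp_dist(points[i], first, last)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tolerance:
        return [first, last]
    left = douglas_peucker(points[: idx + 1], tolerance)
    right = douglas_peucker(points[idx:], tolerance)
    return left[:-1] + right  # drop the duplicated split vertex


line = [(0, 0), (1, 0.01), (2, 0), (3, 5), (4, 0)]
print(douglas_peucker(line, 0.1))
```

The near-collinear vertex at (1, 0.01) is removed while the spike at (3, 5) is kept, which is the kind of reduction to expect at coarse zooms when simplify=True.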

The written metadata.json is TileJSON 3.0.0 with: tilejson "3.0.0", tiles ["{z}/{x}/{y}.pbf"], name "MicroJSON Vector Tiles", minzoom/maxzoom, bounds, center [0, cx, cy], vector_layers [{id, fields: {}, minzoom, maxzoom}], and tile_count. See TileJSON reference for the full schema.

Prefer the Python wrapper for keyword-only args

The tiling2d.generate_pbf helper does mkdir -p for you and exposes extent/simplify/layer_name as keyword-only. The Rust method also accepts them positionally, in the order shown in its signature, as well as by keyword (as in the example above).

`generate_parquet_native`¶

Pure-Rust partitioned Parquet writer. Flushes fragments, transforms to world-coordinate f32 in parallel, and writes one or more part files per zoom under {output_dir}/zoom={z}/part_NNN.parquet.

def generate_parquet_native(
    self, output_dir, world_bounds, simplify=True, compression="zstd"
) -> int

Parameter	Type	Default	Description
`output_dir`	str	—	Root directory; per-zoom subdirs `zoom={z}/` each contain `part_000.parquet`, … (one part per rayon chunk).
`world_bounds`	tuple	—	`(xmin, ymin, xmax, ymax)`.
`simplify`	bool	`True`	Douglas-Peucker at coarse zooms.
`compression`	str	`"zstd"`	One of `"zstd"`, `"lz4"` (→ `LZ4_RAW`), `"snappy"`; any other string ⇒ `UNCOMPRESSED`.

Returns: int — total number of rows written.
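The compression fallback is easy to miss, so here is the mapping from the table written out as a sketch. `parquet_codec` is a hypothetical helper, and exact lowercase matching is an assumption; the point is only that an unrecognized name does not raise but silently yields uncompressed output.

```python
def parquet_codec(compression):
    """Codec name selected for a `compression` argument (illustrative)."""
    codecs = {"zstd": "ZSTD", "lz4": "LZ4_RAW", "snappy": "SNAPPY"}
    # Any other string falls back to UNCOMPRESSED instead of raising.
    return codecs.get(compression, "UNCOMPRESSED")


for name in ("zstd", "lz4", "snappy", "gzip"):
    print(name, "->", parquet_codec(name))
```

A typo such as "zst" would therefore produce larger files without any error, so it is worth checking the output size after the first run.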

This always writes a directory, never a single file

generate_parquet_native always produces a Hive-partitioned directory tree (zoom={z}/part_NNN.parquet) regardless of the name you pass. Naming the path features.parquet/ is misleading — it is a directory, not a single .parquet file. For single-file output, use the Python helper tiling2d.generate_parquet with the default partitioned=False.

Arrow schema (8 columns — NO zoom column, since zoom is the partition):

Column	Arrow type	Notes
`tile_x`	`UInt16`
`tile_y`	`UInt16`
`feature_id`	`UInt32`
`geom_type`	`UInt8`	`1`/`2`/`3`
`positions`	`LargeBinary`	`f32` LE world `x,y` pairs
`indices`	`LargeBinary`	`u32` LE line-segment index pairs; empty for non-linestrings
`ring_lengths`	`List<UInt32>`
`tags`	`Map<Utf8,Utf8>`	all tag values stringified
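If you read these files with another tool and get raw bytes for positions and indices, they can be unpacked with the standard library. A minimal sketch based on the layout in the table above (little-endian f32 x,y pairs and little-endian u32 index pairs); `decode_positions` and `decode_indices` are hypothetical helpers, and read_parquet already performs this decoding for you.

```python
import struct


def decode_positions(buf):
    """Unpack little-endian f32 values into a list of (x, y) pairs."""
    n = len(buf) // 4
    flat = struct.unpack(f"<{n}f", buf)
    return list(zip(flat[0::2], flat[1::2]))


def decode_indices(buf):
    """Unpack little-endian u32 values into (start, end) segment pairs."""
    n = len(buf) // 4
    flat = struct.unpack(f"<{n}I", buf)
    return list(zip(flat[0::2], flat[1::2]))


# Synthetic bytes standing in for one row of a three-vertex linestring.
positions = struct.pack("<6f", 0.0, 0.0, 2500.0, 5000.0, 10000.0, 2500.0)
indices = struct.pack("<4I", 0, 1, 1, 2)

print(decode_positions(positions))
print(decode_indices(indices))
```

An empty indices buffer, as stored for non-linestring rows, decodes to an empty list.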

`generate_all`¶

Read fragments once, then run PBF encoding and native partitioned-Parquet writing concurrently (rayon::join, GIL released). Equivalent to generate_pbf + generate_parquet_native sharing one fragment read. Also writes pbf_dir/metadata.json.

def generate_all(
    self, pbf_dir, parquet_dir, world_bounds,
    extent=4096, simplify=True, layer_name="geojsonLayer", compression="zstd"
) -> tuple[int, int]

Parameter	Type	Default	Description
`pbf_dir`	str	—	Root for the `{z}/{x}/{y}.pbf` tree + `metadata.json`.
`parquet_dir`	str	—	Root for `zoom={z}/part_NNN.parquet`.
`world_bounds`	tuple	—	`(xmin, ymin, xmax, ymax)`.
`extent`	int	`4096`	MVT tile extent.
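`simplify`	bool	`True`	Douglas-Peucker at coarse zooms, as in `generate_pbf`.
`layer_name`	str	`"geojsonLayer"`	MVT layer id, as in `generate_pbf`.
`compression`	str	`"zstd"`	Parquet codec, as in `generate_parquet_native`.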

Returns: tuple[int, int] — (tile_count, parquet_rows) (u32 PBF tile count, u64 Parquet row count). The Parquet output uses the same 8-column partitioned schema as generate_parquet_native; metadata.json matches generate_pbf.

tile_count, parquet_rows = gen.generate_all(
    "out/vectors", "out/features", bounds, layer_name="features"
)

Internal streaming protocol

The methods _collect_parquet_data, _init_parquet_stream, _next_parquet_batch, and _close_parquet_stream (all leading-underscore) implement the streaming protocol consumed by tiling2d.parquet_writer. They are not intended for direct end-user calls — use tiling2d.generate_parquet instead.

`CartesianProjector2D`¶

Normalizes 2D world coordinates to [0,1]² and back. Use it to pre-normalize coordinates for add_feature, which expects [0,1] input.

from mudm_tools._rs import CartesianProjector2D

proj = CartesianProjector2D((0.0, 0.0, 10000.0, 10000.0))   # (xmin, ymin, xmax, ymax)

nx, ny = proj.project(2500.0, 5000.0)     # world -> normalized [0,1]^2
x, y = proj.unproject(nx, ny)             # normalized -> world

Method	Signature	Returns
`project`	`project(self, x: float, y: float) -> tuple[float, float]`	normalized `(nx, ny)`
`unproject`	`unproject(self, nx: float, ny: float) -> tuple[float, float]`	world `(x, y)`

A degenerate axis (max == min) uses span 1.0, so project returns 0.0 on that axis.
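The behavior described here, including the degenerate-axis rule, can be modeled in a few lines of Python. `Projector2DSketch` is a hypothetical stand-in written for illustration; it is not the compiled class and is only assumed to agree with it on the cases shown.

```python
class Projector2DSketch:
    """Pure-Python model of world <-> [0,1]^2 normalization (illustrative)."""

    def __init__(self, bounds):
        self.xmin, self.ymin, xmax, ymax = bounds
        # A degenerate axis (max == min) uses span 1.0 to avoid dividing by 0.
        self.span_x = (xmax - self.xmin) or 1.0
        self.span_y = (ymax - self.ymin) or 1.0

    def project(self, x, y):
        return ((x - self.xmin) / self.span_x, (y - self.ymin) / self.span_y)

    def unproject(self, nx, ny):
        return (self.xmin + nx * self.span_x, self.ymin + ny * self.span_y)


proj = Projector2DSketch((0.0, 0.0, 10000.0, 10000.0))
print(proj.project(2500.0, 5000.0))
print(proj.unproject(0.25, 0.5))

# Degenerate x axis: xmin == xmax == 5.0
flat = Projector2DSketch((5.0, 0.0, 5.0, 10.0))
print(flat.project(5.0, 5.0))
```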

Python helper functions (`mudm_tools.tiling2d`)¶

These wrap the engine's output methods, add the readers, and provide partition-maintenance utilities. All are importable directly from mudm_tools.tiling2d.

from mudm_tools.tiling2d import (
    generate_pbf, read_pbf,
    generate_parquet, read_parquet,
    prime_parquet, deprime_parquet, repartition_parquet,
)

Importing the engine class

tiling2d.__init__ imports StreamingTileGenerator2D and CartesianProjector2D from mudm_tools._rs but does not list them in __all__. from mudm_tools.tiling2d import StreamingTileGenerator2D happens to work, but the canonical, documented import is from mudm_tools._rs import StreamingTileGenerator2D.

`tiling2d.generate_pbf`¶

def generate_pbf(
    generator, output_path, world_bounds, *,
    extent=4096, simplify=True, layer_name="geojsonLayer"
) -> int

Thin wrapper: mkdir -p the output dir, then call generator.generate_pbf(...). Returns the number of tiles written. extent, simplify, and layer_name are keyword-only here, whereas the Rust method also accepts them positionally.

Parameter	Type	Default	Description
`generator`	`StreamingTileGenerator2D`	—	A generator with features added.
`output_path`	str \| Path	—	Directory for the `{z}/{x}/{y}.pbf` tree.
`world_bounds`	tuple	—	`(xmin, ymin, xmax, ymax)`.
`extent`	int	`4096`	MVT extent (kw-only).
`simplify`	bool	`True`	(kw-only)
`layer_name`	str	`"geojsonLayer"`	(kw-only)

`tiling2d.read_pbf`¶

def read_pbf(path, world_bounds, *, zoom=None, tile_x=None, tile_y=None) -> list[dict]

Walk the {z}/{x}/{y}.pbf tree, decode each MVT tile, convert tile-local integers back to world f64 coordinates, and return one dict per feature, sorted by (z, x, y). Returns [] if path is not a directory.

Parameter	Type	Default	Description
`path`	str \| Path	—	Directory containing the `{z}/{x}/{y}.pbf` tree.
`world_bounds`	tuple	—	The `(xmin, ymin, xmax, ymax)` used at generation time.
`zoom`, `tile_x`, `tile_y`	int \| None	`None`	Optional filters (kw-only).

Output dict keys: zoom, tile_x, tile_y, feature_id, geom_type, positions (numpy float32 [N, 2]), ring_lengths, tags.

read_pbf has no indices key

The MVT reader does not reconstruct line-segment indices. read_pbf dicts contain no 'indices' key. If you need indices, read from Parquet instead (see read_parquet).

features = read_pbf("tiles/", bounds, zoom=0)
if features:
    f = features[0]
    print(f["geom_type"], f["positions"].shape, f["tags"])
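To see why decoded PBF coordinates are quantized, it helps to write out the tile-local to world conversion. The sketch below assumes standard MVT addressing (2^z tiles per axis, integer coordinates in 0..extent) and no y-axis flip; the actual reader is not shown in this guide, and both function names are hypothetical.

```python
def tile_local_to_world(z, tile_x, tile_y, px, py, world_bounds, extent=4096):
    """Map a tile-local integer coordinate back to world space (sketch)."""
    xmin, ymin, xmax, ymax = world_bounds
    n = 1 << z  # tiles per axis at zoom z
    nx = (tile_x + px / extent) / n
    ny = (tile_y + py / extent) / n
    return (xmin + nx * (xmax - xmin), ymin + ny * (ymax - ymin))


def tile_unit_size(z, world_bounds, extent=4096):
    """World-space size of one tile-local unit at zoom z (sketch)."""
    xmin, ymin, xmax, ymax = world_bounds
    n = (1 << z) * extent
    return ((xmax - xmin) / n, (ymax - ymin) / n)


bounds = (0.0, 0.0, 10000.0, 10000.0)
print(tile_local_to_world(1, 1, 0, 2048, 0, bounds))
print(tile_unit_size(0, bounds))
```

Under these assumptions a zoom-0 tile over a 10,000-unit world resolves about 2.44 world units per step, so coordinates read back from coarse PBF tiles will not match the input exactly.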

`tiling2d.generate_parquet`¶

def generate_parquet(
    generator, output_path, world_bounds, *,
    compression="zstd", compression_level=3, batch_size=50_000,
    partitioned=False, max_file_bytes=500_000_000,
    max_batch_bytes=2_000_000_000, simplify=True
) -> int

Writes tiled Parquet from a StreamingTileGenerator2D. It selects one of three paths: in-memory (if the generator lacks _init_parquet_stream), single-file streaming (the default), or partitioned streaming (partitioned=True). Returns rows written.

Parameter	Type	Default	Description
`generator`	`StreamingTileGenerator2D`	—	Generator with fragments added.
`output_path`	str \| Path	—	A single `.parquet` file, or a directory if `partitioned=True`.
`world_bounds`	tuple	—	`(xmin, ymin, xmax, ymax)`.
`compression`	str	`"zstd"`	(kw-only)
`compression_level`	int	`3`	(kw-only)
`batch_size`	int	`50_000`	Fragments per streaming batch (kw-only).
`partitioned`	bool	`False`	`True` writes a Hive `zoom={z}/part_NNN.parquet` tree (no `zoom` column); `False` writes a single sorted-by-zoom file (with a `zoom` column). (kw-only)
`max_file_bytes`	int	`500_000_000`	Rotating part-file size budget (partitioned mode, kw-only).
`max_batch_bytes`	int	`2_000_000_000`	Per-batch byte budget (kw-only).
`simplify`	bool	`True`	(kw-only)

# Single file (default): includes a leading `zoom` column
n = generate_parquet(gen, "output.parquet", bounds)

# Hive-partitioned directory: zoom is the partition, no `zoom` column
n = generate_parquet(gen, "output_dir/", bounds, partitioned=True)

Single-file and native are DIFFERENT schemas

tiling2d.generate_parquet (single-file mode, the default) writes a 9-column table that includes a leading zoom UInt8 column. The Rust generate_parquet_native method — and generate_parquet in partitioned=True mode — write an 8-column table with no zoom column (zoom is encoded in the directory name zoom={z}). Do not conflate them. See Output structures below.

`tiling2d.read_parquet`¶

def read_parquet(path, *, zoom=None, feature_id=None, tile_x=None, tile_y=None) -> list[dict]

Reader for tiled 2D Parquet (single file or Hive-partitioned dir). Uses PyArrow dataset predicate pushdown for the optional filters and decodes binary columns into numpy arrays. For a partitioned dir it auto-detects Arrow IPC (.arrow) siblings (the "primed" fast path) vs .parquet.

Parameter	Type	Default	Description
`path`	str \| Path	—	`.parquet` file or partitioned directory.
`zoom`, `feature_id`, `tile_x`, `tile_y`	int \| None	`None`	Optional filters, combined with AND (kw-only).

Output dict keys: zoom, tile_x, tile_y, feature_id, geom_type, positions (np float32 [N, 2]), indices (np uint32 [M]), ring_lengths (list[int]), tags (dict[str, str]).

read_parquet includes indices

Unlike read_pbf, read_parquet dicts do contain an 'indices' key (numpy uint32).

rows = read_parquet("output.parquet", zoom=0)
# Or filter a partitioned dir by zoom + tile:
rows = read_parquet("output_dir/", zoom=3, tile_x=2, tile_y=5)

Partition maintenance utilities¶

These operate on a Hive-partitioned Parquet pyramid (zoom={z}/...). All raise FileNotFoundError / NotADirectoryError on a bad path.

`prime_parquet`¶

def prime_parquet(path, *, compression="uncompressed") -> int

Convert each zoom={z}/*.parquet partition file to a sibling Arrow IPC (.arrow) file — the read_parquet fast path. Drops the zoom column if present. Returns the number of .arrow files written. compression (kw-only) is one of {"uncompressed", "lz4", "zstd"} (else ValueError).

`deprime_parquet`¶

def deprime_parquet(path) -> int

Delete all zoom={z}/*.arrow IPC sibling files from a partitioned pyramid. Returns the number deleted.

`repartition_parquet`¶

def repartition_parquet(
    path, *, max_file_bytes=500_000_000, compression="zstd", compression_level=3
) -> dict[int, int]

Split oversized per-zoom partition files into uniformly named part_NNN.parquet files capped at max_file_bytes of estimated uncompressed binary data. A zoom directory is skipped when every file in it already matches part_NNN.parquet and is no larger than max_file_bytes on disk. Otherwise the function drops the zoom column, removes .arrow siblings, and writes via temp files that are then renamed. Returns {zoom: num_parts}.

from mudm_tools.tiling2d import prime_parquet, deprime_parquet, repartition_parquet

# Re-balance part files, then add IPC fast-path siblings
repartition_parquet("output_dir/", max_file_bytes=250_000_000)
n_arrow = prime_parquet("output_dir/")          # now read_parquet uses the .arrow files
# ... later, to reclaim space:
deprime_parquet("output_dir/")
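The splitting logic in repartition_parquet can be approximated without touching any files. The sketch below mirrors the chunking and rotation rule from the function's source (chunks of roughly half a file's worth of rows, rotating when the running byte count would exceed max_file_bytes), under the simplifying assumption that every row has the same binary size. `plan_parts` is a hypothetical helper for reasoning about part counts, not part of mudm_tools.

```python
def plan_parts(num_rows, bytes_per_row, max_file_bytes):
    """Return the number of rows each output part would receive (sketch)."""
    if num_rows > 0 and bytes_per_row > 0:
        rows_per_file = max(1, int(max_file_bytes / bytes_per_row))
        chunk_size = max(1, rows_per_file // 2)
    else:
        chunk_size = max(1, num_rows)

    parts = []
    cum_bytes = 0
    writer_open = False
    for start in range(0, num_rows, chunk_size):
        rows = min(chunk_size, num_rows - start)
        chunk_bytes = rows * bytes_per_row
        # Rotate to a new part if this chunk would overflow the budget.
        if writer_open and cum_bytes > 0 and cum_bytes + chunk_bytes > max_file_bytes:
            writer_open = False
            cum_bytes = 0
        if not writer_open:
            parts.append(0)
            writer_open = True
        parts[-1] += rows
        cum_bytes += chunk_bytes
    return parts


# 10 rows of 100 bytes each with a 400-byte budget per file.
print(plan_parts(10, 100, 400))
```

Because rows are written in half-file chunks, parts fill up to the budget when the row size divides it evenly; with uneven real-world row sizes the parts will usually come out somewhat smaller.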

Output structures¶

PBF (MVT) — `generate_pbf` / `generate_all`¶

output_dir/
  metadata.json          # TileJSON 3.0.0 (tilejson, tiles=["{z}/{x}/{y}.pbf"],
                         #   minzoom, maxzoom, bounds, center, vector_layers, tile_count)
  {z}/
    {x}/
      {y}.pbf            # one MVT per non-empty tile; layer id = layer_name
                         #   (default "geojsonLayer"); MVT extent = extent (default 4096)
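When you only need to know which tiles exist, the directory layout can be walked directly without decoding anything. A minimal sketch using pathlib; `list_pbf_tiles` is a hypothetical helper, and the demo writes empty placeholder files, not real MVT data.

```python
import tempfile
from pathlib import Path


def list_pbf_tiles(root):
    """Return sorted (z, x, y) tuples for every {z}/{x}/{y}.pbf under root."""
    tiles = []
    for path in Path(root).glob("*/*/*.pbf"):
        try:
            z, x, y = int(path.parts[-3]), int(path.parts[-2]), int(path.stem)
        except ValueError:
            continue  # ignore anything that is not a numeric z/x/y path
        tiles.append((z, x, y))
    return sorted(tiles)


with tempfile.TemporaryDirectory() as tmp:
    for z, x, y in [(1, 1, 0), (0, 0, 0), (1, 0, 1)]:
        tile = Path(tmp) / str(z) / str(x) / f"{y}.pbf"
        tile.parent.mkdir(parents=True, exist_ok=True)
        tile.write_bytes(b"")  # empty placeholder, not a real tile
    (Path(tmp) / "metadata.json").write_text("{}")
    found = list_pbf_tiles(tmp)

print(found)
```

Since empty tiles are skipped at write time, gaps in this listing are expected and do not indicate a failed run.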

Tiled Parquet — partitioned (8-col, no `zoom` column)¶

Produced by the Rust generate_parquet_native, by generate_all, and by generate_parquet(partitioned=True):

output_dir/
  zoom=0/
    part_000.parquet
    part_001.parquet     # native: one part per rayon chunk;
                         #   python partitioned: rotated by max_file_bytes
  zoom=1/
    part_000.parquet
  ...

Columns: tile_x(u16), tile_y(u16), feature_id(u32), geom_type(u8), positions(large_binary, f32 LE world x,y pairs), indices(large_binary, u32 LE seg-index pairs; empty for non-linestrings), ring_lengths(list<u32>), tags(map<utf8,utf8>).

Tiled Parquet — single file (9-col, with `zoom` column)¶

Produced by generate_parquet(partitioned=False), the default. One file, sorted by zoom (one row group per zoom):

output.parquet           # single file

Same columns as above plus a leading zoom(u8) column.

Primed fast path¶

prime_parquet adds sibling Arrow IPC files inside each partition:

output_dir/
  zoom=0/
    part_000.parquet
    part_000.arrow       # written by prime_parquet; removed by deprime_parquet
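To check whether a pyramid is fully primed, you can compare each Parquet part with its expected .arrow sibling. The sketch below uses the same glob pattern and suffix swap as the prime_parquet source; `primed_status` is a hypothetical helper, and the demo files are empty placeholders.

```python
import re
import tempfile
from pathlib import Path


def primed_status(root):
    """Map zoom -> [(parquet_name, has_arrow_sibling), ...]."""
    status = {}
    for pq_file in sorted(Path(root).glob("zoom=*/*.parquet")):
        m = re.fullmatch(r"zoom=(\d+)", pq_file.parent.name)
        if not m:
            continue
        has_arrow = pq_file.with_suffix(".arrow").exists()
        status.setdefault(int(m.group(1)), []).append((pq_file.name, has_arrow))
    return status


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in [
        "zoom=0/part_000.parquet",
        "zoom=0/part_000.arrow",
        "zoom=1/part_000.parquet",
    ]:
        f = root / name
        f.parent.mkdir(parents=True, exist_ok=True)
        f.write_bytes(b"")  # placeholders, not real Parquet/Arrow data
    status = primed_status(root)

print(status)
```

A zoom level with any False entry has not been primed for that part; how read_parquet behaves for a partially primed directory is not covered in this guide.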

End-to-end example¶

The repository ships a complete, runnable script at src/mudm_tools/examples/tiling_rust.py. It generates random polygons with polygen, builds a StreamingTileGenerator2D, adds them with add_geojson, then writes and reads either Parquet or PBF.

# Default: single-file Parquet at tiles_2d.parquet, max-zoom 7
uv run python -m mudm_tools.examples.tiling_rust

# Hive-partitioned Parquet, lower max-zoom
uv run python -m mudm_tools.examples.tiling_rust --max-zoom 6 --partitioned

# PBF vector tiles (writes to tiles/ when the output path ends in .parquet)
uv run python -m mudm_tools.examples.tiling_rust my_data.json --pbf

The script accepts --min-zoom, --max-zoom (default 7), --output, --partitioned, --pbf, --no-simplify, --buffer (pixels at extent 4096, default 64), --grid-size, and --cell-size. It converts the pixel buffer to normalized space with buffer = args.buffer / 4096.0 before constructing the generator.

Run modules, not file paths

Run the example as a module: python -m mudm_tools.examples.tiling_rust. The script's own docstring shows a stale src/mudm/examples/... path — the real location is src/mudm_tools/examples/tiling_rust.py.

API reference (autodoc)¶

tiling2d ¶

2D vector tile generation and reading for MuDM.

Provides quadtree-based spatial indexing and a full pipeline from GeoJSON features to tiled Parquet output for ML training.

deprime_parquet ¶

deprime_parquet(path: str | Path) -> int

Remove all Arrow IPC siblings from a partitioned Parquet pyramid.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Root directory of a Hive-partitioned Parquet pyramid.	required

Returns:

Type	Description
`int`	Number of Arrow IPC files deleted.

Source code in src/mudm_tools/tiling2d/parquet_prime.py

def deprime_parquet(path: str | Path) -> int:
    """Remove all Arrow IPC siblings from a partitioned Parquet pyramid.

    Args:
        path: Root directory of a Hive-partitioned Parquet pyramid.

    Returns:
        Number of Arrow IPC files deleted.
    """
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"Path does not exist: {root}")
    if not root.is_dir():
        raise NotADirectoryError(f"Path is not a directory: {root}")

    count = 0
    for arrow_file in sorted(root.glob("zoom=*/*.arrow")):
        arrow_file.unlink()
        count += 1

    return count

prime_parquet ¶

prime_parquet(
    path: str | Path, *, compression: str = "uncompressed"
) -> int

Convert each partition's Parquet files to sibling Arrow IPC files.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Root directory of a Hive-partitioned Parquet pyramid.	required
`compression`	`str`	Arrow IPC compression (default "uncompressed").	`'uncompressed'`

Returns:

Type	Description
`int`	Number of Arrow IPC files written.

Source code in src/mudm_tools/tiling2d/parquet_prime.py

def prime_parquet(
    path: str | Path,
    *,
    compression: str = "uncompressed",
) -> int:
    """Convert each partition's Parquet files to sibling Arrow IPC files.

    Args:
        path: Root directory of a Hive-partitioned Parquet pyramid.
        compression: Arrow IPC compression (default "uncompressed").

    Returns:
        Number of Arrow IPC files written.
    """
    allowed = {"uncompressed", "lz4", "zstd"}
    if compression not in allowed:
        raise ValueError(f"compression must be one of {sorted(allowed)}, got {compression!r}")

    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"Path does not exist: {root}")
    if not root.is_dir():
        raise NotADirectoryError(f"Path is not a directory: {root}")

    count = 0
    for pq_file in sorted(root.glob("zoom=*/*.parquet")):
        table = pq.read_table(str(pq_file))
        if "zoom" in table.column_names:
            table = table.drop_columns(["zoom"])
        arrow_file = pq_file.with_suffix(".arrow")
        feather.write_feather(table, str(arrow_file), compression=compression)
        count += 1

    return count

repartition_parquet ¶

repartition_parquet(
    path: str | Path,
    *,
    max_file_bytes: int = _DEFAULT_MAX_FILE_BYTES,
    compression: str = "zstd",
    compression_level: int = 3,
) -> dict[int, int]

Split oversized partition files into smaller part_NNN.parquet files.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Root directory of a Hive-partitioned Parquet pyramid.	required
`max_file_bytes`	`int`	Maximum uncompressed binary bytes per output file.	`_DEFAULT_MAX_FILE_BYTES`
`compression`	`str`	Parquet compression codec (default "zstd").	`'zstd'`
`compression_level`	`int`	Compression level (default 3).	`3`

Returns:

Type	Description
`dict[int, int]`	Dict mapping zoom level to number of output parts.

Source code in src/mudm_tools/tiling2d/parquet_prime.py

def repartition_parquet(
    path: str | Path,
    *,
    max_file_bytes: int = _DEFAULT_MAX_FILE_BYTES,
    compression: str = "zstd",
    compression_level: int = 3,
) -> dict[int, int]:
    """Split oversized partition files into smaller ``part_NNN.parquet`` files.

    Args:
        path: Root directory of a Hive-partitioned Parquet pyramid.
        max_file_bytes: Maximum uncompressed binary bytes per output file.
        compression: Parquet compression codec (default "zstd").
        compression_level: Compression level (default 3).

    Returns:
        Dict mapping zoom level to number of output parts.
    """
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"Path does not exist: {root}")
    if not root.is_dir():
        raise NotADirectoryError(f"Path is not a directory: {root}")

    result: dict[int, int] = {}

    for zoom_dir in sorted(root.glob("zoom=*")):
        if not zoom_dir.is_dir():
            continue
        m = re.fullmatch(r"zoom=(\d+)", zoom_dir.name)
        if not m:
            continue
        zoom = int(m.group(1))

        pq_files = sorted(zoom_dir.glob("*.parquet"))
        if not pq_files:
            continue

        all_named_ok = all(re.fullmatch(r"part_\d{3}\.parquet", f.name) for f in pq_files)
        all_small = all(f.stat().st_size <= max_file_bytes for f in pq_files)
        if all_named_ok and all_small:
            result[zoom] = len(pq_files)
            continue

        tables = []
        for f in pq_files:
            tables.append(pq.read_table(str(f)))
        table = pa.concat_tables(tables)
        if "zoom" in table.column_names:
            table = table.drop_columns(["zoom"])

        kwargs: dict = {"compression": compression}
        if compression not in ("none", "NONE", None):
            kwargs["compression_level"] = compression_level

        table = table.combine_chunks()
        schema = table.schema
        part_idx = 0
        cum_bytes = 0
        writer: pq.ParquetWriter | None = None
        tmp_files: list[Path] = []

        def _open_writer() -> pq.ParquetWriter:
            nonlocal part_idx
            tmp_path = zoom_dir / f".tmp_part_{part_idx:03d}.parquet"
            tmp_files.append(tmp_path)
            return pq.ParquetWriter(str(tmp_path), schema, **kwargs)

        total_binary = _estimate_binary_bytes(table)
        if table.num_rows > 0 and total_binary > 0:
            bytes_per_row = total_binary / table.num_rows
            rows_per_file = max(1, int(max_file_bytes / bytes_per_row))
            chunk_size = max(1, rows_per_file // 2)
        else:
            chunk_size = max(1, table.num_rows)

        for start in range(0, table.num_rows, chunk_size):
            chunk = table.slice(start, min(chunk_size, table.num_rows - start))
            chunk_bytes = _estimate_binary_bytes(chunk)

            if writer is not None and cum_bytes > 0 and cum_bytes + chunk_bytes > max_file_bytes:
                writer.close()
                part_idx += 1
                cum_bytes = 0
                writer = None

            if writer is None:
                writer = _open_writer()

            writer.write_table(chunk)
            cum_bytes += chunk_bytes

        if writer is not None:
            writer.close()

        for f in pq_files:
            f.unlink()
        for f in zoom_dir.glob("*.arrow"):
            f.unlink()

        for tmp_path in tmp_files:
            final_name = tmp_path.name.replace(".tmp_", "")
            tmp_path.rename(zoom_dir / final_name)

        result[zoom] = part_idx + 1

    return result
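
The splitting loop above can be read as a greedy packer: rows are cut into chunks of roughly half the per-file budget, and a new part is started whenever adding the next chunk would exceed `max_file_bytes`. The following is a minimal pure-Python sketch of that bookkeeping only. `plan_parts` and the per-row byte sizes are hypothetical stand-ins for the PyArrow table and `_estimate_binary_bytes`; they are not part of mudm-tools.

```python
def plan_parts(row_bytes: list[int], max_file_bytes: int) -> list[list[int]]:
    """Group row indices into parts using the same rollover rule as the loop above.

    row_bytes is a hypothetical list of per-row binary sizes; the real code
    estimates sizes from an Arrow table instead.
    """
    n = len(row_bytes)
    total = sum(row_bytes)
    if n > 0 and total > 0:
        bytes_per_row = total / n
        rows_per_file = max(1, int(max_file_bytes / bytes_per_row))
        chunk_size = max(1, rows_per_file // 2)  # about half a file per chunk
    else:
        chunk_size = max(1, n)

    parts: list[list[int]] = []
    cum_bytes = 0
    for start in range(0, n, chunk_size):
        chunk = list(range(start, min(start + chunk_size, n)))
        chunk_bytes = sum(row_bytes[i] for i in chunk)
        # Roll over only if the current part already holds data and this
        # chunk would push it past the budget.
        if parts and cum_bytes > 0 and cum_bytes + chunk_bytes > max_file_bytes:
            parts.append([])
            cum_bytes = 0
        if not parts:
            parts.append([])
        parts[-1].extend(chunk)
        cum_bytes += chunk_bytes
    return parts
```

With ten 100-byte rows and a 400-byte budget this yields three parts holding 4, 4, and 2 rows. A single row larger than the budget is still written, as a part of its own.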

read_parquet ¶

read_parquet(
    path: str | Path,
    *,
    zoom: int | None = None,
    feature_id: int | None = None,
    tile_x: int | None = None,
    tile_y: int | None = None,
) -> list[dict]

Read rows from a tiled 2D Parquet file.

Returns a list of dicts with keys: zoom, tile_x, tile_y, feature_id, geom_type, positions (np.float32 [N,2]), indices (np.uint32 [M]), ring_lengths (list[int]), tags (dict[str, str]).

Uses PyArrow predicate pushdown for efficient filtering.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to the .parquet file or partitioned directory.	required
`zoom`	`int \| None`	Filter to this zoom level.	`None`
`feature_id`	`int \| None`	Filter to this feature ID.	`None`
`tile_x`	`int \| None`	Filter to this tile X coordinate.	`None`
`tile_y`	`int \| None`	Filter to this tile Y coordinate.	`None`

Source code in src/mudm_tools/tiling2d/parquet_reader.py

def read_parquet(
    path: str | Path,
    *,
    zoom: int | None = None,
    feature_id: int | None = None,
    tile_x: int | None = None,
    tile_y: int | None = None,
) -> list[dict]:
    """Read rows from a tiled 2D Parquet file.

    Returns a list of dicts with:
        zoom, tile_x, tile_y, feature_id, geom_type,
        positions (np.float32 [N,2]), indices (np.uint32 [M]),
        ring_lengths (list[int]), tags (dict[str, str]).

    Uses PyArrow predicate pushdown for efficient filtering.

    Args:
        path: Path to the .parquet file or partitioned directory.
        zoom: Filter to this zoom level.
        feature_id: Filter to this feature ID.
        tile_x: Filter to this tile X coordinate.
        tile_y: Filter to this tile Y coordinate.
    """
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    path_obj = Path(path)
    if path_obj.is_dir():
        fmt = _detect_format(path_obj)
        if fmt == "ipc":
            arrow_files = sorted(str(f) for f in path_obj.glob("zoom=*/*.arrow"))
            partitioning = ds.HivePartitioning(pa.schema([("zoom", pa.int32())]))
            dataset = ds.dataset(
                arrow_files,
                format="ipc",
                partitioning=partitioning,
                partition_base_dir=str(path_obj),
            )
        else:
            dataset = ds.dataset(str(path), format="parquet", partitioning="hive")
    else:
        dataset = ds.dataset(str(path), format="parquet")

    filters = []
    if zoom is not None:
        filters.append(pc.field("zoom") == zoom)
    if feature_id is not None:
        filters.append(pc.field("feature_id") == feature_id)
    if tile_x is not None:
        filters.append(pc.field("tile_x") == tile_x)
    if tile_y is not None:
        filters.append(pc.field("tile_y") == tile_y)

    combined = None
    for f in filters:
        combined = f if combined is None else (combined & f)

    table = dataset.to_table(filter=combined)

    rows = []
    for i in range(table.num_rows):
        pos_bytes = table.column("positions")[i].as_py()
        idx_bytes = table.column("indices")[i].as_py()

        positions = np.frombuffer(pos_bytes, dtype=np.float32).reshape(-1, 2)
        indices = np.frombuffer(idx_bytes, dtype=np.uint32)

        # ring_lengths: list<uint32>
        rl_val = table.column("ring_lengths")[i].as_py()
        ring_lengths = list(rl_val) if rl_val else []

        tag_map = table.column("tags")[i].as_py()
        tags = dict(tag_map) if tag_map else {}

        rows.append(
            {
                "zoom": table.column("zoom")[i].as_py(),
                "tile_x": table.column("tile_x")[i].as_py(),
                "tile_y": table.column("tile_y")[i].as_py(),
                "feature_id": table.column("feature_id")[i].as_py(),
                "geom_type": table.column("geom_type")[i].as_py(),
                "positions": positions,
                "indices": indices,
                "ring_lengths": ring_lengths,
                "tags": tags,
            }
        )

    return rows
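
The `positions` and `indices` columns are stored as raw bytes and reinterpreted with `np.frombuffer`. As a rough illustration of that layout, the sketch below packs and unpacks the same kind of buffers using only the standard library. It assumes little-endian byte order (what `np.frombuffer` sees on typical x86 and ARM machines). `decode_positions` and `decode_indices` are hypothetical helper names, not part of mudm-tools.

```python
import struct


def decode_positions(buf: bytes) -> list[tuple[float, float]]:
    """Interpret buf as consecutive little-endian float32 (x, y) pairs."""
    if len(buf) % 8 != 0:
        raise ValueError("positions buffer must be a multiple of 8 bytes")
    flat = struct.unpack(f"<{len(buf) // 4}f", buf)
    return list(zip(flat[0::2], flat[1::2]))


def decode_indices(buf: bytes) -> list[int]:
    """Interpret buf as consecutive little-endian uint32 values."""
    if len(buf) % 4 != 0:
        raise ValueError("indices buffer must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(buf) // 4}I", buf))


# Round-trip three vertices and three indices through raw bytes.
pos_bytes = struct.pack("<6f", 0.0, 0.0, 1.0, 0.0, 0.0, 1.0)
idx_bytes = struct.pack("<3I", 0, 1, 2)
vertices = decode_positions(pos_bytes)
indices = decode_indices(idx_bytes)
```

The length checks mirror what `reshape(-1, 2)` enforces implicitly: a positions buffer that does not hold a whole number of (x, y) pairs cannot be decoded.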

generate_parquet ¶

generate_parquet(
    generator,
    output_path: str | Path,
    world_bounds: tuple[float, float, float, float],
    *,
    compression: str = "zstd",
    compression_level: int = 3,
    batch_size: int = 50000,
    partitioned: bool = False,
    max_file_bytes: int = _DEFAULT_MAX_FILE_BYTES,
    max_batch_bytes: int = _DEFAULT_MAX_BATCH_BYTES,
    simplify: bool = True,
) -> int

Generate a Parquet file from a StreamingTileGenerator2D.

Parameters:

Name	Type	Description	Default
`generator`		A StreamingTileGenerator2D with fragments already added.	required
`output_path`	`str \| Path`	Path for the output .parquet file (or directory if partitioned).	required
`world_bounds`	`tuple[float, float, float, float]`	World bounding box (xmin, ymin, xmax, ymax).	required
`compression`	`str`	Parquet compression codec (default "zstd").	`'zstd'`
`compression_level`	`int`	Compression level (default 3).	`3`
`batch_size`	`int`	Number of fragments to process per batch (streaming mode).	`50000`
`partitioned`	`bool`	If True, write partitioned output (one file per zoom level).	`False`
`max_file_bytes`	`int`	Byte budget per output file; passed only to the partitioned streaming writer.	`_DEFAULT_MAX_FILE_BYTES`
`max_batch_bytes`	`int`	Byte budget per batch (default 2 GB).	`_DEFAULT_MAX_BATCH_BYTES`
`simplify`	`bool`	If True (default), apply Douglas-Peucker simplification at coarse zoom levels for polygons and linestrings.	`True`

Returns:

Type	Description
`int`	Number of rows written.

Source code in src/mudm_tools/tiling2d/parquet_writer.py

def generate_parquet(
    generator,
    output_path: str | Path,
    world_bounds: tuple[float, float, float, float],
    *,
    compression: str = "zstd",
    compression_level: int = 3,
    batch_size: int = 50_000,
    partitioned: bool = False,
    max_file_bytes: int = _DEFAULT_MAX_FILE_BYTES,
    max_batch_bytes: int = _DEFAULT_MAX_BATCH_BYTES,
    simplify: bool = True,
) -> int:
    """Generate a Parquet file from a StreamingTileGenerator2D.

    Args:
        generator: A StreamingTileGenerator2D with fragments already added.
        output_path: Path for the output .parquet file (or directory if partitioned).
        world_bounds: World bounding box (xmin, ymin, xmax, ymax).
        compression: Parquet compression codec (default "zstd").
        compression_level: Compression level (default 3).
        batch_size: Number of fragments to process per batch (streaming mode).
        partitioned: If True, write partitioned output (one file per zoom level).
        max_batch_bytes: Byte budget per batch (default 2 GB).
        simplify: If True (default), apply Douglas-Peucker simplification at
            coarse zoom levels for polygons and linestrings.

    Returns:
        Number of rows written.
    """
    has_streaming = hasattr(generator, "_init_parquet_stream")

    if not has_streaming:
        return _generate_parquet_inmemory(
            generator,
            output_path,
            world_bounds,
            compression=compression,
            compression_level=compression_level,
            simplify=simplify,
        )

    if partitioned:
        return _generate_parquet_partitioned_streaming(
            generator,
            output_path,
            world_bounds,
            compression=compression,
            compression_level=compression_level,
            batch_size=batch_size,
            max_file_bytes=max_file_bytes,
            max_batch_bytes=max_batch_bytes,
            simplify=simplify,
        )

    return _generate_parquet_single_streaming(
        generator,
        output_path,
        world_bounds,
        compression=compression,
        compression_level=compression_level,
        batch_size=batch_size,
        max_batch_bytes=max_batch_bytes,
        simplify=simplify,
    )
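
The dispatch in `generate_parquet` depends on two things: whether the generator exposes `_init_parquet_stream`, and the `partitioned` flag. The sketch below reproduces only that decision, using stub classes. `pick_writer`, `StreamingStub` and `LegacyStub` are hypothetical names chosen for illustration.

```python
def pick_writer(generator: object, partitioned: bool) -> str:
    """Return a label for the writer path that the source above would take."""
    if not hasattr(generator, "_init_parquet_stream"):
        # No streaming support: the in-memory writer is used regardless
        # of the partitioned flag.
        return "inmemory"
    return "partitioned_streaming" if partitioned else "single_streaming"


class StreamingStub:
    """Stand-in for a generator that supports streaming output."""

    def _init_parquet_stream(self) -> None:
        pass


class LegacyStub:
    """Stand-in for a generator without streaming support."""
```

One consequence visible in the source: a generator without streaming support falls back to the in-memory writer even when `partitioned=True`, and `batch_size`, `max_file_bytes` and `max_batch_bytes` are not forwarded on that path.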

read_pbf ¶

read_pbf(
    path: str | Path,
    world_bounds: tuple[float, float, float, float],
    *,
    zoom: int | None = None,
    tile_x: int | None = None,
    tile_y: int | None = None,
) -> list[dict]

Read PBF tiles back to feature dicts.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Directory containing `{z}/{x}/{y}.pbf` tile tree.	required
`world_bounds`	`tuple[float, float, float, float]`	`(xmin, ymin, xmax, ymax)` used during tile generation.	required
`zoom`	`int \| None`	Filter to a specific zoom level.	`None`
`tile_x`	`int \| None`	Filter to a specific tile X.	`None`
`tile_y`	`int \| None`	Filter to a specific tile Y.	`None`

Returns:

Type	Description
`list[dict]`	List of dicts with keys: zoom, tile_x, tile_y, feature_id, geom_type, positions (numpy float32 array), ring_lengths, tags.

Source code in src/mudm_tools/tiling2d/pbf_reader.py

def read_pbf(
    path: str | Path,
    world_bounds: tuple[float, float, float, float],
    *,
    zoom: int | None = None,
    tile_x: int | None = None,
    tile_y: int | None = None,
) -> list[dict]:
    """Read PBF tiles back to feature dicts.

    Args:
        path: Directory containing ``{z}/{x}/{y}.pbf`` tile tree.
        world_bounds: ``(xmin, ymin, xmax, ymax)`` used during tile generation.
        zoom: Filter to a specific zoom level.
        tile_x: Filter to a specific tile X.
        tile_y: Filter to a specific tile Y.

    Returns:
        List of dicts with keys: zoom, tile_x, tile_y, feature_id,
        geom_type, positions (numpy float32 array), ring_lengths, tags.
    """
    from mudm_tools._rs import read_pbf as _read_pbf

    return _read_pbf(str(path), world_bounds, zoom, tile_x, tile_y)
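
`read_pbf` delegates to the compiled reader, so the filtering itself happens in Rust. To inspect which tile files a given filter would touch, a directory walk over the `{z}/{x}/{y}.pbf` layout is enough. This is a stdlib-only sketch; `list_tiles` is a hypothetical helper and makes no claim about how the Rust reader enumerates files.

```python
import tempfile
from pathlib import Path


def list_tiles(root, zoom=None, tile_x=None, tile_y=None):
    """Return sorted (z, x, y) tuples for .pbf files under root that match the filters."""
    found = []
    for f in Path(root).glob("*/*/*.pbf"):
        try:
            z, x, y = int(f.parent.parent.name), int(f.parent.name), int(f.stem)
        except ValueError:
            continue  # skip anything that does not follow {z}/{x}/{y}.pbf
        if zoom is not None and z != zoom:
            continue
        if tile_x is not None and x != tile_x:
            continue
        if tile_y is not None and y != tile_y:
            continue
        found.append((z, x, y))
    return sorted(found)


# Build a tiny throwaway tile tree (empty files) and query it.
with tempfile.TemporaryDirectory() as tmp:
    for z, x, y in [(0, 0, 0), (1, 0, 0), (1, 1, 0)]:
        tile = Path(tmp) / str(z) / str(x) / f"{y}.pbf"
        tile.parent.mkdir(parents=True, exist_ok=True)
        tile.touch()
    (Path(tmp) / "metadata.json").touch()  # sits at the root, so it is ignored
    all_tiles = list_tiles(tmp)
    zoom1_tiles = list_tiles(tmp, zoom=1)
```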

generate_pbf ¶

generate_pbf(
    generator,
    output_path: str | Path,
    world_bounds: tuple[float, float, float, float],
    *,
    extent: int = 4096,
    simplify: bool = True,
    layer_name: str = "geojsonLayer",
) -> int

Generate PBF vector tiles from a StreamingTileGenerator2D.

Parameters:

Name	Type	Description	Default
`generator`		A `StreamingTileGenerator2D` instance with features added.	required
`output_path`	`str \| Path`	Directory to write tiles into ({z}/{x}/{y}.pbf).	required
`world_bounds`	`tuple[float, float, float, float]`	`(xmin, ymin, xmax, ymax)` in world coordinates.	required
`extent`	`int`	MVT tile extent (default 4096).	`4096`
`simplify`	`bool`	Whether to apply Douglas-Peucker simplification at coarse zooms.	`True`
`layer_name`	`str`	MVT layer name (default "geojsonLayer").	`'geojsonLayer'`

Returns:

Type	Description
`int`	Number of tiles written.

Source code in src/mudm_tools/tiling2d/pbf_writer.py

def generate_pbf(
    generator,
    output_path: str | Path,
    world_bounds: tuple[float, float, float, float],
    *,
    extent: int = 4096,
    simplify: bool = True,
    layer_name: str = "geojsonLayer",
) -> int:
    """Generate PBF vector tiles from a StreamingTileGenerator2D.

    Args:
        generator: A ``StreamingTileGenerator2D`` instance with features added.
        output_path: Directory to write tiles into ({z}/{x}/{y}.pbf).
        world_bounds: ``(xmin, ymin, xmax, ymax)`` in world coordinates.
        extent: MVT tile extent (default 4096).
        simplify: Whether to apply Douglas-Peucker simplification at coarse zooms.
        layer_name: MVT layer name (default "geojsonLayer").

    Returns:
        Number of tiles written.
    """
    out = Path(output_path)
    out.mkdir(parents=True, exist_ok=True)
    return generator.generate_pbf(
        str(out),
        world_bounds,
        extent,
        simplify,
        layer_name,
    )
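
Two small calculations come up when choosing `extent` and the generator's `buffer`: converting a pixel buffer into the normalized units the constructor expects (64 px at extent 4096 is 64/4096), and bounding how many tiles a pyramid can contain. The helpers below are hypothetical. The tile bound assumes a full quadtree, so the count returned by `generate_pbf` may be lower if empty tiles are not written (an assumption, not confirmed here).

```python
def buffer_px_to_normalized(buffer_px: float, extent: int = 4096) -> float:
    """Convert a buffer given in tile pixels to normalized units for a given MVT extent."""
    return buffer_px / extent


def max_tile_count(min_zoom: int, max_zoom: int) -> int:
    """Upper bound on tiles in a full quadtree: 4**z tiles at each zoom level z."""
    return sum(4 ** z for z in range(min_zoom, max_zoom + 1))


normalized_buffer = buffer_px_to_normalized(64, 4096)
tile_upper_bound = max_tile_count(0, 7)
```

For zooms 0 through 7 the bound is 21845 tiles, which is a useful sanity check against the integer that `generate_pbf` returns.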
