Format Converters

mudm-tools ships a small converter registry that turns common source formats into ready-to-serve muDM tiled output — MVT vector tiles, partitioned Parquet, and (for imaging data) a PNG raster pyramid. Three converters are built in:

| Format | Source | Output | Backend |
| --- | --- | --- | --- |
| xenium | 10x Genomics Xenium output bundle | MVT + Parquet + raster pyramid | StreamingTileGenerator2D |
| obj | Wavefront OBJ mesh files | octree 3D Tiles (GLB/Meshopt) + Parquet | StreamingTileGenerator |
| geojson | GeoJSON file or directory | quadtree MVT + Parquet | StreamingTileGenerator2D |

All three delegate the heavy lifting to the Rust extension mudm_tools._rs and to mudm_tools.tiling2d.generate_pbf. You drive them through one of two interfaces — the Python convert() function or the python -m mudm_tools.converters.cli command line.

No mudm convert console script

The CLI module's prog name and in-code docstrings say mudm convert ..., but no mudm console script is installed. The only console entry point in this package is mudm-serve. Always invoke the converter CLI as python -m mudm_tools.converters.cli ....

Quick start

from mudm_tools.converters import convert, list_formats

# Discover what's registered
print(list_formats())  # ['geojson', 'obj', 'xenium']

# Convert a GeoJSON file to MVT + Parquet
result = convert(
    "geojson",
    input_dir="annotations.geojson",
    output_dir="tiles/annotations",
    config={"max_zoom": 7, "bounds": (0.0, 0.0, 10000.0, 10000.0)},
)
print(result["feature_count"])

The equivalent from the command line:

# Discover what's registered
uv run python -m mudm_tools.converters.cli list-formats

# Convert a GeoJSON file to MVT + Parquet
uv run python -m mudm_tools.converters.cli convert \
    --format geojson \
    --input annotations.geojson \
    --output tiles/annotations \
    --max-zoom 7

The Python API

Two top-level functions live in mudm_tools.converters.

convert

convert(
    format: str,
    input_dir: str,
    output_dir: str,
    config: dict[str, Any] | None = None,
) -> dict

Looks up format in the registry, instantiates the registered converter class, and calls converter.convert(input_dir, output_dir, config or {}).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | str | required | One of "xenium", "obj", "geojson". |
| input_dir | str | required | Path to the source data directory or file. |
| output_dir | str | required | Path for tiled output (parents created as needed). |
| config | dict[str, Any] \| None | None | Converter-specific settings; None is coerced to {}. |

If format is not registered, convert() raises ValueError with the message Unknown format '<format>'. Available: geojson, obj, xenium.

The return dict shape is converter-dependent

There is no uniform result schema. All three converters return total_time (float seconds) and a timings dict, but:

  • Xenium returns layer_counts (dict[str, int]) and tile_count (int).
  • OBJ and GeoJSON return feature_count (int) — and have no layer_counts or tile_count.

Always read the keys for the converter you called. See the per-converter sections below.
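
Because the keys differ, code that handles results from more than one converter has to branch on what is present. The following is a minimal sketch that assumes only the keys listed above; total_features is a hypothetical helper, not part of mudm_tools, and the two dicts are made-up stand-ins for real results.

```python
from typing import Any


def total_features(result: dict[str, Any]) -> int:
    """Return one feature total from any built-in converter's result dict."""
    if "layer_counts" in result:
        # Xenium shape: sum the per-layer counts.
        return sum(result["layer_counts"].values())
    if "feature_count" in result:
        # OBJ / GeoJSON shape: a single integer.
        return result["feature_count"]
    raise KeyError("result has neither 'layer_counts' nor 'feature_count'")


# Made-up stand-ins for the two result shapes.
xenium_like = {
    "total_time": 1.0,
    "timings": {},
    "layer_counts": {"cells": 3, "nuclei": 2},
    "tile_count": 4,
}
geojson_like = {"total_time": 1.0, "timings": {}, "feature_count": 7}

print(total_features(xenium_like), total_features(geojson_like))
```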

list_formats

list_formats() -> list[str]

Returns the sorted list of registered format names. With the three built-in converters this is exactly ['geojson', 'obj', 'xenium'].

Autodoc

converters

muDM format converters — standardized entry points for source data ingestion.

Each converter transforms a specific source format into muDM tiled output (MVT + Parquet + optional raster tiles).

Usage

from mudm_tools.converters import convert

convert("xenium", input_dir="data/outs", output_dir="tiles/sample", config={"temp_dir": "/data/tmp"})

CLI

mudm convert --format xenium --input data/outs --output tiles/sample

convert

convert(
    format: str,
    input_dir: str,
    output_dir: str,
    config: dict[str, Any] | None = None,
) -> dict

Run a converter by format name.

Parameters:

Name Type Description Default
format str

Converter name (e.g., "xenium", "obj", "geojson").

required
input_dir str

Path to source data directory or file.

required
output_dir str

Path for tiled output.

required
config dict[str, Any] | None

Optional dict of converter-specific settings.

None

Returns:

Type Description
dict

Dict with conversion metadata (feature counts, timing, etc.).

Source code in src/mudm_tools/converters/__init__.py
def convert(
    format: str,
    input_dir: str,
    output_dir: str,
    config: dict[str, Any] | None = None,
) -> dict:
    """Run a converter by format name.

    Args:
        format: Converter name (e.g., "xenium", "obj", "geojson").
        input_dir: Path to source data directory or file.
        output_dir: Path for tiled output.
        config: Optional dict of converter-specific settings.

    Returns:
        Dict with conversion metadata (feature counts, timing, etc.).
    """
    if format not in _REGISTRY:
        available = ", ".join(sorted(_REGISTRY.keys()))
        raise ValueError(f"Unknown format {format!r}. Available: {available}")

    converter = _REGISTRY[format]()
    return converter.convert(input_dir, output_dir, config or {})

list_formats

list_formats() -> list[str]

Return registered converter format names.

Source code in src/mudm_tools/converters/__init__.py
def list_formats() -> list[str]:
    """Return registered converter format names."""
    return sorted(_REGISTRY.keys())

register

register(name: str)

Decorator to register a converter class.

Source code in src/mudm_tools/converters/__init__.py
def register(name: str):
    """Decorator to register a converter class."""

    def decorator(cls):
        _REGISTRY[name] = cls
        return cls

    return decorator

The CLI

The CLI exposes two subcommands.

python -m mudm_tools.converters.cli convert  --format <fmt> -i <in> -o <out> [options]
python -m mudm_tools.converters.cli list-formats

convert

| Flag | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --format | -f | str | required | Source format: xenium, obj, or geojson. |
| --input | -i | str | required | Path to source data directory or file. |
| --output | -o | str | required | Path for tiled output. |
| --config | -c | str | None | Path to a JSON config file with converter-specific settings. |
| --temp-dir | | str | None | Temp directory; injected as config["temp_dir"] when set. |
| --max-zoom | | int | None | Override max zoom; injected as config["max_zoom"] when set. |

On success, convert prints the result dict as Result: <pretty JSON>.

Config precedence

The --config JSON file is loaded first, then --temp-dir and --max-zoom overwrite the corresponding keys. Use the JSON file for converter-specific keys (bounds, tags, point_zoom_offset, …) that have no dedicated flag.
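
The precedence can be pictured as a plain dict merge. This is a sketch of the behaviour described above, not the CLI's actual code: build_config is a hypothetical name, and treating "when set" as "is not None" is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def build_config(config_path=None, temp_dir=None, max_zoom=None) -> dict[str, Any]:
    """Load the JSON file first, then let the dedicated flags overwrite its keys."""
    config: dict[str, Any] = {}
    if config_path is not None:
        config = json.loads(Path(config_path).read_text())
    # Assumption: a flag counts as "set" when it is not None.
    if temp_dir is not None:
        config["temp_dir"] = temp_dir
    if max_zoom is not None:
        config["max_zoom"] = max_zoom
    return config


# Demonstrate with a throwaway config file.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "geojson_config.json"
    config_file.write_text(
        json.dumps({"max_zoom": 4, "bounds": [0.0, 0.0, 100.0, 100.0]})
    )
    merged = build_config(config_file, max_zoom=8)

print(merged)
```

Here the flag value replaces the file's max_zoom, while bounds, which has no flag, passes through unchanged.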

# Inline overrides only
uv run python -m mudm_tools.converters.cli convert \
    --format xenium \
    --input data/Xenium_outs \
    --output tiles/xenium_sample \
    --temp-dir /data/tmp \
    --max-zoom 8

# Rich config from a JSON file
uv run python -m mudm_tools.converters.cli convert \
    --format obj \
    --input data/meshes \
    --output tiles/brain \
    --config obj_config.json

list-formats

$ uv run python -m mudm_tools.converters.cli list-formats
  geojson
  obj
  xenium

Autodoc

cli

Unified CLI for muDM format conversion.

Usage

mudm convert --format xenium --input data/outs --output tiles/sample
mudm convert --format obj --input data/meshes --output tiles/brain
mudm convert --format geojson --input data/cells.geojson --output tiles/cells
mudm list-formats


Xenium converter

XeniumConverter (registered as "xenium") converts a 10x Genomics Xenium output bundle into a full muDM tiled tree: MVT vector tiles for cells/nuclei polygons and transcripts points, partitioned Parquet, and a PNG raster pyramid built from the DAPI morphology image.

The three layers are fixed:

| Layer | Source file | Geometry | Color |
| --- | --- | --- | --- |
| cells | cell_boundaries.parquet | polygon | #00ffff |
| nuclei | nucleus_boundaries.parquet | polygon | #00ff00 |
| transcripts | transcripts.parquet | point (x_location/y_location/feature_name) | #ff4444 |

Missing layer files are skipped with a printed message rather than raising. Polygon layers tile from min zoom 0; the transcripts layer starts deeper (see point_zoom_offset).

Extra dependencies required

The Xenium converter's runtime deps are gated behind the [xenium] extra: polars, tifffile, and pillow. Install them with:

pip install "mudm-tools[xenium]"
# or, for development:
uv run --extra xenium pytest

These are imported lazily inside the converter's methods, so import mudm_tools.converters.xenium (and hence import mudm_tools.converters) succeeds without the extra. You only need it to actually run .convert(). Note that numpy is imported at module top level, so it is a hard dependency of importing the module (it ships as a core dependency).

Config keys

All keys are read from config via config.get(...).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| temp_dir | str | system temp (tempfile.gettempdir()) | Temp dir for MVT/Parquet fragments. |
| max_zoom | int | derived from image size | Override raster max zoom. Default is ceil(log2(max(h, w) / 256)), or 7 if no morphology image. vector_max_zoom = max_zoom + 1. |
| point_zoom_offset | int | 3 | Transcripts layer min zoom = max(0, vector_max_zoom - point_zoom_offset). |
| id_column | str | "cell_id" | Boundary polygon ID column for cells and nuclei. |
| skip_raster | bool | false | Skip raster generation if a raster/ dir already exists (max_zoom inferred from existing tiles). |
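
The zoom arithmetic in the table is easier to follow with numbers. The sketch below reproduces only the formulas stated above plus the 256 * 2**max_zoom grid size used by the converter; zoom_plan and raster_max_zoom are hypothetical helper names, the image size is made up, and images smaller than one tile are not considered.

```python
import math


def raster_max_zoom(image_size=None, tile_size=256):
    """ceil(log2(max(h, w) / 256)), or 7 when there is no morphology image."""
    if image_size is None:
        return 7
    height, width = image_size
    return math.ceil(math.log2(max(height, width) / tile_size))


def zoom_plan(image_size=None, point_zoom_offset=3, max_zoom=None):
    """Derive the zoom levels the Xenium converter works with."""
    raster_max = max_zoom if max_zoom is not None else raster_max_zoom(image_size)
    vector_max = raster_max + 1
    return {
        "raster_max_zoom": raster_max,
        "vector_max_zoom": vector_max,
        "transcripts_min_zoom": max(0, vector_max - point_zoom_offset),
        "grid_size_px": 256 * 2**raster_max,
    }


# A made-up 34000 x 48000 px image: 48000 / 256 = 187.5, log2 is about 7.55.
plan = zoom_plan(image_size=(34000, 48000))
print(plan)
```

With the default offset of 3, transcripts in this example appear from zoom 6 upward while polygons are tiled from zoom 0.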

Return value

{
    "total_time": 412.7,                # float seconds
    "timings": {                        # dict: "raster" + per-layer {ingest, pbf, parquet}
        "raster": 38.1,
        "cells": {"ingest": 12.0, "pbf": 9.4, "parquet": 5.2},
        "nuclei": {"ingest": 11.6, "pbf": 9.1, "parquet": 5.0},
        "transcripts": {"ingest": 80.3, "pbf": 70.2, "parquet": 41.7},
    },
    "layer_counts": {                   # dict[str, int], per present layer
        "cells": 167780,
        "nuclei": 167780,
        "transcripts": 42638083,
    },
    "tile_count": 18421,                # int — merged MVT tiles
}

Example

from mudm_tools.converters import convert

result = convert(
    "xenium",
    input_dir="data/Xenium_outs",
    output_dir="tiles/xenium_sample",
    config={
        "temp_dir": "/data/tmp",
        "point_zoom_offset": 3,   # transcripts only at detailed zooms
        "id_column": "cell_id",
        "skip_raster": False,
    },
)
print(result["layer_counts"])  # {'cells': 167780, 'nuclei': 167780, 'transcripts': 42638083}
print(result["tile_count"])

The equivalent from the command line:

uv run python -m mudm_tools.converters.cli convert \
    --format xenium \
    --input data/Xenium_outs \
    --output tiles/xenium_sample \
    --temp-dir /data/tmp

With a config file for the Xenium-specific keys:

xenium_config.json
{
  "temp_dir": "/data/tmp",
  "point_zoom_offset": 3,
  "id_column": "cell_id",
  "skip_raster": false
}

uv run python -m mudm_tools.converters.cli convert \
    --format xenium \
    --input data/Xenium_outs \
    --output tiles/xenium_sample \
    --config xenium_config.json

Output structure

output_dir/
  metadata.json            # name, platform, um_per_px, bounds_um, raster{}, vectors{layers}, parquet{}
  gene_list.json           # sorted unique transcript feature names (only if transcripts.parquet present)
  raster/                  # PNG tile pyramid (only if morphology image present / not skipped)
    {z}/{x}/{y}.png        # 256x256 grayscale ("L") DAPI tiles
  vectors/                 # merged multi-layer MVT
    metadata.json          # TileJSON 3.0.0 (vector_layers: cells, nuclei, transcripts)
    {z}/{x}/{y}.pbf
  features.parquet/        # partitioned Parquet
    zoom={z}/<layer>_<part>.parquet

The resulting tree is exactly what mudm-serve expects. See the 2D Tiling guide for the tile model and the TileJSON reference for the vectors/metadata.json schema.
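
A quick sanity check after a conversion is to count the vector tiles per zoom level. The sketch below relies only on the {z}/{x}/{y}.pbf layout shown above; count_tiles_by_zoom is a hypothetical helper, and the demo builds an empty stand-in tree instead of reading real converter output.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_tiles_by_zoom(vectors_dir):
    """Count .pbf tiles per zoom level under a {z}/{x}/{y}.pbf tree."""
    root = Path(vectors_dir)
    counts = Counter()
    for tile in root.glob("*/*/*.pbf"):
        # The first path component below the root is the zoom level.
        zoom = int(tile.relative_to(root).parts[0])
        counts[zoom] += 1
    return dict(sorted(counts.items()))


# Build a tiny stand-in tree so the sketch runs without real output.
with tempfile.TemporaryDirectory() as tmp:
    vectors = Path(tmp) / "vectors"
    for z, x, y in [(0, 0, 0), (1, 0, 0), (1, 1, 0)]:
        tile = vectors / str(z) / str(x) / f"{y}.pbf"
        tile.parent.mkdir(parents=True, exist_ok=True)
        tile.write_bytes(b"")
    (vectors / "metadata.json").write_text("{}")  # not matched by the glob
    tile_counts = count_tiles_by_zoom(vectors)

print(tile_counts)
```

For a Xenium conversion, the sum of these counts should correspond to the tile_count value in the result dict, since both count merged tiles.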

Building a FeatureCollection directly: xenium_to_mudm

mudm_tools.converters.xenium also exposes a lower-level helper that is not part of the registry. Use it when you want an in-memory muDM object instead of tiles.

xenium_to_mudm(
    cell_boundaries_path: Path | str,
    cell_feature_matrix_path: Path | str,
    cells_path: Path | str | None = None,
    cell_type_annotations: Path | str | None = None,
    max_cells: int | None = None,
)  # -> mudm.model.MuDMFeatureCollection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cell_boundaries_path | Path \| str | required | cell_boundaries.parquet (cell_id, vertex_x, vertex_y). |
| cell_feature_matrix_path | Path \| str | required | The cell_feature_matrix/ directory (matrix.mtx.gz + barcodes.tsv.gz + features.tsv.gz). A .zarr.zip path is accepted only if a sibling cell_feature_matrix/ dir exists, else raises ValueError. |
| cells_path | Path \| str \| None | None | cells.parquet summary (centroids/counts/areas). |
| cell_type_annotations | Path \| str \| None | None | 10x clusters.csv (Barcode, Cluster) → properties["cluster_id"]. |
| max_cells | int \| None | None | Truncate to the first N cell IDs in sorted order; None = all. |

It builds one closed-polygon MuDMFeature per cell, stores the per-cell expression vector as a JSON-encoded string under properties["expression"] (to survive the map<utf8,utf8> Parquet tags schema), and sets collection-level properties {platform: "xenium", crs: {type: "physical", units: "micrometers"}, gene_panel_dimension, gene_panel}. Coordinates stay in physical micrometres (Xenium native, not normalized). Cells whose ring has fewer than 4 positions after closure are skipped.
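
Two of these rules, ring closure and the JSON-encoded expression vector, can be illustrated in isolation. This is a sketch under stated assumptions rather than the function's real code: close_ring and encode_expression are hypothetical names, and whether the source Parquet already repeats the first vertex is not assumed either way.

```python
import json


def close_ring(vertices):
    """Close a polygon ring; return None if it has fewer than 4 positions."""
    ring = [tuple(v) for v in vertices]
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])  # repeat the first position to close the ring
    return ring if len(ring) >= 4 else None


def encode_expression(counts):
    """Store the per-cell counts as a JSON string so they fit a string-valued map."""
    return json.dumps([int(c) for c in counts])


triangle = close_ring([(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)])
degenerate = close_ring([(0.0, 0.0), (1.0, 1.0)])  # only 3 positions once closed

props = {"expression": encode_expression([0, 3, 1])}
decoded = json.loads(props["expression"])  # how a consumer reads it back

print(triangle, degenerate, props, decoded)
```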

Dependencies for xenium_to_mudm

xenium_to_mudm relies only on packages that are already core dependencies of mudm-tools: mudm, geojson_pydantic, pyarrow, numpy, and scipy (all declared in [project].dependencies). It therefore needs no extra install beyond the base package. The [xenium] extra (polars, tifffile, pillow) is required only by XeniumConverter.convert(), not by xenium_to_mudm itself.


OBJ converter

ObjConverter (registered as "obj") converts a directory of Wavefront OBJ meshes into octree-tiled 3D Tiles (GLB/Meshopt) plus optional partitioned Parquet, using mudm_tools._rs.StreamingTileGenerator. Ingest is parallelized across files with Rayon, and world bounds are auto-derived via scan_obj_bounds when not supplied.

Config keys

All keys are read from config via config.get(...).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| temp_dir | str | None | Temp dir for fragments (passed straight to the Rust generator; None uses the generator's own default). |
| max_zoom | int | 4 | Octree max zoom level. |
| min_zoom | int | 0 | Octree min zoom level. |
| bounds | tuple | auto-scan | World bounds: 6-tuple (xmin, ymin, zmin, xmax, ymax, zmax). Auto-derived via scan_obj_bounds when omitted. |
| tags | dict | {} | Map of filename-stem → property dict; files with no entry get {"name": <stem>}. |
| glob | str | "*.obj" | Glob pattern for selecting OBJ files inside input_dir. |
| generate_parquet | bool | true | Also emit Parquet (generate_parquet_native with simplify=True). |
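
The tags fallback is worth seeing concretely. The sketch below mirrors the rule in the table (look up the filename stem, otherwise use {"name": <stem>}); resolve_tags is a hypothetical helper and the file names are made up.

```python
from pathlib import Path


def resolve_tags(obj_paths, tags=None):
    """Map each file's stem to its property dict, defaulting to {"name": stem}."""
    tags = tags or {}
    resolved = {}
    for path in obj_paths:
        stem = Path(path).stem
        resolved[stem] = tags.get(stem, {"name": stem})
    return resolved


resolved = resolve_tags(
    ["data/meshes/neuron_001.obj", "data/meshes/neuron_002.obj"],
    tags={"neuron_001": {"name": "L5 pyramidal", "region": "MOp"}},
)
print(resolved)
```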

Bounds are 3D for OBJ

OBJ bounds is a 6-tuple (it includes zmin/zmax), unlike the 4-tuple used by the GeoJSON converter. Leaving it unset triggers a full bounds scan of every OBJ file.
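
If you want to avoid the scan on repeated runs, you can compute the 6-tuple once and pass it as bounds. The following pure-Python sketch is not scan_obj_bounds; it only reads the standard OBJ vertex lines ("v x y z") and returns the tuple in the order the config expects. obj_bounds is a hypothetical name and the mesh text is made up.

```python
def obj_bounds(obj_text: str):
    """Return (xmin, ymin, zmin, xmax, ymax, zmax) from OBJ vertex lines."""
    xs, ys, zs = [], [], []
    for line in obj_text.splitlines():
        parts = line.split()
        # Geometric vertices start with "v"; "vn"/"vt" lines are skipped.
        if len(parts) >= 4 and parts[0] == "v":
            x, y, z = (float(p) for p in parts[1:4])
            xs.append(x)
            ys.append(y)
            zs.append(z)
    if not xs:
        raise ValueError("no vertex lines found")
    return (min(xs), min(ys), min(zs), max(xs), max(ys), max(zs))


mesh = """\
# a single made-up triangle
v 0.0 0.0 0.0
v 10.0 0.0 5.0
v 0.0 20.0 -1.0
vn 0.0 0.0 1.0
f 1 2 3
"""
bounds = obj_bounds(mesh)
print(bounds)
```

For a directory of files, take the minimum and maximum across the per-file tuples and pass the result as config["bounds"].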

Return value

{
    "total_time": 22.4,        # float seconds
    "feature_count": 237,      # int — number of feature ids from add_obj_files
    "timings": {               # dict
        "ingest": 6.1,
        "tiles": 12.8,
        "parquet": 3.5,        # 0 when generate_parquet is False
    },
}

There is no layer_counts or tile_count key. If no files match the glob, convert() raises FileNotFoundError.

Example

from mudm_tools.converters import convert

result = convert(
    "obj",
    input_dir="data/meshes/",
    output_dir="tiles/brain",
    config={
        "max_zoom": 4,
        "min_zoom": 0,
        "temp_dir": "/data/tmp",
        "tags": {
            "neuron_001": {"name": "L5 pyramidal", "region": "MOp"},
            "neuron_002": {"name": "interneuron", "region": "MOp"},
        },
        "generate_parquet": True,
    },
)
print(result["feature_count"])

The equivalent from the command line:

uv run python -m mudm_tools.converters.cli convert \
    --format obj \
    --input data/meshes \
    --output tiles/brain \
    --max-zoom 4 \
    --temp-dir /data/tmp

Per-file tags and 3D bounds have no dedicated flags — pass them via --config:

obj_config.json
{
  "max_zoom": 4,
  "min_zoom": 0,
  "bounds": [0.0, 0.0, 0.0, 8192.0, 8192.0, 8192.0],
  "tags": {
    "neuron_001": {"name": "L5 pyramidal", "region": "MOp"}
  },
  "generate_parquet": true
}

Output structure

output_dir/
  3dtiles/                 # octree 3D Tiles (tileset.json + GLB/Meshopt content) from generate_3dtiles
  features.parquet/        # partitioned Parquet (only when generate_parquet=True)
    zoom={z}/...

For the full 3D Tiles model, compression options, and viewer, see the 3D Tiling guide.


GeoJSON converter

GeoJsonConverter (registered as "geojson") converts a single GeoJSON file or a directory of GeoJSON files into quadtree-tiled MVT vector tiles plus partitioned Parquet, using mudm_tools._rs.StreamingTileGenerator2D and mudm_tools.tiling2d.generate_pbf. A single file is ingested with add_geojson(text, bounds); a directory is ingested with add_geojson_files([paths], bounds).

Config keys

All keys are read from config via config.get(...).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| temp_dir | str | system temp (tempfile.gettempdir()) | Temp dir for fragments. |
| max_zoom | int | 7 | Quadtree max zoom level. |
| min_zoom | int | 0 | Quadtree min zoom level. |
| bounds | tuple | auto-compute (stub) | World bounds: 4-tuple (xmin, ymin, xmax, ymax). |
| layer_name | str | "features" | MVT layer name. |
| glob | str | "*.geojson" | Glob pattern, only used when input_dir is a directory. |

Pass bounds explicitly

The GeoJSON converter's internal bounds scanner is currently a stub: _update_bounds_from_coords does nothing, so when bounds is omitted the auto-computed bounds fall back to (0, 0, 1, 1). That will mis-tile any real dataset. Always supply bounds explicitly as a 4-tuple covering your data's world extent.
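
One way to obtain that 4-tuple is to scan the coordinates yourself before calling convert(). The sketch below walks a parsed GeoJSON object in pure Python; geojson_bounds and its helpers are hypothetical names, not part of mudm_tools, and the sample collection is made up.

```python
def _geometries(obj):
    """Yield every geometry object reachable from a GeoJSON object."""
    kind = obj.get("type")
    if kind == "FeatureCollection":
        for feature in obj.get("features", []):
            yield from _geometries(feature)
    elif kind == "Feature":
        if obj.get("geometry"):
            yield from _geometries(obj["geometry"])
    elif kind == "GeometryCollection":
        for geom in obj.get("geometries", []):
            yield from _geometries(geom)
    elif "coordinates" in obj:
        yield obj


def _positions(coords):
    """Yield [x, y, ...] positions from arbitrarily nested coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for child in coords:
            yield from _positions(child)


def geojson_bounds(obj):
    """Return (xmin, ymin, xmax, ymax) over all positions in the object."""
    xs, ys = [], []
    for geom in _geometries(obj):
        for pos in _positions(geom["coordinates"]):
            xs.append(float(pos[0]))
            ys.append(float(pos[1]))
    if not xs:
        raise ValueError("no coordinates found")
    return (min(xs), min(ys), max(xs), max(ys))


collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Point", "coordinates": [10, 20]}},
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [100, 0], [100, 50], [0, 0]]]}},
    ],
}
bounds = geojson_bounds(collection)
print(bounds)
```

The resulting tuple can be passed directly as config["bounds"]; adding a small margin is a design choice left to the caller.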

Return value

{
    "total_time": 3.9,         # float seconds
    "feature_count": 5821,     # int — len of ids from add_geojson / add_geojson_files
    "timings": {               # dict
        "ingest": 0.8,
        "pbf": 2.4,
        "parquet": 0.7,
    },
}

There is no layer_counts or tile_count key. If no GeoJSON files are found, convert() raises FileNotFoundError.

Example

from mudm_tools.converters import convert

# Single file — supply bounds explicitly!
result = convert(
    "geojson",
    input_dir="annotations.geojson",
    output_dir="tiles/annotations",
    config={
        "max_zoom": 7,
        "min_zoom": 0,
        "layer_name": "annotations",
        "bounds": (0.0, 0.0, 10000.0, 10000.0),
    },
)
print(result["feature_count"])

# A directory of GeoJSON files
result = convert(
    "geojson",
    input_dir="data/regions/",
    output_dir="tiles/regions",
    config={"glob": "*.geojson", "bounds": (0.0, 0.0, 10000.0, 10000.0)},
)

From the command line:

uv run python -m mudm_tools.converters.cli convert \
    --format geojson \
    --input annotations.geojson \
    --output tiles/annotations \
    --max-zoom 7 \
    --config geojson_config.json

bounds and layer_name have no dedicated flags — pass them via --config:

geojson_config.json
{
  "max_zoom": 7,
  "min_zoom": 0,
  "layer_name": "annotations",
  "bounds": [0.0, 0.0, 10000.0, 10000.0]
}

Output structure

output_dir/
  vectors/                 # MVT quadtree tiles
    {z}/{x}/{y}.pbf
  features.parquet/        # partitioned Parquet
    zoom={z}/...

See the 2D Tiling guide for the tile model and TileJSON reference for vector metadata.


Extending the registry

Adding a new converter is the same pattern the built-ins use: write a class with a convert(self, input_dir, output_dir, config) -> dict method, decorate it with @register("name"), and import the module so registration runs at import time.

from mudm_tools.converters import register

@register("myformat")
class MyConverter:
    def convert(self, input_dir: str, output_dir: str, config: dict) -> dict:
        # ... do work, write tiles to output_dir ...
        return {"total_time": 0.0, "feature_count": 0, "timings": {}}

The register decorator stores the class in the module-level _REGISTRY and returns the class unchanged. Once the module is imported, list_formats() will include "myformat" and convert("myformat", ...) will dispatch to it. The built-in converters register themselves via the from . import xenium / obj / geojson lines at the bottom of mudm_tools/converters/__init__.py.
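
A slightly fuller skeleton shows the conventions the built-ins follow: read settings with config.get(...), time each phase, raise FileNotFoundError when nothing matches, and return total_time, timings, and a count. Everything here is illustrative: LineCountConverter, its "one non-empty line is one feature" rule, and the summary.txt output are hypothetical, and the @register decorator is left off so the sketch runs without mudm_tools installed.

```python
import tempfile
import time
from pathlib import Path
from typing import Any


class LineCountConverter:
    """Hypothetical converter: each non-empty line of a *.txt file is a feature."""

    def convert(self, input_dir: str, output_dir: str, config: dict[str, Any]) -> dict:
        out_dir = Path(output_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        pattern = config.get("glob", "*.txt")

        timings: dict[str, float] = {}
        t_start = time.time()

        # Phase 1: ingest
        t0 = time.time()
        files = sorted(Path(input_dir).glob(pattern))
        if not files:
            raise FileNotFoundError(f"no files matching {pattern!r} in {input_dir}")
        feature_count = sum(
            1 for f in files for line in f.read_text().splitlines() if line.strip()
        )
        timings["ingest"] = time.time() - t0

        # Phase 2: write output (a real converter would write tiles here)
        t0 = time.time()
        (out_dir / "summary.txt").write_text(f"{feature_count}\n")
        timings["write"] = time.time() - t0

        return {
            "total_time": time.time() - t_start,
            "feature_count": feature_count,
            "timings": timings,
        }


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    src.mkdir()
    (src / "a.txt").write_text("one\ntwo\n\n")
    (src / "b.txt").write_text("three\n")
    result = LineCountConverter().convert(str(src), str(Path(tmp) / "out"), {})

print(result["feature_count"], sorted(result["timings"]))
```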

Module reference

xenium

Xenium spatial transcriptomics → muDM tiled format.

Converts 10x Genomics Xenium output (boundaries, transcripts, DAPI image) into MVT vector tiles, partitioned Parquet, and a PNG raster tile pyramid.

Source files

cell_boundaries.parquet: polygon vertices (cell_id, vertex_x, vertex_y)
nucleus_boundaries.parquet: polygon vertices
transcripts.parquet: point detections (x_location, y_location, feature_name)
morphology_focus.ome.tif: DAPI fluorescence image
experiment.xenium: metadata (pixel_size)
cells.parquet: per-cell summary (cell_id, x_centroid, y_centroid, …)
cell_feature_matrix/: sparse expression matrix (cells × features)
├── matrix.mtx.gz
├── barcodes.tsv.gz
└── features.tsv.gz
analysis/clustering/.../clusters.csv: graph clustering assignments

XeniumConverter

Convert 10x Genomics Xenium data to muDM tiled format.

convert

convert(
    input_dir: str, output_dir: str, config: dict[str, Any]
) -> dict

Run the full Xenium → muDM conversion.

Config keys

temp_dir (str): Temp directory for fragments. Default: system temp.
max_zoom (int): Override max zoom level. Default: derived from image.
point_zoom_offset (int): Transcripts start at max_zoom - offset. Default: 3.
id_column (str): Boundary ID column name. Default: "cell_id".
skip_raster (bool): Skip raster tile generation. Default: False.

Source code in src/mudm_tools/converters/xenium.py
def convert(
    self,
    input_dir: str,
    output_dir: str,
    config: dict[str, Any],
) -> dict:
    """Run the full Xenium → muDM conversion.

    Config keys:
        temp_dir (str): Temp directory for fragments. Default: system temp.
        max_zoom (int): Override max zoom level. Default: derived from image.
        point_zoom_offset (int): Transcripts start at max_zoom - offset. Default: 3.
        id_column (str): Boundary ID column name. Default: "cell_id".
        skip_raster (bool): Skip raster tile generation. Default: False.
    """
    from mudm_tools._rs import StreamingTileGenerator2D
    from mudm_tools.tiling2d import generate_pbf

    data_dir = Path(input_dir)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    temp_dir = config.get("temp_dir", tempfile.gettempdir())
    max_zoom_override = config.get("max_zoom")
    point_zoom_offset = config.get("point_zoom_offset", 3)
    id_column = config.get("id_column", "cell_id")
    skip_raster = config.get("skip_raster", False)

    timings: dict[str, float | dict[str, float]] = {}
    t_start = time.time()

    # Read pixel size
    experiment_path = data_dir / "experiment.xenium"
    um_per_px = self._read_um_per_px(experiment_path)
    print(f"Pixel size: {um_per_px} µm/px", flush=True)

    # Raster tiles
    raster_info, max_zoom = self._generate_raster(
        data_dir, out_dir, skip_raster, max_zoom_override
    )
    timings["raster"] = time.time() - t_start

    # Tile grid alignment
    vector_max_zoom = max_zoom + 1
    grid_size = 256.0 * (2**max_zoom)
    coord_scale = 1.0 / um_per_px
    tile_bounds = (0.0, 0.0, grid_size, grid_size)
    point_min_zoom = max(0, vector_max_zoom - point_zoom_offset)

    print(
        f"Tile grid: {int(grid_size)}×{int(grid_size)} px, "
        f"vector zoom 0-{vector_max_zoom} (raster 0-{max_zoom})",
        flush=True,
    )

    # Define layers
    layers = [
        ("cells", data_dir / "cell_boundaries.parquet", "polygon", id_column),
        ("nuclei", data_dir / "nucleus_boundaries.parquet", "polygon", id_column),
        ("transcripts", data_dir / "transcripts.parquet", "point", None),
    ]

    layer_counts = {}
    layer_fields = {}
    layer_min_zooms = {}
    layer_tmp_dirs = []

    for layer_name, parquet_path, geom_type, id_col in layers:
        if not parquet_path.exists():
            print(f"Skipping {layer_name}: {parquet_path.name} not found", flush=True)
            continue

        layer_min = point_min_zoom if geom_type == "point" else 0
        layer_min_zooms[layer_name] = layer_min

        gen = StreamingTileGenerator2D(
            min_zoom=layer_min,
            max_zoom=vector_max_zoom,
            buffer=64 / 4096.0,
            temp_dir=temp_dir,
        )

        print(
            f"Ingesting {layer_name} (zoom {layer_min}-{vector_max_zoom})...",
            end=" ",
            flush=True,
        )
        t0 = time.time()

        if geom_type == "point":
            count = gen.add_parquet_points(
                str(parquet_path),
                "x_location",
                "y_location",
                "feature_name",
                "gene_name",
                layer_name,
                tile_bounds,
                coord_scale,
            )
            layer_fields[layer_name] = {"gene_name": "String"}
        else:
            count = gen.add_parquet_polygons(
                str(parquet_path),
                id_col,
                "vertex_x",
                "vertex_y",
                layer_name,
                tile_bounds,
                coord_scale,
            )
            layer_fields[layer_name] = {"cell_id": "String"}

        layer_counts[layer_name] = count
        t_ingest = time.time() - t0
        print(f"{count:,} features ({t_ingest:.1f}s)", flush=True)

        # Encode PBF
        print("  Encoding PBF...", end=" ", flush=True)
        t0 = time.time()
        mvt_tmp = Path(tempfile.mkdtemp(dir=temp_dir, prefix=f"mvt_{layer_name}_"))
        generate_pbf(gen, str(mvt_tmp), tile_bounds, simplify=True, layer_name=layer_name)
        t_pbf = time.time() - t0
        print(f"done ({t_pbf:.1f}s)", flush=True)

        # Encode Parquet
        print("  Encoding Parquet...", end=" ", flush=True)
        t0 = time.time()
        pq_tmp = Path(tempfile.mkdtemp(dir=temp_dir, prefix=f"pq_{layer_name}_"))
        pq_rows = gen.generate_parquet_native(str(pq_tmp), tile_bounds, simplify=True)
        t_pq = time.time() - t0
        print(f"{pq_rows:,} rows ({t_pq:.1f}s)", flush=True)

        layer_tmp_dirs.append((layer_name, mvt_tmp, pq_tmp))
        timings[layer_name] = {"ingest": t_ingest, "pbf": t_pbf, "parquet": t_pq}

    # Merge MVT
    print("Merging MVT layers...", end=" ", flush=True)
    t0 = time.time()
    mvt_dir = out_dir / "vectors"
    mvt_dir.mkdir(parents=True, exist_ok=True)
    tile_files: dict[str, list[bytes]] = {}
    for layer_name, mvt_tmp, _ in layer_tmp_dirs:
        for pbf_path in mvt_tmp.rglob("*.pbf"):
            key = str(pbf_path.relative_to(mvt_tmp))
            tile_files.setdefault(key, []).append(pbf_path.read_bytes())
    for rel_path, chunks in tile_files.items():
        merged_path = mvt_dir / rel_path
        merged_path.parent.mkdir(parents=True, exist_ok=True)
        merged_path.write_bytes(b"".join(chunks))
    print(f"{len(tile_files)} tiles ({time.time() - t0:.1f}s)", flush=True)

    # Merge Parquet
    print("Merging Parquet partitions...", end=" ", flush=True)
    t0 = time.time()
    parquet_dir = out_dir / "features.parquet"
    parquet_dir.mkdir(parents=True, exist_ok=True)
    for layer_name, _, pq_tmp in layer_tmp_dirs:
        for zoom_dir in sorted(pq_tmp.glob("zoom=*")):
            target = parquet_dir / zoom_dir.name
            target.mkdir(exist_ok=True)
            for pq_file in zoom_dir.glob("*.parquet"):
                dest = target / f"{layer_name}_{pq_file.name}"
                shutil.move(str(pq_file), str(dest))
    print(f"done ({time.time() - t0:.1f}s)", flush=True)

    # Clean up temp
    for _, mvt_tmp, pq_tmp in layer_tmp_dirs:
        shutil.rmtree(mvt_tmp, ignore_errors=True)
        shutil.rmtree(pq_tmp, ignore_errors=True)

    # Write TileJSON
    tj = {
        "tilejson": "3.0.0",
        "version": "1.0.0",
        "name": "MuDM Vector Tiles",
        "description": "Multi-layer vector tiles generated by mudm",
        "tiles": ["{z}/{x}/{y}.pbf"],
        "minzoom": 0,
        "maxzoom": vector_max_zoom,
        "bounds": list(tile_bounds),
        "tile_count": len(tile_files),
        "vector_layers": [
            {
                "id": name,
                "fields": fields,
                "minzoom": layer_min_zooms.get(name, 0),
                "maxzoom": vector_max_zoom,
                "feature_count": layer_counts[name],
            }
            for name, fields in layer_fields.items()
        ],
    }
    (mvt_dir / "metadata.json").write_text(json.dumps(tj, indent=2))

    # Write metadata.json
    # Compute bounds_um from parquet files
    import polars as pl

    bounds_um = self._compute_bounds_um(data_dir, layers)

    vector_layers = []
    for name, fields in layer_fields.items():
        vector_layers.append(
            {
                "id": name,
                "name": name,
                "type": "point" if "gene_name" in fields else "polygon",
                "color": self.LAYER_COLORS.get(name, "#ffffff"),
                "min_zoom": layer_min_zooms.get(name, 0),
                "max_zoom": vector_max_zoom,
                "feature_count": layer_counts[name],
            }
        )

    metadata = {
        "name": data_dir.name,
        "platform": "xenium",
        "um_per_px": um_per_px,
        "bounds_um": list(bounds_um),
        "raster": {
            "path": "raster/{z}/{x}/{y}.png",
            "min_zoom": 0,
            "max_zoom": raster_info["max_zoom"],
            "tile_size": 256,
            "image_size_px": raster_info["image_size_px"],
        },
        "vectors": {"path": "vectors/{z}/{x}/{y}.pbf", "layers": vector_layers},
        "parquet": {"path": "features.parquet", "partitioned": True},
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))

    # Write gene_list.json for the viewer gene filter
    transcripts_path = data_dir / "transcripts.parquet"
    if transcripts_path.exists():
        df = pl.read_parquet(transcripts_path, columns=["feature_name"])
        if df.schema["feature_name"] == pl.Binary:
            df = df.with_columns(pl.col("feature_name").cast(pl.Utf8))
        genes = sorted(df["feature_name"].unique().to_list())
        (out_dir / "gene_list.json").write_text(json.dumps(genes))
        print(f"Wrote gene_list.json ({len(genes)} genes)", flush=True)

    total_time = time.time() - t_start
    print(f"Done. Output: {out_dir} ({total_time:.0f}s)", flush=True)

    return {
        "total_time": total_time,
        "timings": timings,
        "layer_counts": layer_counts,
        "tile_count": len(tile_files),
    }
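
The "Merge MVT" step in the listing above joins the per-layer tiles with b"".join(chunks). That relies on a general Protocol Buffers property: concatenating two serialized messages of the same type is equivalent to merging them, with repeated fields appended, and in the Mapbox Vector Tile schema the layers are a repeated field (number 3) of the top-level Tile message. The sketch below demonstrates this with hand-built bytes; the helper names are hypothetical and the payloads are opaque placeholders, not real encoded layers.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def wrap_layer(payload: bytes) -> bytes:
    """Wrap a payload as field 3, wire type 2 (length-delimited)."""
    tag = (3 << 3) | 2  # 0x1A
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def count_layers(tile: bytes) -> int:
    """Count top-level field-3 entries in a serialized tile."""
    pos = 0
    count = 0
    while pos < len(tile):
        tag, pos = read_varint(tile, pos)
        field, wire_type = tag >> 3, tag & 0x07
        if wire_type != 2:
            raise ValueError(f"unexpected wire type {wire_type}")
        length, pos = read_varint(tile, pos)
        pos += length  # skip the payload
        if field == 3:
            count += 1
    return count


# Placeholder payloads stand in for the encoded cells and transcripts layers.
cells_tile = wrap_layer(b"cells-placeholder")
transcripts_tile = wrap_layer(b"x" * 300)  # long enough for a two-byte length
merged = b"".join([cells_tile, transcripts_tile])

print(count_layers(cells_tile), count_layers(merged))
```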

xenium_to_mudm

xenium_to_mudm(
    cell_boundaries_path: Path | str,
    cell_feature_matrix_path: Path | str,
    cells_path: Path | str | None = None,
    cell_type_annotations: Path | str | None = None,
    max_cells: int | None = None,
)

Build a muDM FeatureCollection from a Xenium output bundle.

Each cell becomes a MuDMFeature whose geometry is the closed polygon produced by joining cell_boundaries.parquet rows on cell_id. The per-cell expression vector from cell_feature_matrix is stored under properties["expression"] as a JSON-encoded string of the list of integer counts (one entry per feature row in the matrix). This encoding is required because the downstream Parquet tags column is map<utf8, utf8> and the Rust ingest path silently drops array-valued properties; storing the list as a JSON string lets it round-trip through StreamingTileGenerator2D.add_geojson -> generate_parquet untouched. Consumers read it back via json.loads(props["expression"]).

Parameters:

Name Type Description Default
cell_boundaries_path Path | str

Path to cell_boundaries.parquet (vertex-per-row layout: cell_id, vertex_x, vertex_y).

required
cell_feature_matrix_path Path | str

Path to either the directory containing matrix.mtx.gz / barcodes.tsv.gz / features.tsv.gz (Xenium cell_feature_matrix/) or directly to a .zarr.zip archive (the latter requires the optional zarr extra).

required
cells_path Path | str | None

Optional path to cells.parquet (cell summary with centroids, total counts, areas). Used for cell-level metadata attachment when present.

None
cell_type_annotations Path | str | None

Optional path to a per-cell-cluster CSV (e.g. analysis/clustering/gene_expression_graphclust/clusters.csv). Each row is Barcode, Cluster. Attaches a cluster_id property and (when curated mappings are provided) Cell Ontology URIs in the future. The Xenium Rep1 "preview" dataset has no curated cell-type → ontology mapping; pass None to skip.

None
max_cells int | None

Truncate to the first N cells (in cell_id order) for smoke-test ingestion. Use None for the full dataset.

None

Returns:

MuDMFeatureCollection with one Polygon feature per cell. Coordinates are in physical micrometres (Xenium native).
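
The "closed polygon per cell" rule can be illustrated without any Xenium data. The sketch below groups vertex rows by cell_id in their original order, closes each ring, and drops rings with fewer than four positions. build_rings and the sample rows are hypothetical; the converter itself works on NumPy arrays read from Parquet.

```python
from collections import defaultdict


def build_rings(rows):
    """Group (cell_id, x, y) vertex rows into closed polygon rings."""
    groups = defaultdict(list)
    for cell_id, x, y in rows:
        groups[cell_id].append((float(x), float(y)))

    rings = {}
    for cell_id, ring in groups.items():
        # Close the ring if the last vertex does not repeat the first.
        if ring[0] != ring[-1]:
            ring = ring + [ring[0]]
        # GeoJSON polygons need at least 4 positions (3 distinct + closure).
        if len(ring) < 4:
            continue
        rings[cell_id] = ring
    return rings


rows = [
    ("a", 0, 0), ("a", 4, 0), ("a", 4, 3),  # open triangle, gets closed
    ("b", 1, 1), ("b", 2, 2),               # only two vertices, dropped
]
rings = build_rings(rows)
```

Cells with degenerate boundaries are skipped silently, so the feature count can be lower than the number of distinct cell_id values in the input.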

Source code in src/mudm_tools/converters/xenium.py
def xenium_to_mudm(
    cell_boundaries_path: Path | str,
    cell_feature_matrix_path: Path | str,
    cells_path: Path | str | None = None,
    cell_type_annotations: Path | str | None = None,
    max_cells: int | None = None,
):
    """Build a muDM FeatureCollection from a Xenium output bundle.

    Each cell becomes a ``MuDMFeature`` whose geometry is the closed polygon
    produced by joining ``cell_boundaries.parquet`` rows on ``cell_id``. The
    per-cell expression vector from ``cell_feature_matrix`` is stored under
    ``properties["expression"]`` as a **JSON-encoded string** of the list
    of integer counts (one entry per feature row in the matrix). This
    encoding is required because the downstream Parquet ``tags`` column is
    ``map<utf8, utf8>`` and the Rust ingest path silently drops array-valued
    properties; storing the list as a JSON string lets it round-trip
    through ``StreamingTileGenerator2D.add_geojson`` -> ``generate_parquet``
    untouched. Consumers read it back via ``json.loads(props["expression"])``.

    Args:
        cell_boundaries_path: Path to ``cell_boundaries.parquet`` (vertex-
            per-row layout: ``cell_id``, ``vertex_x``, ``vertex_y``).
        cell_feature_matrix_path: Path to either the directory containing
            ``matrix.mtx.gz`` / ``barcodes.tsv.gz`` / ``features.tsv.gz``
            (Xenium ``cell_feature_matrix/``) or directly to a ``.zarr.zip``
            archive (the latter requires the optional ``zarr`` extra).
        cells_path: Optional path to ``cells.parquet`` (cell summary with
            centroids, total counts, areas). Used for cell-level metadata
            attachment when present.
        cell_type_annotations: Optional path to a per-cell-cluster CSV (e.g.
            ``analysis/clustering/gene_expression_graphclust/clusters.csv``).
            Each row is ``Barcode, Cluster``. Attaches a ``cluster_id``
            property and (when curated mappings are provided) Cell Ontology
            URIs in the future. The Xenium ``Rep1`` "preview" dataset has
            no curated cell-type → ontology mapping; pass ``None`` to skip.
        max_cells: Truncate to the first N cells (in ``cell_id`` order) for
            smoke-test ingestion. Use ``None`` for the full dataset.

    Returns:
        ``MuDMFeatureCollection`` with one ``Polygon`` feature per cell.
        Coordinates are in physical micrometres (Xenium native).
    """
    # Lazy import muDM model classes — keeps converters/__init__ light.
    from mudm.model import MuDMFeature, MuDMFeatureCollection
    from geojson_pydantic import Polygon
    import pyarrow.parquet as pq

    cell_boundaries_path = Path(cell_boundaries_path)
    cell_feature_matrix_path = Path(cell_feature_matrix_path)
    cells_path = Path(cells_path) if cells_path is not None else None
    cell_type_annotations = (
        Path(cell_type_annotations) if cell_type_annotations is not None else None
    )

    # 1) Boundary polygons -----------------------------------------------------
    table = pq.read_table(cell_boundaries_path)
    cols = table.column_names
    if "cell_id" not in cols:
        raise ValueError(f"cell_boundaries.parquet missing 'cell_id' column (got: {cols})")
    if "vertex_x" not in cols or "vertex_y" not in cols:
        raise ValueError(f"cell_boundaries.parquet missing 'vertex_x'/'vertex_y' (got: {cols})")

    cell_id_arr = table.column("cell_id").to_numpy(zero_copy_only=False)
    vx_arr = table.column("vertex_x").to_numpy(zero_copy_only=False)
    vy_arr = table.column("vertex_y").to_numpy(zero_copy_only=False)

    # Cap to first N cells using a stable, sorted cell_id ordering.
    distinct_ids = np.unique(cell_id_arr)
    distinct_ids.sort()
    if max_cells is not None:
        distinct_ids = distinct_ids[:max_cells]
    selected = set(distinct_ids.tolist())

    # 2) Optional per-cell summary (centroid, total_counts, areas) ------------
    cell_summary: dict[Any, dict[str, Any]] = {}
    if cells_path is not None and cells_path.exists():
        cells_table = pq.read_table(cells_path)
        cell_summary = _extract_cell_summary(cells_table, selected)

    # 3) Expression matrix -----------------------------------------------------
    expression_by_cell, gene_panel = _load_xenium_expression(
        cell_feature_matrix_path, only_cells=selected
    )

    # 4) Optional cluster annotations -----------------------------------------
    cluster_by_cell: dict[Any, int] = {}
    if cell_type_annotations is not None and cell_type_annotations.exists():
        cluster_by_cell = _load_xenium_clusters(cell_type_annotations, selected)

    # 5) Build features --------------------------------------------------------
    # Group vertex rows by cell_id while preserving 10x's vertex order.
    # mask = membership of selected ids (vectorised) -> stable group-by.
    mask = np.isin(cell_id_arr, distinct_ids)
    sel_ids = cell_id_arr[mask]
    sel_vx = vx_arr[mask]
    sel_vy = vy_arr[mask]
    # Build groups indexed by first-occurrence order
    groups: dict[Any, list[tuple[float, float]]] = {}
    for cid, x, y in zip(sel_ids.tolist(), sel_vx.tolist(), sel_vy.tolist()):
        groups.setdefault(cid, []).append((float(x), float(y)))

    features: list[MuDMFeature] = []
    for cell_id, ring in groups.items():
        if not ring:
            continue
        if ring[0] != ring[-1]:
            ring.append(ring[0])
        if len(ring) < 4:
            # GeoJSON polygons need >=4 positions (3 distinct + closure)
            continue

        # Expression is JSON-encoded as a string so it survives the
        # tags: map<utf8, utf8> Parquet schema. Decode with json.loads.
        expr_list = expression_by_cell.get(cell_id, [])
        props: dict[str, Any] = {
            "cell_id": str(cell_id),
            "expression": json.dumps(expr_list, separators=(",", ":")),
        }
        summary = cell_summary.get(cell_id)
        if summary is not None:
            props.update({k: v for k, v in summary.items() if v is not None})
        if cell_id in cluster_by_cell:
            props["cluster_id"] = cluster_by_cell[cell_id]

        feat = MuDMFeature(
            type="Feature",
            geometry=Polygon(type="Polygon", coordinates=[ring]),  # type: ignore[list-item]  # geojson-pydantic accepts coord lists
            properties=props,
        )
        features.append(feat)

    fc = MuDMFeatureCollection(
        type="FeatureCollection",
        features=features,
        properties={
            "platform": "xenium",
            "crs": {"type": "physical", "units": "micrometers"},
            "gene_panel_dimension": len(gene_panel),
            "gene_panel": gene_panel,
        },
    )
    return fc

obj

OBJ mesh → muDM tiled 3D format.

Converts OBJ mesh files into octree-tiled 3D Tiles (GLB + Meshopt) and partitioned Parquet.

Source files

*.obj — Wavefront OBJ mesh files (one per feature/neuron/region)

ObjConverter

Convert OBJ mesh files to muDM tiled 3D format.

convert

convert(
    input_dir: str, output_dir: str, config: dict[str, Any]
) -> dict

Convert OBJ meshes to tiled 3D output.

Config keys

  • temp_dir (str): Temp directory for fragments.
  • max_zoom (int): Max zoom level. Default: 4.
  • min_zoom (int): Min zoom level. Default: 0.
  • bounds (tuple): World bounds (xmin,ymin,zmin,xmax,ymax,zmax). If not provided, scans all OBJ files.
  • tags (dict): Per-file tags. Keys are filenames (without .obj), values are dicts of properties.
  • glob (str): Glob pattern for OBJ files. Default: "*.obj".
  • generate_parquet (bool): Also generate Parquet. Default: True.
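
As a rough illustration of how the tags key is resolved, the sketch below builds a config dict and looks up each file's tags by stem, falling back to {"name": stem} when a file has no entry. resolve_tags, the file names, and the region property are hypothetical; the fallback mirrors the converter's tags_map.get(name, {"name": name}) lookup.

```python
from pathlib import Path


def resolve_tags(obj_files, tags_map):
    """Return one tags dict per file, keyed by filename without .obj."""
    resolved = []
    for f in obj_files:
        stem = Path(f).stem
        # Files without an explicit entry fall back to {"name": stem}.
        resolved.append(tags_map.get(stem, {"name": stem}))
    return resolved


# Hypothetical config for the obj converter.
config = {
    "max_zoom": 4,
    "min_zoom": 0,
    "glob": "*.obj",
    "generate_parquet": True,
    "tags": {"neuron_001": {"name": "neuron_001", "region": "cortex"}},
}

obj_files = ["meshes/neuron_001.obj", "meshes/neuron_002.obj"]
all_tags = resolve_tags(obj_files, config["tags"])
```

Note that an explicit entry replaces the default entirely, so include a name property in it if downstream consumers expect one.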

Source code in src/mudm_tools/converters/obj.py
def convert(
    self,
    input_dir: str,
    output_dir: str,
    config: dict[str, Any],
) -> dict:
    """Convert OBJ meshes to tiled 3D output.

    Config keys:
        temp_dir (str): Temp directory for fragments.
        max_zoom (int): Max zoom level. Default: 4.
        min_zoom (int): Min zoom level. Default: 0.
        bounds (tuple): World bounds (xmin,ymin,zmin,xmax,ymax,zmax).
            If not provided, scans all OBJ files.
        tags (dict): Per-file tags. Keys are filenames (without .obj),
            values are dicts of properties.
        glob (str): Glob pattern for OBJ files. Default: "*.obj".
        generate_parquet (bool): Also generate Parquet. Default: True.
    """
    from mudm_tools._rs import StreamingTileGenerator

    data_dir = Path(input_dir)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    temp_dir = config.get("temp_dir")
    max_zoom = config.get("max_zoom", 4)
    min_zoom = config.get("min_zoom", 0)
    bounds = config.get("bounds")
    tags_map = config.get("tags", {})
    glob_pattern = config.get("glob", "*.obj")
    do_parquet = config.get("generate_parquet", True)

    t_start = time.time()

    # Find OBJ files
    obj_files = sorted(data_dir.glob(glob_pattern))
    if not obj_files:
        raise FileNotFoundError(f"No {glob_pattern} files in {data_dir}")
    print(f"Found {len(obj_files)} OBJ files", flush=True)

    # Scan bounds if not provided
    if bounds is None:
        from mudm_tools._rs import scan_obj_bounds

        print("Scanning OBJ bounds...", end=" ", flush=True)
        t0 = time.time()
        bounds = scan_obj_bounds([str(f) for f in obj_files])
        print(f"done ({time.time()-t0:.1f}s)", flush=True)

    # Create generator
    gen = StreamingTileGenerator(
        min_zoom=min_zoom,
        max_zoom=max_zoom,
        temp_dir=temp_dir,
    )

    # Build tags list
    all_tags = []
    for f in obj_files:
        name = f.stem
        file_tags = tags_map.get(name, {"name": name})
        all_tags.append(file_tags)

    # Ingest all OBJ files (parallel Rayon)
    print(f"Ingesting {len(obj_files)} meshes...", end=" ", flush=True)
    t0 = time.time()
    fids = gen.add_obj_files(
        [str(f) for f in obj_files],
        bounds,
        all_tags,
    )
    t_ingest = time.time() - t0
    print(f"{len(fids)} features ({t_ingest:.1f}s)", flush=True)

    # Generate 3D Tiles
    print("Encoding 3D Tiles...", end=" ", flush=True)
    t0 = time.time()
    tiles_dir = out_dir / "3dtiles"
    gen.generate_3dtiles(str(tiles_dir), bounds)
    t_tiles = time.time() - t0
    print(f"done ({t_tiles:.1f}s)", flush=True)

    # Generate Parquet
    t_parquet = 0.0
    if do_parquet:
        print("Encoding Parquet...", end=" ", flush=True)
        t0 = time.time()
        pq_dir = out_dir / "features.parquet"
        pq_rows = gen.generate_parquet_native(
            str(pq_dir),
            bounds,
            simplify=True,
        )
        t_parquet = time.time() - t0
        print(f"{pq_rows:,} rows ({t_parquet:.1f}s)", flush=True)

    total_time = time.time() - t_start
    print(f"Done. Output: {out_dir} ({total_time:.0f}s)", flush=True)

    return {
        "total_time": total_time,
        "feature_count": len(fids),
        "timings": {"ingest": t_ingest, "tiles": t_tiles, "parquet": t_parquet},
    }
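
When bounds is omitted, the converter calls the Rust function scan_obj_bounds, whose implementation is not shown here. The pure-Python sketch below only illustrates what such a scan plausibly computes, assuming it reduces the "v x y z" vertex lines of every file to an (xmin, ymin, zmin, xmax, ymax, zmax) tuple. scan_bounds_from_text and the sample meshes are hypothetical.

```python
import math


def scan_bounds_from_text(obj_texts):
    """Compute (xmin, ymin, zmin, xmax, ymax, zmax) over OBJ vertex lines."""
    lo = [math.inf] * 3
    hi = [-math.inf] * 3
    for text in obj_texts:
        for line in text.splitlines():
            parts = line.split()
            # Only geometric vertices ("v"), not normals ("vn") or faces ("f").
            if len(parts) >= 4 and parts[0] == "v":
                for axis in range(3):
                    value = float(parts[axis + 1])
                    lo[axis] = min(lo[axis], value)
                    hi[axis] = max(hi[axis], value)
    if math.isinf(lo[0]):
        raise ValueError("no vertex lines found")
    return (*lo, *hi)


mesh_a = "v 0 0 0\nv 1 2 3\nf 1 2 1\n"
mesh_b = "# comment\nv -1 5 0.5\nvn 0 0 1\n"
bounds = scan_bounds_from_text([mesh_a, mesh_b])
```

Passing bounds explicitly in the config skips this extra pass over every file, which can be worthwhile for large mesh collections.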

geojson

GeoJSON → muDM tiled 2D format.

Converts GeoJSON FeatureCollection files into quadtree-tiled MVT vector tiles and partitioned Parquet.

GeoJsonConverter

Convert GeoJSON files to muDM tiled 2D format.

convert

convert(
    input_dir: str, output_dir: str, config: dict[str, Any]
) -> dict

Convert GeoJSON to tiled output.

input_dir can be a single .geojson/.json file or a directory.

Config keys

  • temp_dir (str): Temp directory for fragments.
  • max_zoom (int): Max zoom level. Default: 7.
  • min_zoom (int): Min zoom level. Default: 0.
  • bounds (tuple): World bounds (xmin,ymin,xmax,ymax). If not provided, computed from features.
  • layer_name (str): MVT layer name. Default: "features".
  • glob (str): Glob pattern if input_dir is a directory. Default: "*.geojson".
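
The converter's _compute_bounds helper is not reproduced on this page. As a sketch of what "computed from features" could mean, the code below takes the minimum and maximum over every coordinate in a FeatureCollection. iter_positions, compute_bounds, and the sample data are hypothetical, and GeometryCollection geometries are not handled.

```python
def iter_positions(coords):
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: [x, y] or [x, y, z].
        yield coords[0], coords[1]
    else:
        for child in coords:
            yield from iter_positions(child)


def compute_bounds(feature_collection):
    """Return (xmin, ymin, xmax, ymax) over all feature coordinates."""
    xs, ys = [], []
    for feature in feature_collection["features"]:
        geometry = feature.get("geometry") or {}
        for x, y in iter_positions(geometry.get("coordinates", [])):
            xs.append(x)
            ys.append(y)
    if not xs:
        raise ValueError("no coordinates found")
    return (min(xs), min(ys), max(xs), max(ys))


fc = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Point", "coordinates": [10, 20]}},
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [100, 0], [100, 50], [0, 0]]]}},
    ],
}
bounds = compute_bounds(fc)
```

When several datasets must line up in one viewer, passing the same explicit bounds to each conversion avoids per-file differences in the computed extent.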

Source code in src/mudm_tools/converters/geojson.py
def convert(
    self,
    input_dir: str,
    output_dir: str,
    config: dict[str, Any],
) -> dict:
    """Convert GeoJSON to tiled output.

    input_dir can be a single .geojson/.json file or a directory.

    Config keys:
        temp_dir (str): Temp directory for fragments.
        max_zoom (int): Max zoom level. Default: 7.
        min_zoom (int): Min zoom level. Default: 0.
        bounds (tuple): World bounds (xmin,ymin,xmax,ymax).
            If not provided, computed from features.
        layer_name (str): MVT layer name. Default: "features".
        glob (str): Glob pattern if input_dir is a directory. Default: "*.geojson".
    """
    from mudm_tools._rs import StreamingTileGenerator2D
    from mudm_tools.tiling2d import generate_pbf

    input_path = Path(input_dir)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    temp_dir = config.get("temp_dir", tempfile.gettempdir())
    max_zoom = config.get("max_zoom", 7)
    min_zoom = config.get("min_zoom", 0)
    bounds = config.get("bounds")
    layer_name = config.get("layer_name", "features")

    t_start = time.time()

    # Load GeoJSON
    if input_path.is_file():
        geojson_files = [input_path]
    else:
        glob_pattern = config.get("glob", "*.geojson")
        geojson_files = sorted(input_path.glob(glob_pattern))

    if not geojson_files:
        raise FileNotFoundError(f"No GeoJSON files found at {input_path}")

    # Compute bounds if not provided
    if bounds is None:
        bounds = self._compute_bounds(geojson_files)

    gen = StreamingTileGenerator2D(
        min_zoom=min_zoom,
        max_zoom=max_zoom,
        buffer=64 / 4096.0,
        temp_dir=temp_dir,
    )

    # Ingest
    print(f"Ingesting {len(geojson_files)} GeoJSON file(s)...", end=" ", flush=True)
    t0 = time.time()
    if len(geojson_files) == 1:
        geojson_str = geojson_files[0].read_text()
        fids = gen.add_geojson(geojson_str, bounds)
    else:
        fids = gen.add_geojson_files([str(f) for f in geojson_files], bounds)
    t_ingest = time.time() - t0
    print(f"{len(fids)} features ({t_ingest:.1f}s)", flush=True)

    # PBF
    print("Encoding PBF...", end=" ", flush=True)
    t0 = time.time()
    mvt_dir = out_dir / "vectors"
    generate_pbf(gen, str(mvt_dir), bounds, simplify=True, layer_name=layer_name)
    t_pbf = time.time() - t0
    print(f"done ({t_pbf:.1f}s)", flush=True)

    # Parquet
    print("Encoding Parquet...", end=" ", flush=True)
    t0 = time.time()
    pq_dir = out_dir / "features.parquet"
    pq_rows = gen.generate_parquet_native(str(pq_dir), bounds, simplify=True)
    t_pq = time.time() - t0
    print(f"{pq_rows:,} rows ({t_pq:.1f}s)", flush=True)

    total_time = time.time() - t_start
    print(f"Done. Output: {out_dir} ({total_time:.0f}s)", flush=True)

    return {
        "total_time": total_time,
        "feature_count": len(fids),
        "timings": {"ingest": t_ingest, "pbf": t_pbf, "parquet": t_pq},
    }
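
The file-or-directory handling in the listing above can be exercised on its own. The sketch below copies that branch into a hypothetical helper, resolve_geojson_inputs, and runs it against a temporary directory. It also shows that the default "*.geojson" glob does not match .json files inside a directory.

```python
import tempfile
from pathlib import Path


def resolve_geojson_inputs(input_path, glob_pattern="*.geojson"):
    """Return the list of input files for a file path or a directory."""
    path = Path(input_path)
    files = [path] if path.is_file() else sorted(path.glob(glob_pattern))
    if not files:
        raise FileNotFoundError(f"No GeoJSON files found at {path}")
    return files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "b.geojson").write_text("{}")
    (root / "a.geojson").write_text("{}")
    (root / "notes.json").write_text("{}")

    # Directory input: sorted matches of the glob; notes.json is skipped.
    from_dir = [p.name for p in resolve_geojson_inputs(root)]
    # File input: used as-is, regardless of extension.
    from_file = [p.name for p in resolve_geojson_inputs(root / "notes.json")]
```

To pick up .json files from a directory, set the glob config key accordingly, for example "*.json".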

See also

  • 2D Tiling guide — the MVT + Parquet tile model the GeoJSON and Xenium converters produce.
  • 3D Tiling guide — the octree 3D Tiles model the OBJ converter produces.
  • CLI reference — full command-line reference, including mudm-serve.