Using @workgroup_id for Parallel Tile Processing

When a continent-scale point cloud is partitioned into a regular tile grid and handed to a single compute dispatch, the one mapping that decides whether the pass scales or corrupts its output is how each workgroup discovers which tile it owns. WebGPU gives every workgroup a read-only @builtin(workgroup_id) — a vec3<u32> index into the dispatch grid — and the entire correctness of parallel tile processing rests on turning that index into a deterministic, non-overlapping slice of the spatial dataset. Get the offset arithmetic right and a thousand workgroups read disjoint tile buffers with no synchronisation; get it wrong and you produce double-counted features, visible seams between tiles, and atomic contention that serialises the whole pass. This page shows the exact @workgroup_id → tile-offset derivation, the values you must tune around it, and the spatial-data-specific ways it fails.

This is one technique within Optimization Flags for Compute Dispatches, and it assumes you already understand why a compute pass is scheduled differently from a render pass.

The sub-problem: deterministic tile ownership

A 2D dispatch launches groupCountX × groupCountY workgroups. WebGPU guarantees each one receives a unique workgroup_id within that grid but makes no guarantee about execution order or concurrency. That means a tile partition is only safe if every workgroup writes to memory derived solely from its own workgroup_id: no shared cursor, no overlapping ranges. The job is to convert the 2D logical index into a flat tile offset, fetch that tile’s bounds and primitive range, and process only those primitives. Reserve workgroup_id.z for an orthogonal axis (LOD level, time step, or spectral band) so a single dispatch can sweep a 3D problem without re-encoding.
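The ownership rule can be checked on the CPU before any shader runs. The sketch below is a plain-Python model, not GPU code. The helper names flat_tile_index and check_ownership are hypothetical. It enumerates every workgroup id in a dispatch and confirms the row-major flatten yields each tile offset exactly once.

```python
from itertools import product


def flat_tile_index(wg_x: int, wg_y: int, grid_width: int) -> int:
    """Row-major flatten, mirroring wg_id.x + wg_id.y * grid_width in the shader."""
    return wg_x + wg_y * grid_width


def check_ownership(grid_width: int, grid_height: int) -> bool:
    """True when every in-range workgroup id maps to a distinct tile and all tiles are covered."""
    offsets = [
        flat_tile_index(wg_x, wg_y, grid_width)
        for wg_y, wg_x in product(range(grid_height), range(grid_width))
    ]
    no_duplicates = len(offsets) == len(set(offsets))
    full_coverage = set(offsets) == set(range(grid_width * grid_height))
    return no_duplicates and full_coverage
```

With grid_width = 5, an over-launched x of 5 on row 0 flattens to offset 5, the same offset as tile (0, 1). That aliasing is why an over-launched dispatch needs a per-axis bounds check rather than a check on the flat index alone.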

Runnable reference implementation

The kernel below takes a grid of pre-partitioned tiles, derives the owning tile from @workgroup_id, early-exits tiles outside the viewport, and compacts surviving primitives into a contiguous output buffer. Each invocation within the workgroup strides through the tile's primitive range, handling every 64th primitive.

wgsl

// Tile metadata: xy = min corner, zw = max corner (world units), one entry per tile.
@group(0) @binding(0) var<storage, read>       tile_bounds: array<vec4<f32>>;
// Flat list of every tile's first primitive index + primitive count: x = start, y = count.
@group(0) @binding(1) var<storage, read>       tile_ranges: array<vec2<u32>>;
// Source primitives, Structure-of-Arrays friendly: xy = position, zw = packed attributes.
@group(0) @binding(2) var<storage, read>       primitives:  array<vec4<f32>>;
// Compacted survivors written here; out_count tracks the running write head.
@group(0) @binding(3) var<storage, read_write> filtered:    array<vec4<f32>>;
@group(0) @binding(4) var<storage, read_write> out_count:   atomic<u32>;
// Dispatch + view uniforms passed from the host every frame.
@group(0) @binding(5) var<uniform>             view: Uniforms;

struct Uniforms {
    grid_width:  u32,        // tiles across — used to flatten workgroup_id.xy
    grid_height: u32,
    viewport:    vec4<f32>,  // current visible AABB in world units
};

const TILE_THREADS: u32 = 64u; // must equal @workgroup_size product

fn aabb_overlaps(a: vec4<f32>, b: vec4<f32>) -> bool {
    return a.x <= b.z && a.z >= b.x && a.y <= b.w && a.w >= b.y;
}

@compute @workgroup_size(64, 1, 1)
fn process_tile(
    @builtin(workgroup_id)            wg_id: vec3<u32>,
    @builtin(local_invocation_index)  lid:   u32,
) {
    // 1. Flatten the 2D tile index. Row-major must match the host's tile layout.
    let tile_index = wg_id.x + wg_id.y * view.grid_width;

    // 2. Guard per axis. A check on the flat index alone would let an out-of-range x
    //    alias into the next row if the dispatch is ever over-launched.
    if (wg_id.x >= view.grid_width || wg_id.y >= view.grid_height) { return; }

    // 3. Cull the whole workgroup before touching primitive data.
    let bounds = tile_bounds[tile_index];
    if (!aabb_overlaps(bounds, view.viewport)) { return; }

    // 4. Each invocation strides across the tile's primitive range.
    let range = tile_ranges[tile_index];   // x = start offset, y = count
    var i = lid;
    while (i < range.y) {
        let prim = primitives[range.x + i];
        // Per-primitive predicate (point-in-viewport shown; swap for clip/LOD tests).
        if (prim.x >= view.viewport.x && prim.x <= view.viewport.z &&
            prim.y >= view.viewport.y && prim.y <= view.viewport.w) {
            // 5. Atomic compaction yields a packed output, no CPU defrag needed.
            let slot = atomicAdd(&out_count, 1u);
            filtered[slot] = prim;
        }
        i = i + TILE_THREADS;
    }
}

Host side, the dispatch grid is the tile grid itself. The ceiling division happens once, in the data-prep step that computes cols and rows, so partial edge tiles are never dropped:

typescript

// tileGrid was produced by the data-prep step; one workgroup per tile.
const groupCountX = tileGrid.cols; // grid is already tile-granular, not pixel-granular
const groupCountY = tileGrid.rows;

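// grid_width and grid_height occupy bytes 0-7 of the uniform buffer.
// The viewport vec4<f32> is aligned to byte offset 16 and must be written
// separately whenever the view changes; it is not uploaded here.
device.queue.writeBuffer(uniformBuffer, 0, new Uint32Array([
  tileGrid.cols,
  tileGrid.rows,
]));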

const pass = encoder.beginComputePass();
pass.setPipeline(tilePipeline);
pass.setBindGroup(0, tileBindGroup);
pass.dispatchWorkgroups(groupCountX, groupCountY, 1);
pass.end();

Because the host launches exactly one workgroup per tile, workgroup_id.xy is the tile coordinate — no per-thread offset multiply is needed at the grid level, and the in-shader stride loop in step 4 handles tiles whose primitive count exceeds the 64-thread workgroup. The compaction write head uses a single global atomic; when tile density is high, replace it with a workgroup-local count plus one atomicAdd per workgroup, exactly as covered in Optimization Flags for Compute Dispatches.
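The kernel's Uniforms struct places viewport at byte offset 16, because vec4<f32> is 16-byte aligned. The two grid dimensions are therefore followed by 8 bytes of padding. The sketch below models that byte layout in Python so a host upload can be checked against the shader. The pack_uniforms helper is hypothetical and not part of the original host code.

```python
import struct


def pack_uniforms(grid_width: int, grid_height: int, viewport: tuple) -> bytes:
    """Pack the Uniforms struct as little-endian bytes.

    Layout: u32 grid_width at offset 0, u32 grid_height at offset 4,
    8 padding bytes, then vec4<f32> viewport (min_x, min_y, max_x, max_y) at offset 16.
    """
    min_x, min_y, max_x, max_y = viewport
    # "<" disables native alignment so the explicit "8x" padding is the only padding.
    return struct.pack("<II8x4f", grid_width, grid_height, min_x, min_y, max_x, max_y)


blob = pack_uniforms(12, 8, (0.0, 0.0, 1024.0, 512.0))
```

The resulting buffer is 32 bytes long. The host snippet earlier writes only the first 8 of those bytes, so the viewport still needs its own upload at offset 16.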

Parameter and configuration reference

Every tunable in the code above, with guidance for geospatial workloads:

Value	Where	Spatial-workload guidance
`@workgroup_size(64, 1, 1)`	WGSL entry point	Use a multiple of the hardware subgroup width (32 on NVIDIA; 32 or 64 on AMD depending on architecture). 64 is a safe default for irregular vector tiles; on mobile SoCs, profile 32 as well, since a smaller workgroup can relieve register pressure.
`TILE_THREADS`	WGSL const	Must equal the product of `@workgroup_size`. The stride loop reads it; a mismatch silently skips or re-reads primitives.
`grid_width` / `grid_height`	uniform	Tiles across/down. Must be `ceil(extent / tileSize)`; pad so no primitive falls outside the grid. Pass as a uniform, never hard-code, so viewport zoom can re-tile at runtime.
`workgroup_id.z`	dispatch	Reserve for LOD level, time step, or spectral band. Leave at depth 1 for a flat tile sweep.
Tile size (world units)	data-prep	Target ~256–4096 primitives per tile. Too small wastes a whole workgroup on a near-empty tile; too large makes each workgroup run many stride-loop iterations while leaving too few workgroups to keep the GPU occupied. Align to your backend’s quadtree/H3 cell size.
`tile_bounds` stride	buffer layout	One `vec4<f32>` (16 bytes) per tile packs with no padding under WGSL's storage-buffer layout rules (comparable to std430), as described in memory alignment for spatial data buffers.
`out_count` atomic	buffer	Single global counter is simplest; switch to per-workgroup reduction when feature density per tile climbs past a few thousand.

The wasted-lane cost of the stride loop's final iteration is worth stating precisely. For a tile of $N$ primitives processed by a workgroup of width $W$, the loop runs $\lceil N / W \rceil$ iterations, and the idle invocations in the final iteration number $W\lceil N/W \rceil - N$, which is why a primitive count that is a clean multiple of the workgroup width gives the tightest lane utilisation.
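The padded-tail formula is easy to check numerically. The sketch below uses a hypothetical helper, idle_lanes, that evaluates $W\lceil N/W \rceil - N$ with integer arithmetic for n items spread over blocks of width w.

```python
def idle_lanes(n: int, w: int) -> int:
    """Idle invocations in the final block when n items are covered by blocks of width w."""
    blocks = -(-n // w)  # integer ceiling division, avoids float rounding
    return w * blocks - n
```

For 1000 items at width 64, the count rounds up to 16 blocks (1024 lanes), leaving 24 idle. At 1024 items the tail is exactly full and no lanes idle.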

Failure modes specific to tile mapping

Row-major / column-major mismatch. The shader flattens with wg_id.x + wg_id.y * grid_width, but the host packed tile_bounds column-major. Detection: output looks transposed or features land in the wrong region; a one-dimensional test grid (1×N or N×1) renders correctly while a 2D grid scrambles. Fix: make the host packing order and the in-shader flatten use the identical convention; pick row-major everywhere.
Overlapping tile ranges (double counting). Two tile_ranges entries share primitive indices because the partition step rounded boundaries inconsistently. Detection: out_count exceeds the input cardinality; the same feature appears twice in filtered. Fix: generate half-open [start, start+count) ranges in the data-prep pass and assert they tile the buffer with no gaps or overlaps.
Dropped tail tiles. Using floor instead of ceil for grid_width/grid_height omits the partial edge row or column, leaving a strip of geometry unprocessed. Detection: a clean missing band along the maximum-x or maximum-y edge of the map. Fix: Math.ceil(extent / tileSize), and keep the bounds guard from step 2 in case the dispatch is ever over-launched.
Atomic-contention collapse on dense tiles. Every workgroup writing one global out_count serialises on dense urban tiles: results stay correct, but frame time tracks feature density instead of algorithm cost. Detection: a timestamp query shows this pass's duration spiking on dense viewports only. Fix: accumulate survivors in var<workgroup> and emit one atomicAdd per workgroup; pair shared writes with workgroupBarrier(). Devices without compute support should fall back through the browser support fallback routing strategies.
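The per-workgroup reservation in the last fix can be modelled on the CPU to see how it cuts traffic on the global counter. This is a sequential Python sketch under stated assumptions, not the shader itself. The name compact_two_level is hypothetical, and the loop order stands in for whatever order the GPU happens to schedule workgroups in.

```python
def compact_two_level(tiles, keep):
    """Model one global reservation per workgroup instead of one per survivor.

    tiles: list of lists, one inner list of primitives per workgroup.
    keep:  predicate applied to each primitive.
    Returns (filtered, global_adds).
    """
    filtered = []
    out_count = 0    # stands in for the global atomic<u32>
    global_adds = 0  # number of times the global counter is touched
    for prims in tiles:
        survivors = [p for p in prims if keep(p)]  # workgroup-local staging
        if not survivors:
            continue  # an empty workgroup never touches the global counter
        base = out_count             # the value a single atomicAdd would return
        out_count += len(survivors)  # reserve the whole block at once
        global_adds += 1
        filtered[base:base + len(survivors)] = survivors
    return filtered, global_adds


packed, adds = compact_two_level([[1, 2, 3, 4], [5, 6], [7, 9]], lambda p: p % 2 == 0)
```

Here three survivors cost two global additions rather than three, and the saving grows with tile density. On a real device the blocks land in a nondeterministic order, so the output is packed but not stably ordered between runs.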

Backend / Python interop note

The dispatch grid is only correct if the Python data-prep step produces tiles whose count, order, and bounds match the uniforms the host uploads. When exporting from GeoPandas to GeoParquet, partition geometries into the same grid the shader will sweep and serialise the per-tile primitive ranges so the GPU never recomputes them:

python

import geopandas as gpd
import numpy as np

gdf = gpd.read_parquet("features.parquet")
minx, miny, maxx, maxy = gdf.total_bounds

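tile_size = 2048.0  # world units; must be the tile size used to build tile_bounds
# max(1, ...) keeps a degenerate extent (all points share an x or y) from producing an empty grid.
cols = max(1, int(np.ceil((maxx - minx) / tile_size)))
rows = max(1, int(np.ceil((maxy - miny) / tile_size)))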

# Row-major tile id, identical convention to the WGSL flatten. Assumes point geometries (.x / .y).
cx = ((gdf.geometry.x - minx) // tile_size).astype("int32").clip(0, cols - 1)
cy = ((gdf.geometry.y - miny) // tile_size).astype("int32").clip(0, rows - 1)
gdf["tile_id"] = cy * cols + cx

# Sort by tile so each tile's primitives are contiguous -> coalesced GPU reads.
gdf = gdf.sort_values("tile_id").reset_index(drop=True)

# Build the [start, count] ranges the shader binds as tile_ranges.
counts = gdf.groupby("tile_id").size().reindex(range(cols * rows), fill_value=0)
starts = np.concatenate([[0], np.cumsum(counts.values)[:-1]])
tile_ranges = np.stack([starts, counts.values], axis=1).astype("uint32")

Three alignment rules carry across the boundary: cols and rows here must equal the grid_width and grid_height the host uploads, and tile_size must be the one used to build tile_bounds; the row-major tile_id must match the in-shader flatten; and sorting features by tile_id is what makes each tile’s primitives contiguous in the buffer, which is what lets the stride loop read coalesced memory rather than scattering across VRAM. Write the sorted positions and the tile_ranges array as separate arrays so the upload stays Structure-of-Arrays and each one copies straight into its own binding: primitives at a 16-byte vec4<f32> stride, tile_ranges at an 8-byte vec2<u32> stride.
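Two pieces are left implicit by the export script above: the tile_bounds array the shader binds, and a check that the ranges really partition the primitive buffer. The following sketch covers both in plain Python. The helper names tile_bounds_row_major and ranges_tile_buffer are hypothetical, and it assumes square tiles anchored at the dataset's minimum corner.

```python
def tile_bounds_row_major(minx, miny, tile_size, cols, rows):
    """Return (min_x, min_y, max_x, max_y) per tile, in the same row-major order as tile_id."""
    bounds = []
    for row in range(rows):
        for col in range(cols):
            x0 = minx + col * tile_size
            y0 = miny + row * tile_size
            bounds.append((x0, y0, x0 + tile_size, y0 + tile_size))
    return bounds


def ranges_tile_buffer(tile_ranges, primitive_count):
    """True when the half-open [start, start + count) ranges cover the buffer once, in order."""
    cursor = 0
    for start, count in tile_ranges:
        if start != cursor:  # a gap (start > cursor) or an overlap (start < cursor)
            return False
        cursor += count
    return cursor == primitive_count
```

Running ranges_tile_buffer over the exported (start, count) pairs before upload catches the double-counting failure at data-prep time, well before it shows up as an inflated out_count on the GPU.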

Optimization Flags for Compute Dispatches — the workgroup-sizing, atomic-routing, and indirect-dispatch flags this tile mapping plugs into.
Geometry Filtering with WGSL Compute Shaders — the per-primitive predicates that run inside each tile’s stride loop.
Spatial Aggregation in GPU Memory — accumulating per-tile densities once features are partitioned.
Async Dispatch Patterns for Spatial Clustering — chaining tile passes without CPU-GPU stalls.
Memory Alignment for Spatial Data Buffers — the layout rules behind tile-bounds and range buffers.
