Async Dispatch Patterns for Spatial Clustering in WebGPU

Clustering a few hundred thousand features per frame is where most browser GIS pipelines fall apart. If the application calls device.queue.submit() and then holds the frame until the result is available, or worse, maps a result buffer on the same tick, the main thread stalls, input handling backs up, and the map visibly hitches as the user pans. The concrete scenario this guide addresses is a viewport that streams new tiles on every camera move, where each tile must be filtered, binned, and reduced to cluster centroids on the GPU while the render loop keeps painting at 60 fps. The solution is an asynchronous submission graph: record a chain of compute passes, submit once, and resolve completion through queue.onSubmittedWorkDone() so the main thread never blocks on the GPU. This article takes one stage of the spatial compute shaders and geometry pipelines architecture, the asynchronous orchestration around clustering, to production depth.

Prerequisites

This guide assumes working familiarity with the following before you attempt an asynchronous clustering pipeline:

A valid GPUDevice. Adapter negotiation, feature checks, and limit inspection are handled upstream during WebGPU device initialization for GIS workloads; this article assumes a device and device.queue are already in hand.
The distinction between compute and render work. Async dispatch only pays off when you understand why compute passes execute independently of draw calls — covered in the compute versus render pipeline fundamentals reference.
Storage-buffer alignment. Coordinate payloads must satisfy WGSL’s 16-byte rules described in memory alignment for spatial data buffers, or the binning pass will read garbage at tile boundaries.
Browser support. WebGPU ships in Chrome and Edge 113+ and in recent Firefox and Safari releases, with availability that varies by platform and version; environments without it need a graceful path through browser support and fallback routing strategies.
Data format assumptions. Geometry arrives as packed Structure-of-Arrays buffers (Float32Array coordinate pairs, Uint32Array attribute masks), typically prepared by a Python backend and uploaded with device.queue.writeBuffer().
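
The Structure-of-Arrays layout described above can be sketched as follows. The Feature shape and the packFeatures name are illustrative assumptions, not part of any backend contract; the point is only that coordinates and masks travel as two parallel typed arrays.

```typescript
// Hypothetical feature shape; real pipelines will carry more attributes.
interface Feature {
  x: number;
  y: number;
  mask: number; // attribute bit flags
}

interface PackedTile {
  coords: Float32Array; // [x0, y0, x1, y1, ...]
  masks: Uint32Array;   // one mask per feature, same order as coords
}

function packFeatures(features: Feature[]): PackedTile {
  const coords = new Float32Array(features.length * 2);
  const masks = new Uint32Array(features.length);
  features.forEach((f, i) => {
    coords[2 * i] = f.x;
    coords[2 * i + 1] = f.y;
    masks[i] = f.mask >>> 0; // force an unsigned 32-bit value
  });
  return { coords, masks };
}
```

Both arrays have byte lengths that are multiples of 4, which device.queue.writeBuffer() requires, so each can be uploaded as-is.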

Dispatch and Synchronization API Reference

The asynchronous behavior of a clustering pipeline is governed by a small set of queue and pass APIs. The table below summarizes the fields and methods that determine whether a submission blocks the main thread.

| API surface | Purpose in clustering | Spatial-workload notes |
| --- | --- | --- |
| `device.queue.submit(buffers)` | Hands recorded command buffers to the GPU | Fire-and-forget: returns `undefined` immediately, so there is nothing to `await` |
| `queue.onSubmittedWorkDone()` | Resolves when all prior submissions finish | Use to gate the next tile batch, not to block the render loop |
| `GPUBuffer.mapAsync(mode)` | Maps a readback buffer for CPU access | Only resolves after the copy completes; map a dedicated `MAP_READ \| COPY_DST` buffer, because `STORAGE` buffers cannot be mapped for reading |
| `GPUComputePassEncoder.dispatchWorkgroups(x, y, z)` | Launches one filter, bin, or reduce stage | `x = ceil(featureCount / workgroup_size)`; cap at `maxComputeWorkgroupsPerDimension` |
| `GPUComputePassDescriptor.timestampWrites` | Records GPU-side stage timings | Pair with a `GPUQuerySet` of type `"timestamp"` (needs the `timestamp-query` feature) to attribute stalls to a stage |
| `GPUBufferUsage.STORAGE \| GPUBufferUsage.COPY_SRC` | Lets a buffer be both written and copied out | Usage is fixed at creation, so declare both flags up front |
| `device.limits.maxComputeWorkgroupsPerDimension` | Hardware dispatch ceiling | Defaults to 65,535; split very large tiles into multiple dispatches |
| `device.limits.maxStorageBufferBindingSize` | Largest bindable storage range | Caps how many features one pass can address; tile accordingly |
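
The workgroup-count rule and the per-dimension cap from the table can be sketched as a small planning helper. planDispatches and DispatchSlice are hypothetical names, and the sketch assumes the kernel receives firstFeature as a base offset (for example through a uniform) so that each dispatch covers its own slice.

```typescript
// Hypothetical helper: splits a feature range into dispatches that each stay
// under the per-dimension workgroup limit reported by the device.
interface DispatchSlice {
  firstFeature: number; // index of the first feature this dispatch covers
  workgroups: number;   // value to pass as x to dispatchWorkgroups()
}

function planDispatches(
  featureCount: number,
  workgroupSize: number,
  maxWorkgroupsPerDimension: number,
): DispatchSlice[] {
  const slices: DispatchSlice[] = [];
  const total = Math.ceil(featureCount / workgroupSize);
  for (let done = 0; done < total; done += maxWorkgroupsPerDimension) {
    slices.push({
      firstFeature: done * workgroupSize,
      workgroups: Math.min(maxWorkgroupsPerDimension, total - done),
    });
  }
  return slices;
}
```

With a workgroup size of 256 and the default limit of 65,535, a 500k-feature tile fits in one dispatch of 1,954 workgroups; a second dispatch is only needed beyond roughly 16.7 million features.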

Implementation Walkthrough

Step 1 — A reusable submission context

Rather than recreating an encoder per pass, wrap device.createCommandEncoder() in a context that tracks a monotonically increasing sequence value. The sequence number lets you correlate a resolved onSubmittedWorkDone() promise with the specific tile batch it belongs to, which matters when the camera moves faster than the GPU drains.

typescript

interface DispatchContext {
  encoder: GPUCommandEncoder;
  seq: number;
}

class ClusterDispatcher {
  private seq = 0;

  constructor(private device: GPUDevice) {}

  begin(): DispatchContext {
    // A fresh encoder per frame batch keeps command buffers small and
    // avoids retaining stale bind groups from a panned-away tile.
    return { encoder: this.device.createCommandEncoder(), seq: ++this.seq };
  }

  async submit(ctx: DispatchContext): Promise<number> {
    this.device.queue.submit([ctx.encoder.finish()]);
    // Resolve when *this* batch is done without blocking the main thread:
    // the await happens off the render loop's critical path.
    await this.device.queue.onSubmittedWorkDone();
    return ctx.seq;
  }
}

Step 2 — Chaining filter, bin, and reduce into one submission

The whole point of async dispatch is to record the entire clustering chain into a single command buffer so the GPU schedules it without per-stage round trips. The filter stage reuses geometry filtering with WGSL compute shaders to compact valid features; the reduce stage feeds spatial aggregation in GPU memory to produce centroids.

typescript

function recordClusterPasses(
  ctx: DispatchContext,
  pipelines: { filter: GPUComputePipeline; bin: GPUComputePipeline; reduce: GPUComputePipeline },
  binds: { filter: GPUBindGroup; bin: GPUBindGroup; reduce: GPUBindGroup },
  featureCount: number,
  gridRows: number,
): void {
  const WG = 256;
  const groups = Math.ceil(featureCount / WG);

  const pass = ctx.encoder.beginComputePass({ label: `cluster-${ctx.seq}` });

  // Pass 1: spatial + attribute predicate, atomic compaction of survivors.
  pass.setPipeline(pipelines.filter);
  pass.setBindGroup(0, binds.filter);
  pass.dispatchWorkgroups(groups);

  // Pass 2: scatter survivors into grid bins via atomicAdd on sums/counts.
  pass.setPipeline(pipelines.bin);
  pass.setBindGroup(0, binds.bin);
  pass.dispatchWorkgroups(groups);

  // Pass 3: normalize bin sums to centroids — one workgroup per bin row.
  pass.setPipeline(pipelines.reduce);
  pass.setBindGroup(0, binds.reduce);
  pass.dispatchWorkgroups(gridRows);

  pass.end();
}

Recording all three stages inside a single beginComputePass()/end() block keeps the chain in one pass. WebGPU treats each dispatch as its own usage scope and executes dispatches in recorded order, so the filter’s compacted index buffer is visible to the binning stage without an explicit fence. Split into separate passes only when you need something that is scoped to a pass, such as per-stage timestampWrites; different pipelines and bind-group layouts can coexist in one pass because setPipeline() and setBindGroup() may be called between dispatches.
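
For validating GPU output on small tiles, it can help to have a CPU reference of what the three stages compute. The sketch below is an assumption-laden stand-in, not the article's WGSL: the names (clusterOnCpu, Bounds, Centroid), the regular grid, and the viewport-only filter predicate are all illustrative.

```typescript
// CPU reference for the filter, bin, and reduce stages on a regular grid.
interface Bounds { minX: number; minY: number; maxX: number; maxY: number }
interface Centroid { col: number; row: number; x: number; y: number; count: number }

function clusterOnCpu(
  coords: Float32Array, // packed [x0, y0, x1, y1, ...]
  bounds: Bounds,
  gridCols: number,
  gridRows: number,
): Centroid[] {
  const cells = gridCols * gridRows;
  const sumX = new Float64Array(cells);
  const sumY = new Float64Array(cells);
  const counts = new Uint32Array(cells);
  const cellW = (bounds.maxX - bounds.minX) / gridCols;
  const cellH = (bounds.maxY - bounds.minY) / gridRows;

  for (let i = 0; i + 1 < coords.length; i += 2) {
    const x = coords[i];
    const y = coords[i + 1];
    // Stage 1 (filter): drop features outside the viewport.
    if (x < bounds.minX || x >= bounds.maxX || y < bounds.minY || y >= bounds.maxY) continue;
    // Stage 2 (bin): accumulate sums and counts per grid cell.
    const col = Math.min(gridCols - 1, Math.floor((x - bounds.minX) / cellW));
    const row = Math.min(gridRows - 1, Math.floor((y - bounds.minY) / cellH));
    const cell = row * gridCols + col;
    sumX[cell] += x;
    sumY[cell] += y;
    counts[cell] += 1;
  }

  // Stage 3 (reduce): normalize sums into centroids for non-empty cells.
  const out: Centroid[] = [];
  for (let cell = 0; cell < cells; cell++) {
    if (counts[cell] === 0) continue;
    out.push({
      col: cell % gridCols,
      row: Math.floor(cell / gridCols),
      x: sumX[cell] / counts[cell],
      y: sumY[cell] / counts[cell],
      count: counts[cell],
    });
  }
  return out;
}
```

Comparing this output against the GPU centroid buffer for a handful of tiles is a cheap way to catch ordering or indexing mistakes in the recorded chain.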

Step 3 — Non-blocking readback of the surviving-feature count

The main thread usually needs the surviving-feature count to size the next frame’s draw call. Read it back through a dedicated mappable buffer so the in-flight STORAGE buffer is never stalled by a CPU map.

typescript

async function readClusterCount(
  device: GPUDevice,
  countBuffer: GPUBuffer,   // STORAGE | COPY_SRC, holds the atomic counter
  readBuffer: GPUBuffer,    // MAP_READ | COPY_DST, 4 bytes
): Promise<number> {
  const enc = device.createCommandEncoder();
  enc.copyBufferToBuffer(countBuffer, 0, readBuffer, 0, 4);
  device.queue.submit([enc.finish()]);

  // mapAsync resolves on a later tick; awaiting here never blocks paint
  // because the render loop already consumed last frame's centroids.
  await readBuffer.mapAsync(GPUMapMode.READ);
  const count = new Uint32Array(readBuffer.getMappedRange())[0];
  readBuffer.unmap();
  return count;
}

Step 4 — Driving it from the frame loop

The orchestration glue submits the current tile batch and uses the resolved sequence number to discard results from tiles the user has already panned past — the freshest-wins rule that keeps a fast-panning map responsive.

typescript

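// Shape inferred from how frame() uses the scene below.
interface SceneState {
  pipelines: { filter: GPUComputePipeline; bin: GPUComputePipeline; reduce: GPUComputePipeline };
  binds: { filter: GPUBindGroup; bin: GPUBindGroup; reduce: GPUBindGroup };
  featureCount: number;
  gridRows: number;
  markClustersReady(seq: number): void;
}

let latestSeq = 0;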

function frame(dispatcher: ClusterDispatcher, scene: SceneState): void {
  const ctx = dispatcher.begin();
  recordClusterPasses(ctx, scene.pipelines, scene.binds, scene.featureCount, scene.gridRows);

  dispatcher.submit(ctx).then((seq) => {
    if (seq < latestSeq) return;     // stale batch — newer tiles superseded it
    latestSeq = seq;
    scene.markClustersReady(seq);    // next rAF binds the centroid buffer
  });

  requestAnimationFrame(() => frame(dispatcher, scene));
}

Backend teams can pre-partition datasets into spatial tiles and upload each tile as a GPUBuffer chunk, either directly with device.queue.writeBuffer() or through a staging buffer and copyBufferToBuffer(), to minimize transfer overhead during dispatch. When integrating with GeoPandas or Dask, pad tile boundaries by a margin equal to the maximum clustering radius so that a cluster straddling a tile edge is not split in two. Dispatch tuning for the individual stages (workgroup occupancy, indirect dispatch, timestamp profiling) is covered separately in optimization flags for compute dispatches.
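
The padding rule can be sketched in a few lines. TileBounds and padTileBounds are hypothetical names, and the sketch assumes the clustering radius is expressed in the same units as the tile coordinates.

```typescript
interface TileBounds { minX: number; minY: number; maxX: number; maxY: number }

// Grows a tile by the clustering radius on every side so that features just
// outside the nominal edge are still included when the tile is clustered.
function padTileBounds(tile: TileBounds, clusterRadius: number): TileBounds {
  const margin = Math.max(0, clusterRadius); // ignore negative radii
  return {
    minX: tile.minX - margin,
    minY: tile.minY - margin,
    maxX: tile.maxX + margin,
    maxY: tile.maxY + margin,
  };
}
```

The same arithmetic applies whether the padding happens in the backend partitioner or on the client before upload.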

Memory and Performance Implications

The dominant VRAM cost in an asynchronous clustering pipeline is double-buffering. Because the GPU may still be reading frame N’s input while the CPU records frame N+1, the input and output buffers for consecutive batches cannot alias. Budget roughly 2 × (featureCount × strideBytes) for the in-flight feature buffers plus the grid buffers. For a typical 500k-feature viewport with a 32-byte stride (for example, a record of two vec4<f32> values holding coordinates and extent), that is 2 × 500,000 × 32 bytes = 32 MB for feature storage alone, before grid and index buffers. That is comfortable on discrete GPUs but worth tiling on integrated hardware where maxStorageBufferBindingSize is lower.
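
The budget arithmetic above can be written down as two small helpers. Both function names are illustrative; the formulas are just the ones stated in the paragraph.

```typescript
// Bytes needed for the in-flight feature buffers (double-buffered by default).
function featureBufferBytes(
  featureCount: number,
  strideBytes: number,
  buffersInFlight = 2,
): number {
  return buffersInFlight * featureCount * strideBytes;
}

// Largest feature count one storage binding can address at a given stride.
function maxFeaturesPerBinding(
  strideBytes: number,
  maxStorageBufferBindingSize: number,
): number {
  return Math.floor(maxStorageBufferBindingSize / strideBytes);
}
```

For 500,000 features at a 32-byte stride, featureBufferBytes returns 32,000,000. Passing device.limits.maxStorageBufferBindingSize to maxFeaturesPerBinding gives the tile size above which a batch has to be split.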

A workgroup size of 256 invocations balances occupancy against register pressure for the scalar predicate work in the filter stage; it is also the default maxComputeInvocationsPerWorkgroup limit, so larger sizes need a raised limit. Sizes that are not multiples of the hardware wave width (32 on NVIDIA, 32 or 64 on AMD) waste lanes. The bin stage behaves differently: it is bound by atomic contention on bin counters, not by occupancy, so spreading work across more workgroups can worsen throughput by increasing contention on hot bins. Profile the stages independently with timestampWrites (which requires the optional timestamp-query feature) rather than assuming a single workgroup size is optimal across the chain.

CPU/GPU transfer cost is dominated by the per-batch upload, not the readback. The readback in Step 3 moves 4 bytes; the upload moves the full tile. Streaming new geometry through pre-allocated buffers with copyBufferToBuffer() avoids per-frame createBuffer() calls, which are the most common source of allocator-driven jank in clustering pipelines. Keep record strides aligned to 16 bytes so each vec4<f32> load maps to a single aligned transaction; misaligned strides can split loads and raise memory traffic, though the size of the penalty varies by GPU.

Failure Modes and Diagnostics

GPUValidationError on submit: buffer usage mismatch. Recording a copyBufferToBuffer() from a buffer created without COPY_SRC invalidates the encoder, and the validation error surfaces when the encoder is finished or the command buffer is submitted. Combining MAP_READ with STORAGE fails even earlier, at createBuffer(), because MAP_READ may only be combined with COPY_DST. Call device.pushErrorScope("validation") before the recording block and await device.popErrorScope() after submit to capture the error message; label your buffers so the message names the offending one. Always declare GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC together for any buffer you intend to read back.
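
The error-scope pattern can be wrapped in a small helper. This is a minimal sketch: withValidationScope and the ErrorScopeDevice interface are hypothetical, and the interface is declared locally only so the example type-checks without WebGPU typings. A real GPUDevice exposes pushErrorScope() and popErrorScope() with compatible shapes.

```typescript
// Minimal structural view of the GPUDevice error-scope methods.
interface ErrorScopeDevice {
  pushErrorScope(filter: "validation" | "out-of-memory" | "internal"): void;
  popErrorScope(): Promise<{ message: string } | null>;
}

// Runs `record` (encode + submit) inside a validation scope and returns a
// labeled error message, or null when the scope captured nothing.
async function withValidationScope(
  device: ErrorScopeDevice,
  label: string,
  record: () => void,
): Promise<string | null> {
  device.pushErrorScope("validation");
  try {
    record();
  } catch (err) {
    // Keep push/pop balanced even if recording throws synchronously.
    await device.popErrorScope();
    throw err;
  }
  const error = await device.popErrorScope();
  return error ? `${label}: ${error.message}` : null;
}
```

A caller might pass the tile's sequence number as the label so that a captured message can be tied back to the batch that produced it.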

OperationError on mapAsync(): buffer already mapped or destroyed. Calling mapAsync() on a readback buffer that is still mapped or has a map pending from a prior frame, or that was destroyed when a tile was evicted, rejects the promise. Guard the map with buffer.mapState === "unmapped" and ensure every mapAsync() is paired with unmap() before the next batch maps the same buffer. The freshest-wins guard in Step 4 only discards stale results; it does not serialize maps, so either give each in-flight batch its own readback buffer or skip the readback while the buffer is busy.
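
A guarded readback along these lines might look like the sketch below. readU32IfIdle and the MappableBuffer interface are hypothetical; the interface mirrors the subset of GPUBuffer the helper touches so the example stands alone, and returning null for a busy or destroyed buffer is one possible policy among several.

```typescript
// Structural subset of GPUBuffer used by the guard.
interface MappableBuffer {
  readonly mapState: "unmapped" | "pending" | "mapped";
  mapAsync(mode: number): Promise<void>;
  getMappedRange(): ArrayBuffer;
  unmap(): void;
}

const MAP_MODE_READ = 0x0001; // numeric value of GPUMapMode.READ

// Reads the first u32 from a readback buffer, or returns null when the
// buffer is busy (mapped or pending) or the map request is rejected.
async function readU32IfIdle(buffer: MappableBuffer): Promise<number | null> {
  if (buffer.mapState !== "unmapped") return null;
  try {
    await buffer.mapAsync(MAP_MODE_READ);
  } catch {
    return null; // e.g. the buffer was destroyed during tile eviction
  }
  try {
    // WebGPU buffer contents are little-endian.
    return new DataView(buffer.getMappedRange()).getUint32(0, true);
  } finally {
    buffer.unmap(); // always release so the next batch can map again
  }
}
```

Callers that receive null can simply reuse the previous frame's count and try again on the next batch.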

Device lost — GPUDevice.lost resolves. A driver reset, a tab backgrounded too long, or a runaway dispatch that exceeds the watchdog timeout causes the device to be lost; every subsequent submit silently no-ops. Subscribe to device.lost and rebuild the pipeline by re-running device initialization, then re-upload tile buffers. Treat device loss as a routing decision, not just an error — fall back to a CPU clustering path or a WebGL renderer through browser support and fallback routing strategies when re-acquisition fails repeatedly.
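
The routing decision can be isolated in a pure function so it is easy to test. chooseRecovery, the Recovery labels, and the retry threshold are illustrative assumptions; the reason values mirror GPUDeviceLostInfo.reason.

```typescript
type LostReason = "destroyed" | "unknown"; // mirrors GPUDeviceLostInfo.reason
type Recovery = "none" | "reinitialize" | "fallback";

// Decides what to do after device.lost resolves.
function chooseRecovery(
  reason: LostReason,
  failedReacquisitions: number,
  maxReacquisitions = 2,
): Recovery {
  // "destroyed" means the application called device.destroy() itself.
  if (reason === "destroyed") return "none";
  return failedReacquisitions < maxReacquisitions ? "reinitialize" : "fallback";
}

// Sketch of the wiring, assuming a real GPUDevice named `device`:
//   device.lost.then((info) => route(chooseRecovery(info.reason, failures)));
```

Keeping the policy separate from the device.lost subscription lets the fallback threshold be tuned without touching initialization code.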

Atomic counter overflow: silent truncation. When a viewport’s surviving-feature count exceeds the capacity of the compacted index buffer, atomicAdd keeps incrementing past the array bound. WGSL’s out-of-bounds rules mean those writes are either dropped or, worse, redirected to some other in-bounds element of the binding, and the counter itself reports more survivors than were stored. Size the index buffer to the worst-case feature count for the tile, clamp the write with a bounds check (if (write_pos < capacity)) inside the WGSL kernel so overflow degrades to dropped features rather than corruption, and clamp the count to capacity on the CPU after readback.
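
The bounds-checked append can be sketched as a WGSL source string held in TypeScript, plus the matching clamp on the CPU side. Everything here is an assumption made for illustration: the binding numbers, the struct layout, the placeholder predicate, and the names compactShader and clampSurvivorCount. The capacity is taken from arrayLength() on the survivors binding.

```typescript
// WGSL source as a string, suitable for device.createShaderModule({ code }).
const compactShader = /* wgsl */ `
struct Counter {
  count: atomic<u32>,
}

@group(0) @binding(0) var<storage, read> features: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> survivors: array<u32>;
@group(0) @binding(2) var<storage, read_write> counter: Counter;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&features)) { return; }
  if (features[i].x < 0.0) { return; } // placeholder predicate

  let write_pos = atomicAdd(&counter.count, 1u);
  // Overflow degrades to a dropped feature instead of an out-of-bounds write.
  if (write_pos < arrayLength(&survivors)) {
    survivors[write_pos] = i;
  }
}
`;

// The counter keeps counting past capacity, so clamp it after readback.
function clampSurvivorCount(rawCount: number, capacity: number): number {
  return Math.min(rawCount, capacity);
}
```

A raw count larger than the capacity is also a useful signal in its own right: it means the index buffer was undersized for that tile.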

Frame pacing collapse — over-subscribed queue. Submitting a new batch every rAF tick without waiting for onSubmittedWorkDone() lets command buffers pile up faster than the GPU drains them, growing latency until the map feels unresponsive. Cap concurrent in-flight batches (commonly two) and skip recording a new batch while the previous sequence is unresolved.
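
One way to enforce the in-flight cap is a small gate around the submit promise. InFlightGate is a hypothetical helper, not part of the dispatcher shown earlier; it only assumes that submitting a batch returns a promise that settles when the GPU has finished it.

```typescript
// Limits how many submitted batches may be unresolved at once.
class InFlightGate {
  private inFlight = 0;
  private readonly maxInFlight: number;

  constructor(maxInFlight = 2) {
    this.maxInFlight = maxInFlight;
  }

  // True when another batch may be recorded and submitted.
  get hasCapacity(): boolean {
    return this.inFlight < this.maxInFlight;
  }

  // Calls `submit` only when below the cap; returns whether it was called.
  trySubmit(submit: () => Promise<unknown>): boolean {
    if (!this.hasCapacity) return false;
    this.inFlight++;
    const release = (): void => { this.inFlight--; };
    // Release on success and on failure so a rejected batch frees its slot.
    submit().then(release, release);
    return true;
  }
}
```

In a frame loop, checking hasCapacity before recording avoids encoding a batch that would be skipped anyway; the submit callback would then wrap the dispatcher's own submit call.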

Deeper Implementation References

This stage links to a focused companion page that takes a single sub-problem to copy-pasteable depth:

Writing a WGSL Kernel for Point-in-Polygon Clustering — ray-casting containment in WGSL with shared-memory vertex caching and workgroupBarrier() synchronization, for clustering against irregular administrative boundaries instead of a regular grid.

Spatial Compute Shaders & Geometry Pipelines — parent reference for the full GPU clustering architecture
Geometry Filtering with WGSL Compute Shaders — the upstream filter stage that compacts features before binning
Spatial Aggregation in GPU Memory — the downstream reduce stage that turns bins into centroids
Optimization Flags for Compute Dispatches — workgroup occupancy and timestamp profiling for each stage
Memory Alignment for Spatial Data Buffers — the 16-byte rules the binning stage depends on
