Optimization Flags for WebGPU Compute Dispatches

A spatial visualization pans across a continent-scale point cloud, and every frame the same compute pass runs: filter the visible features, aggregate them into a density grid, hand the result to a render pass. The shader is correct, the data fits in VRAM, and yet the frame rate sags to 30 fps the moment the viewport crosses a dense urban core. The cause is almost never the algorithm. It is the dispatch configuration around it: a workgroup size that wastes half the SIMD lanes, a read_write buffer that forces the driver to insert barriers it does not need, an atomic counter that serializes a thousand workgroups onto one memory address, and a mapAsync readback awaited inside the animation frame, which holds the frame until the GPU has finished. This page collects the optimization flags that govern compute dispatches and shows, with runnable code, how to set each one so the pass scales with the data instead of collapsing on the hardest tile.

This is one stage of Spatial Compute Shaders & Geometry Pipelines. Dispatch tuning is the cross-cutting concern that sits over the whole flow — it applies whether you are running the geometry filter, the in-memory aggregation pass, or the asynchronous clustering passes. Because a compute pass is a distinct GPU program with its own scheduling rules, the model here builds on the compute versus render pipeline fundamentals, and it assumes you negotiated the relevant limits during WebGPU device initialization for GIS workloads.

Prerequisites

Before tuning the flags below, confirm the following are in place:

A resolved device and queue. A valid GPUDevice and its GPUQueue, created with the requiredLimits your dataset needs. Dispatch tuning routinely pushes maxComputeInvocationsPerWorkgroup and maxComputeWorkgroupStorageSize, both of which must be requested up front because the portable defaults are conservative.
Browser support and a degradation path. Chrome/Edge 113+ on desktop, Chrome 121+ on Android, and Safari 26 (macOS, iOS, and iPadOS 26) ship WebGPU by default. dispatchWorkgroupsIndirect is part of the core API, but timestamp queries depend on the optional timestamp-query feature, so check adapter.features and route clients that lack WebGPU or the feature through the browser support and fallback routing strategies before relying on them.
Working knowledge of storage buffer alignment. Workgroup-shared arrays and indirect-dispatch argument structs both have strict layout rules — vec4<f32> aligns to 16 bytes, the indirect args are three tightly packed u32 values. The rules are covered under memory alignment for spatial data buffers.
A profiling baseline. You cannot tune what you cannot measure. Have either timestamp-query wired up or a stable wall-clock harness so each flag change is judged against a number, not a hunch. Workgroup sizing in particular is device-specific and must be measured, not assumed.
Data laid out as Structure-of-Arrays. Coalesced memory access — the payoff for most of these flags — depends on adjacent threads reading adjacent addresses. Array-of-Structures packing defeats it before any flag can help.
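The first prerequisite can be sketched as a small clamp, shown here as a minimal illustration rather than a definitive implementation. The helper name `negotiateLimits` and the `ComputeLimits` shape are hypothetical; the idea is only that a requested limit should never exceed what the adapter reports, because a device request for a limit the adapter cannot meet fails.

```typescript
// Hypothetical helper: the names here are illustrative, not part of WebGPU.
interface ComputeLimits {
  maxComputeInvocationsPerWorkgroup: number;
  maxComputeWorkgroupStorageSize: number;
}

// Clamp each wanted limit to what the adapter actually supports.
function negotiateLimits(adapterLimits: ComputeLimits, wanted: ComputeLimits): ComputeLimits {
  const pick = (key: keyof ComputeLimits): number =>
    Math.min(wanted[key], adapterLimits[key]);
  return {
    maxComputeInvocationsPerWorkgroup: pick("maxComputeInvocationsPerWorkgroup"),
    maxComputeWorkgroupStorageSize: pick("maxComputeWorkgroupStorageSize"),
  };
}

// Intended use (assumes `adapter` came from navigator.gpu.requestAdapter()):
//   const requiredLimits = negotiateLimits(adapter.limits, wanted);
//   const device = await adapter.requestDevice({ requiredLimits });
```

The returned object can then drive the choice of `@workgroup_size` and scratchpad size, so the shader is generated against limits the device really granted.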

API and dispatch reference

The fields below are the levers this page tunes. Request the limits at device creation, then set each flag against the workload characteristics of your spatial data.

| Flag / field | Where it lives | Default / value | Spatial-dispatch relevance |
| --- | --- | --- | --- |
| `@workgroup_size(x,y,z)` | WGSL entry attribute | required | Product should be a multiple of 64, which covers the 32-wide NVIDIA warp and both AMD wavefront widths (32 and 64), so SIMD lanes stay full on dense tiles. |
| `maxComputeInvocationsPerWorkgroup` | `GPUSupportedLimits` | 256 | Hard ceiling on the `@workgroup_size` product; 256 is the safe portable value. |
| `maxComputeWorkgroupStorageSize` | `GPUSupportedLimits` | 16384 bytes | Caps total `var<workgroup>` bytes; sizes your per-tile reduction scratchpad. |
| `var<storage, read>` | WGSL address space | n/a | Declares one-way input; lets drivers skip the write-hazard barriers that `read_write` forces. |
| `var<storage, read_write>` | WGSL address space | n/a | Required only when the pass writes the buffer; over-using it inserts needless barriers. |
| `var<workgroup>` | WGSL address space | n/a | Maps to on-chip shared memory (LDS); the staging area for hierarchical atomic reduction. |
| `atomic<u32>` | WGSL type | n/a | Density counters and compaction indices; contention on a single global atomic serializes workgroups. |
| `dispatchWorkgroupsIndirect()` | `GPUComputePassEncoder` | n/a | GPU reads workgroup counts from a buffer, letting a prior pass size the next one with no CPU round-trip. |
| `GPUBufferUsage.INDIRECT` | `GPUBufferDescriptor` | n/a | Required usage flag on the buffer holding `{countX, countY, countZ}` for indirect dispatch. |
| `queue.onSubmittedWorkDone()` | `GPUQueue` | n/a | Resolves when submitted work finishes; the non-blocking signal for swapping buffers off the render path. |

The indirect argument buffer has an exact shape: three contiguous u32 values (workgroupCountX, workgroupCountY, workgroupCountZ) at a 4-byte-aligned offset. Refer to the WebGPU specification on indirect dispatch for the precise buffer layout and alignment constraints.
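A minimal sketch of that layout on the host side follows. The helper names `packIndirectArgs` and `isValidIndirectOffset` are hypothetical; they only restate the three-u32 shape and the 4-byte offset rule described above.

```typescript
const INDIRECT_ARGS_BYTES = 3 * 4; // three u32 values: countX, countY, countZ

// Pack the workgroup counts in the order the indirect dispatch reads them.
function packIndirectArgs(countX: number, countY = 1, countZ = 1): Uint32Array {
  return new Uint32Array([countX, countY, countZ]);
}

// The offset must be a multiple of 4 and leave room for all three values.
function isValidIndirectOffset(offset: number, bufferSize: number): boolean {
  return offset >= 0 && offset % 4 === 0 && offset + INDIRECT_ARGS_BYTES <= bufferSize;
}

// Intended use (assumes an existing `device` and `indirectBuffer`):
//   device.queue.writeBuffer(indirectBuffer, 0, packIndirectArgs(groups));
```

Writing the arguments from the CPU like this is mainly useful for seeding or testing; in the steady state a prior compute pass is expected to write them.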

Implementation walkthrough

Step 1 — Align workgroup size to the hardware warp

The foundation of an optimized compute pass is laid at device.createComputePipeline(), and the single most consequential flag is the @workgroup_size declared on the WGSL entry point. GPUs schedule threads in fixed-width blocks: 32 lanes for an NVIDIA warp, 32 or 64 for an AMD wavefront depending on the generation, and 8 to 32 on Intel parts. If the workgroup size is not a multiple of that width, the final block is padded with idle lanes, so on 32-wide hardware a @workgroup_size(48) occupies 64 lanes while doing 48 lanes of work. For a one-dimensional sweep over a feature array, 64 or 256 are the portable sweet spots, since both are multiples of every common width.

wgsl

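// filter_kernel.wgsl: one thread per feature, sized to the wavefront.
// Each bounds entry is assumed to be (min_x, min_y, max_x, max_y).
struct Viewport { min_x : f32, min_y : f32, max_x : f32, max_y : f32 }

@group(0) @binding(0) var<storage, read>       bounds : array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> survivors : array<u32>;
@group(0) @binding(2) var<storage, read_write> count : atomic<u32>;
// Assumed uniform holding the visible extent; the bind group layout
// needs a matching uniform entry at binding 3.
@group(0) @binding(3) var<uniform>             viewport : Viewport;

@compute @workgroup_size(64)            // multiple of 32 and 64: full lanes on either width
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let idx = gid.x;
  if (idx >= arrayLength(&bounds)) { return; }   // guard the tail block
  let b = bounds[idx];
  // x-axis overlap only: feature max_x (b.z) against viewport min_x, feature min_x (b.x) against viewport max_x
  if (b.z >= viewport.min_x && b.x <= viewport.max_x) {
    let slot = atomicAdd(&count, 1u);            // reserve the next output slot
    survivors[slot] = idx;
  }
}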

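The if (idx >= arrayLength(&bounds)) { return; } guard is mandatory: feature counts are rarely an exact multiple of 64, so the final workgroup extends past the end of the array. WebGPU's bounds checks keep those extra threads from touching memory outside the buffer, but the value an out-of-bounds read returns is not something to rely on, and without the guard those threads can append bogus indices to survivors. The host dispatches ceil(featureCount / 64) workgroups.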
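The host-side count can be sketched as follows. `workgroupCount` is a hypothetical helper, and the 65535 ceiling is the default value of the maxComputeWorkgroupsPerDimension limit; a device created with a higher requested limit would pass its own value instead.

```typescript
// Default maxComputeWorkgroupsPerDimension; pass device.limits' value if you raised it.
const DEFAULT_MAX_GROUPS_PER_DIMENSION = 65535;

// Number of workgroups needed to cover featureCount items, one thread per item.
function workgroupCount(
  featureCount: number,
  workgroupSize: number,
  maxPerDimension: number = DEFAULT_MAX_GROUPS_PER_DIMENSION,
): number {
  if (!Number.isInteger(workgroupSize) || workgroupSize <= 0) {
    throw new RangeError("workgroupSize must be a positive integer");
  }
  const count = Math.ceil(featureCount / workgroupSize);
  if (count > maxPerDimension) {
    // A one-dimensional dispatch cannot cover this many features in one call.
    throw new RangeError(`dispatch needs ${count} workgroups, limit is ${maxPerDimension}`);
  }
  return count;
}

// Intended use (assumes an open compute pass `pass`):
//   pass.dispatchWorkgroups(workgroupCount(featureCount, 64));
```

At a workgroup size of 64, the default limit covers roughly 4.19 million features per one-dimensional dispatch, so a continent-scale sweep may need to be split across several dispatches.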

Step 2 — Keep bind group layouts contiguous

@group and @binding indices should stay contiguous and stable across the shader modules in a multi-pass spatial flow. Fragmented or reshuffled binding layouts force the driver to rebuild descriptor tables between passes, and that rebuild cost compounds across thousands of tiles. Declare one bind group layout and reuse it for every pass that shares the same buffers.

typescript

// One layout, reused across filter / aggregate / compact passes.
// Assumes `device` is the GPUDevice and `filterWgsl` holds the filter kernel source.
const layout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
    { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
    // Assumed extra entry: the viewport uniform that the filter kernel reads.
    { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "uniform" } },
  ],
});

const pipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({ bindGroupLayouts: [layout] }),
  compute: { module: device.createShaderModule({ code: filterWgsl }), entryPoint: "main" },
});

Note the type: "read-only-storage" on binding 0 — this is the host-side counterpart of the var<storage, read> declaration. Declaring the input read-only here lets the implementation reason about hazards and skip barriers it would otherwise insert defensively.

Step 3 — Restrict buffer access modifiers

WGSL storage access modifiers control how the hardware scheduler manages cache coherency and memory barriers. Declaring a buffer read_write when the pass only reads it causes some drivers to insert write-after-read barriers that never fire, costing throughput for nothing. Where data flow is one-directional — a bounds buffer that is only sampled, an attribute buffer that is only read — declare the more restrictive read access mode. Reserve read_write for the outputs that genuinely change.
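One way to keep the WGSL access mode and the host-side binding type from drifting apart is to derive the layout entries from a single list of modes. The sketch below is an assumption about how a project might organize this; `storageEntries` and `StorageEntry` are hypothetical names, and `COMPUTE_STAGE` stands in for GPUShaderStage.COMPUTE so the snippet runs without a WebGPU context.

```typescript
type AccessMode = "read" | "read_write";

// Numeric value of GPUShaderStage.COMPUTE, inlined so the sketch runs anywhere.
const COMPUTE_STAGE = 0x4;

interface StorageEntry {
  binding: number;
  visibility: number;
  buffer: { type: "read-only-storage" | "storage" };
}

// var<storage, read>       pairs with "read-only-storage"
// var<storage, read_write> pairs with "storage"
function storageEntries(modes: AccessMode[]): StorageEntry[] {
  return modes.map((mode, binding): StorageEntry => ({
    binding,
    visibility: COMPUTE_STAGE,
    buffer: { type: mode === "read" ? "read-only-storage" : "storage" },
  }));
}

// Intended use (assumes an existing `device`):
//   device.createBindGroupLayout({ entries: storageEntries(["read", "read_write", "read_write"]) });
```

Because the list is positional, binding indices stay contiguous by construction, which also serves the layout stability goal from Step 2.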

Step 4 — Stage shared state in workgroup memory

Isolate per-tile intermediate state in var<workgroup> arrays, which map to fast on-chip shared memory (LDS / scratchpad) rather than global VRAM. For geometry-heavy passes this cuts L1 thrashing and lifts instruction throughput measurably, because each thread reads its neighbours’ partial results from chip-local memory instead of round-tripping to VRAM. Pre-filtering spatial extents on the GPU before this stage — the technique in geometry filtering with WGSL compute shaders — shrinks the working set further by culling out-of-bounds coordinates at the workgroup level before any shared-memory work begins.

Step 5 — Route atomics through hierarchical reduction

Spatial indexing and density aggregation lean on atomicAdd, atomicMax, and atomicCompareExchangeWeak. Each maps to a hardware instruction whose latency varies sharply by architecture, and unbounded contention on a single global address serializes every workgroup that touches it — the opposite of parallelism. The fix is to reduce within the workgroup first, then have exactly one thread commit the workgroup’s subtotal to global memory.

wgsl

// Hierarchical density aggregation: 256 local adds collapse to one global atomic.
// Assumed bindings: per-feature weights in, one grid cell per workgroup out.
@group(0) @binding(0) var<storage, read>       weights : array<u32>;
@group(0) @binding(1) var<storage, read_write> grid : array<atomic<u32>>;

var<workgroup> tile_sum : atomic<u32>;

@compute @workgroup_size(256)
fn accumulate(@builtin(local_invocation_id)  lid : vec3<u32>,
              @builtin(global_invocation_id) gid : vec3<u32>,
              @builtin(workgroup_id)         wid : vec3<u32>) {
  if (lid.x == 0u) { atomicStore(&tile_sum, 0u); }   // one thread clears the scratchpad
  workgroupBarrier();                                // all threads see the cleared value

  if (gid.x < arrayLength(&weights)) {
    atomicAdd(&tile_sum, weights[gid.x]);            // contention confined to the workgroup
  }
  workgroupBarrier();                                // wait for every local add

  if (lid.x == 0u) {                                 // a single global write per workgroup
    atomicAdd(&grid[wid.x], atomicLoad(&tile_sum));
  }
}

The two workgroupBarrier() calls bracket the local accumulation so no thread reads a half-built subtotal. Pairing this with tile-based spatial hashing, so that different workgroups own disjoint grid cells, removes cross-tile contention entirely. The reference on using @workgroup_id for parallel tile processing shows how to derive that disjoint partition from the dispatch grid. For the exact memory-ordering guarantees of each atomic, see the WGSL atomic operations specification.
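The effect of the reduction on global atomic traffic can be modelled on the CPU. The sketch below is a simplified model, not a GPU benchmark: it folds everything into one total instead of one grid cell per workgroup, and the function names are hypothetical. It only counts how many writes would reach global memory under each scheme.

```typescript
interface ReductionStats {
  total: number;         // final aggregated value
  globalAtomics: number; // number of writes that reach global memory
}

// Naive scheme: every feature issues its own global atomicAdd.
function naiveReduce(weights: number[]): ReductionStats {
  let total = 0;
  let globalAtomics = 0;
  for (const w of weights) {
    total += w;
    globalAtomics += 1;
  }
  return { total, globalAtomics };
}

// Hierarchical scheme: each workgroup sums locally, then commits once.
function hierarchicalReduce(weights: number[], workgroupSize: number): ReductionStats {
  let total = 0;
  let globalAtomics = 0;
  for (let start = 0; start < weights.length; start += workgroupSize) {
    let tileSum = 0; // stands in for the var<workgroup> accumulator
    for (const w of weights.slice(start, start + workgroupSize)) {
      tileSum += w;
    }
    total += tileSum;   // the single global commit for this workgroup
    globalAtomics += 1;
  }
  return { total, globalAtomics };
}
```

For 1000 unit weights at a workgroup size of 256, the model gives the same total under both schemes while the global write count falls from 1000 to 4.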

Step 6 — Let the GPU size its own dispatch with indirect dispatch

Static dispatchWorkgroups() calls assume uniform data distribution, which almost never holds for real GIS datasets: urban cores, dense point clouds, and sparse rural geometry produce severe load imbalance. dispatchWorkgroupsIndirect() reads the workgroup counts from a buffer marked GPUBufferUsage.INDIRECT, so a prior compute pass can compute occupancy and write the next pass’s dispatch size — no CPU branch, no round-trip.

typescript

// A prior pass writes {countX, countY, countZ} into this buffer.
const indirectBuffer = device.createBuffer({
  size: 3 * 4,                                   // three u32: countX, countY, countZ
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

const pass = encoder.beginComputePass();
pass.setPipeline(clusterPipeline);
pass.setBindGroup(0, clusterBindGroup);
pass.dispatchWorkgroupsIndirect(indirectBuffer, 0);   // GPU reads the count it just computed
pass.end();

This pairs directly with async dispatch patterns for spatial clustering, where an occupancy map is computed asynchronously and fed into the clustering or spatial-join pass without the CPU ever learning the count.
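A minimal sketch of the producing pass is shown below, with a CPU mirror of its arithmetic so the sizing logic can be unit tested. The entry point name `size_next`, the binding numbers, and the 64-thread consumer size are all assumptions made for illustration; the 65535 clamp is the default maxComputeWorkgroupsPerDimension.

```typescript
// Assumed consumer workgroup size and the default per-dimension dispatch limit.
const WORKGROUP_SIZE = 64;
const MAX_GROUPS = 65535;

// WGSL for a one-thread producer pass that sizes the next dispatch on the GPU.
// `count` is the survivor counter a previous pass filled in; `args` is the
// buffer later handed to dispatchWorkgroupsIndirect().
const sizeNextPassWgsl = /* wgsl */ `
@group(0) @binding(0) var<storage, read_write> count : atomic<u32>;
@group(0) @binding(1) var<storage, read_write> args  : array<u32, 3>;

@compute @workgroup_size(1)
fn size_next() {
  let n = atomicLoad(&count);
  args[0] = min((n + ${WORKGROUP_SIZE - 1}u) / ${WORKGROUP_SIZE}u, ${MAX_GROUPS}u);
  args[1] = 1u;
  args[2] = 1u;
}`;

// CPU mirror of the same arithmetic, useful for unit tests.
function indirectArgsFor(survivorCount: number): [number, number, number] {
  const needed = Math.ceil(survivorCount / WORKGROUP_SIZE);
  return [Math.min(needed, MAX_GROUPS), 1, 1];
}
```

Clamping keeps a runaway count from producing an oversized dispatch, but it also means the consumer sees only the first 65535 workgroups' worth of survivors, so a real pipeline would need a follow-up dispatch for the remainder.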

Step 7 — Synchronize off the render-blocking path

Every flag above is wasted if CPU-GPU synchronization blocks the render loop. Backend teams routinely generate spatial indices, quadtree partitions, or simplified meshes, but streaming them to the GPU by awaiting mapAsync() inside an animation-frame callback holds that frame's work until the map resolves. Instead, keep a ring of staging buffers, run uploads outside the frame callback, and use queue.onSubmittedWorkDone() as a non-blocking completion signal.

typescript

// Triple-buffered staging ring (RING = 3). Call uploadNextChunk outside the requestAnimationFrame callback.
const RING = 3;
const staging = Array.from({ length: RING }, () =>
  device.createBuffer({ size: payloadBytes, usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC }));
let cursor = 0;

async function uploadNextChunk(chunk: Float32Array) {
  const buf = staging[cursor];
  cursor = (cursor + 1) % RING;                  // rotate so the GPU never reads a buffer we are writing
  await buf.mapAsync(GPUMapMode.WRITE);
  new Float32Array(buf.getMappedRange()).set(chunk);
  buf.unmap();

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(buf, 0, storageBuffer, 0, payloadBytes);
  device.queue.submit([encoder.finish()]);

  device.queue.onSubmittedWorkDone().then(() => {  // non-blocking: swap indices when safe
    swapSpatialIndexReference();
  });
}

mapAsync() does not resolve while the GPU is still using the buffer, so the CPU never writes a buffer the GPU is reading; rotating indices modulo the ring length means the next buffer to map is usually one the GPU finished with earlier, so the wait is short. The onSubmittedWorkDone() callback lets indirect-dispatch buffers or spatial-index references be swapped only after the GPU is done. None of this stalls the render loop, which preserves the 16.6 ms frame budget for interactive panning and zooming.
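If uploads can arrive faster than the GPU drains them, the ring also needs to know which slots are still in flight. The sketch below is one possible bookkeeping scheme, not part of WebGPU: `StagingRing` is a hypothetical class that hands out a slot index only when that slot has been released.

```typescript
// Tracks which staging slots are in flight; holds no GPU objects itself.
class StagingRing {
  private readonly busy: boolean[];
  private cursor = 0;

  constructor(size: number) {
    this.busy = new Array<boolean>(size).fill(false);
  }

  // Returns the next free slot index, or null when every slot is in flight.
  acquire(): number | null {
    for (let step = 0; step < this.busy.length; step++) {
      const slot = (this.cursor + step) % this.busy.length;
      if (!this.busy[slot]) {
        this.busy[slot] = true;
        this.cursor = (slot + 1) % this.busy.length;
        return slot;
      }
    }
    return null;
  }

  // Call once the GPU has finished with the slot's buffer.
  release(slot: number): void {
    this.busy[slot] = false;
  }
}

// Intended use (assumes an existing `device` and a `staging` buffer array):
//   const slot = ring.acquire();
//   if (slot !== null) {
//     /* map staging[slot], write, unmap, submit the copy */
//     device.queue.onSubmittedWorkDone().then(() => ring.release(slot));
//   }
```

When acquire() returns null, the caller can defer the chunk to a later tick instead of queuing a second map on a buffer that is still busy.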

Memory and performance implications

The flags interact through two scarce resources: on-chip shared memory and global memory bandwidth. var<workgroup> storage is capped by maxComputeWorkgroupStorageSize (16 KiB on the portable baseline), so a 256-thread workgroup that keeps a vec4<f32> scratchpad per thread already spends 4 KiB — size the reduction buffer against that ceiling, not against convenience. Push past it and pipeline creation fails validation.
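The budget arithmetic above can be checked before a shader is ever compiled. This is a simplified sketch with hypothetical helper names: it counts only a per-thread scratchpad and ignores any other var<workgroup> declarations in the module, which would also count against the limit.

```typescript
// Portable default for maxComputeWorkgroupStorageSize, in bytes.
const DEFAULT_WORKGROUP_STORAGE_BYTES = 16384;

// Bytes consumed by a scratchpad that stores bytesPerThread for every thread.
function workgroupStorageBytes(workgroupSize: number, bytesPerThread: number): number {
  return workgroupSize * bytesPerThread;
}

// True when the scratchpad fits inside the given limit.
function fitsWorkgroupStorage(
  workgroupSize: number,
  bytesPerThread: number,
  limitBytes: number = DEFAULT_WORKGROUP_STORAGE_BYTES,
): boolean {
  return workgroupStorageBytes(workgroupSize, bytesPerThread) <= limitBytes;
}
```

A vec4<f32> per thread is 16 bytes, so 256 threads use 4096 bytes and fit comfortably; eight vec4<f32> values per thread would need 32768 bytes and exceed the portable baseline.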

Workgroup sizing is a trade between occupancy and per-thread shared memory: a larger @workgroup_size amortizes the single global atomic write across more local adds (Step 5), but it also consumes more registers and shared memory per workgroup, which can lower how many workgroups the scheduler keeps resident. The only honest way to choose is to profile across your target device classes — the same kernel that peaks at @workgroup_size(256) on a discrete NVIDIA part may prefer 64 on an integrated GPU.

For VRAM, the dominant cost is usually the over-allocated output buffers (survivor index lists, density grids) rather than the inputs. Hierarchical reduction is what keeps bandwidth in check: collapsing thousands of per-feature atomicAdd calls into one global write per workgroup can drop global memory traffic by an order of magnitude on dense tiles, which is precisely where the naive version stalls. Indirect dispatch then ensures you only pay for the workgroups the data actually needs — a sparse rural tile dispatches a handful of workgroups instead of the worst-case grid.

We can quantify the dispatch count directly. For a one-dimensional sweep over $N$ features at workgroup size $W$, the launched workgroup count is

$$G = \left\lceil \frac{N}{W} \right\rceil$$

and the wasted lanes in the tail block are $GW - N$. Choosing $W$ as a divisor-friendly multiple of the warp width minimizes that waste while keeping enough resident workgroups to hide memory latency.
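As a worked instance of the formula, take an illustrative feature count of $N = 1{,}000{,}000$ (an assumed figure, not a measurement) at two workgroup sizes:

$$W = 256:\quad G = \left\lceil \frac{1{,}000{,}000}{256} \right\rceil = 3907, \qquad GW - N = 1{,}000{,}192 - 1{,}000{,}000 = 192$$

$$W = 64:\quad G = \left\lceil \frac{1{,}000{,}000}{64} \right\rceil = 15625, \qquad GW - N = 1{,}000{,}000 - 1{,}000{,}000 = 0$$

In both cases the tail waste is a negligible fraction of the total lanes, so at this scale the choice between 64 and 256 is likely to be decided by occupancy and atomic amortization rather than by the tail block.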

Failure modes and diagnostics

GPUValidationError: workgroup storage exceeds limit. A var<workgroup> array sized past maxComputeWorkgroupStorageSize fails validation in createComputePipeline(). Detection: the call does not throw; it returns an invalid pipeline and reports the error through an enclosing error scope or the device's uncapturederror event, while createComputePipelineAsync() rejects its promise. Fix: shrink the per-thread scratchpad, lower @workgroup_size, or split the reduction into two passes.
GPUValidationError: indirect buffer missing INDIRECT usage. Calling dispatchWorkgroupsIndirect() with a buffer created without GPUBufferUsage.INDIRECT fails validation and invalidates the command encoder, so the pass never runs. Detection: a validation error naming the buffer usage, delivered through an error scope or the uncapturederror event rather than thrown. Fix: add INDIRECT to the descriptor and keep the offset of the {countX, countY, countZ} struct a multiple of 4.
Silent throughput collapse from atomic contention. When every workgroup hammers one global counter, the pass still produces correct results but runs serialized — no error, just a flat frame-time cliff on dense tiles. Detection: frame time scales with feature density, not with the algorithm’s expected cost; a timestamp query pins the time inside the atomic-heavy pass. Fix: introduce workgroup-local reduction (Step 5) and disjoint tile partitioning.
Tail-block out-of-bounds reads. A dispatch of ceil(N/W) workgroups runs past the end of the array in its last block. Omitting the idx >= arrayLength(...) guard lets those threads read values that WebGPU's bounds checks leave unspecified and write results for features that do not exist. Detection: nondeterministic wrong results near array boundaries. Fix: add the bound check as the first statement of the kernel.
OperationError on mapAsync. Calling mapAsync on a staging buffer that is already mapped or still has a map pending, or on one created without MAP_WRITE, rejects the promise and breaks the upload ring. Detection: the returned promise rejects. Fix: size the ring so a buffer is reused only after onSubmittedWorkDone() has resolved for its prior submission, and never start a second map on the same buffer.
Device lost from an oversized dispatch. A single indirect dispatch that computes an enormous workgroup count can trip the driver watchdog (TDR), invalidating every buffer and pipeline. Detection: device.lost resolves with a reason. Fix: clamp the computed count in the producing shader, cap per-dispatch work, and recover through the fallback routing strategies.

Continue in this section

Using @workgroup_id for Parallel Tile Processing — derive disjoint tile partitions from the dispatch grid to isolate atomic hotspots and keep memory access coalesced.

Geometry Filtering with WGSL Compute Shaders — the filter pass whose workgroup sizing and access modifiers these flags tune.
Spatial Aggregation in GPU Memory — density grids and centroids where hierarchical atomic reduction pays off most.
Async Dispatch Patterns for Spatial Clustering — feeding occupancy maps into indirect dispatch without a CPU round-trip.
Memory Alignment for Spatial Data Buffers — the layout rules behind workgroup scratchpads and indirect argument structs.
WebGPU Compute vs Render Pipeline Fundamentals — where a compute dispatch sits relative to the render path it must not block.
