Geometry Filtering with WGSL Compute Shaders

Picture a vector tile server that hands the browser a 4-million-feature road network and a viewport that changes every animation frame. The classic answer — walk an R-tree on the main thread, push the survivors into a typed array, and re-upload — collapses the moment the coordinate array exceeds the default maxBufferSize or the predicate has to combine a spatial bound with an attribute filter (road class, year built, traffic volume). CPU-side spatial indexing cannot hold a 16 ms frame budget at that scale, and the JavaScript thread that owns the index is the same thread that owns rendering. The work this page addresses is moving the entire filter — predicate evaluation and the compaction that turns a sparse pass/fail mask into a dense list of survivors — onto the GPU as a WGSL compute pass, so the survivors never leave VRAM before they are drawn or aggregated.

This is one stage of Spatial Compute Shaders & Geometry Pipelines. Filtering sits at the front of that flow: it ingests the raw uploaded geometry and emits a compacted index buffer that downstream passes consume. Because a compute pass is a distinct GPU program — not a fragment shader masquerading as one — the mental model here builds directly on the compute versus render pipeline fundamentals, and it assumes you have already negotiated a GPUDevice during WebGPU device initialization for GIS workloads.

Prerequisites

Before implementing the filter described below, confirm the following are in place:

WebGPU device in hand. A resolved GPUDevice and its GPUQueue, acquired with appropriate requiredLimits. Filtering large datasets routinely pushes maxStorageBufferBindingSize and maxBufferSize, both of which must be requested up front because the defaults (128 MiB and 256 MiB respectively) are conservative.
Browser support. Chrome/Edge 113+ on desktop ship WebGPU by default, as do Chrome 121+ on Android and Safari 26 (macOS, iOS, and iPadOS 26). For everything else you need a graceful degradation path: wire this filter behind the browser support and fallback routing strategies so unsupported clients drop back to a CPU spatial index.
A grasp of storage buffer alignment. WGSL aligns vec4<f32> to 16 bytes and u32 to 4 bytes; mismatched host-side packing silently corrupts predicate reads. The rules are covered in depth under memory alignment for spatial data buffers.
Data format assumptions. Geometry has been reduced to per-primitive axis-aligned bounding boxes (min_x, min_y, max_x, max_y) plus a packed u32 attribute word, serialized as Structure-of-Arrays. Coarse bbox filtering is the first stage; exact tests (point-in-polygon, winding number) run on the survivors.
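
One way to ask for raised limits without requesting more than the adapter can offer is to clamp each wanted value against what the adapter reports. The sketch below is illustrative only: `pickRequiredLimits` and `LimitMap` are hypothetical names, and the result is meant to be passed as `requiredLimits` to `adapter.requestDevice()`.

```typescript
// Local stand-in for the numeric fields of GPUSupportedLimits (hypothetical type).
type LimitMap = Record<string, number>;

// Hypothetical helper: clamp the limits we would like to what the adapter reports.
function pickRequiredLimits(
  adapterLimits: Readonly<Record<string, number | undefined>>,
  wanted: LimitMap,
): LimitMap {
  const required: LimitMap = {};
  for (const [name, value] of Object.entries(wanted)) {
    const available = adapterLimits[name];
    if (available === undefined) continue; // limit unknown to this adapter: skip it
    required[name] = Math.min(value, available);
  }
  return required;
}

// Usage sketch (assumes `adapter` came from navigator.gpu.requestAdapter()):
// const device = await adapter.requestDevice({
//   requiredLimits: pickRequiredLimits(adapter.limits as any, {
//     maxBufferSize: 1024 * 1024 * 1024,
//     maxStorageBufferBindingSize: 512 * 1024 * 1024,
//   }),
// });
```

Clamping avoids a rejected `requestDevice()` on adapters that cannot reach the wanted values; the caller then sizes buffers against `device.limits` rather than against the wish list.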

API and alignment reference

The fields below are the ones that govern whether a filter pass is even constructible at GIS dataset sizes. Request the limits at device creation, then size every buffer against them.

Field / descriptor	Where it lives	Default	Spatial-filter relevance
`maxBufferSize`	`GPUSupportedLimits`	256 MiB	A 4 M-feature `vec4<f32>` bounds buffer is 64 MiB; the index buffer adds more. Request the ceiling.
`maxStorageBufferBindingSize`	`GPUSupportedLimits`	128 MiB	Caps a single `var<storage>` binding. Split oversize datasets into chunked dispatches if exceeded.
`maxComputeInvocationsPerWorkgroup`	`GPUSupportedLimits`	256	Upper bound on `@workgroup_size` product. 256 is the safe portable value.
`maxComputeWorkgroupSizeX`	`GPUSupportedLimits`	256	Per-dimension cap on a 1-D filter dispatch.
`usage: STORAGE`	`GPUBufferDescriptor`	—	Required for read/write compute access to bounds, attrs, and the index buffer.
`usage: COPY_SRC`	`GPUBufferDescriptor`	—	Lets the atomic count buffer be copied to a `MAP_READ` staging buffer for readback.
`@align(16)`	WGSL attribute	type rule	Struct-member attribute. `vec4<f32>` already has 16-byte alignment, so `array<vec4<f32>>` has a 16-byte stride without it; reach for it only when padding members inside a struct.
`atomic<u32>`	WGSL type	—	The compaction counter; guarantees a unique write slot per surviving primitive.

WGSL storage buffers use runtime-sized arrays (array<vec4<f32>>), so the survivor count is read back rather than known at compile time. That single fact drives the architecture: you cannot pre-allocate an exactly-sized output, so you over-allocate the index buffer and track the real length in an atomic counter.
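
The chunking that the table mentions can be planned on the host before any buffer is created. The following is a minimal sketch, assuming the 16-byte bounds element is the largest per-feature binding; `planChunks` and `Chunk` are hypothetical names, not part of any API.

```typescript
// One dispatch-sized slice of the dataset (hypothetical shape).
interface Chunk {
  firstFeature: number;
  featureCount: number;
}

// Hypothetical helper: split the dataset so each chunk's bounds binding
// (bytesPerFeature bytes per feature) fits under the binding-size limit.
function planChunks(totalFeatures: number, maxBindingBytes: number, bytesPerFeature = 16): Chunk[] {
  const perChunk = Math.floor(maxBindingBytes / bytesPerFeature);
  if (perChunk < 1) throw new RangeError("binding limit is smaller than one feature");
  const chunks: Chunk[] = [];
  for (let first = 0; first < totalFeatures; first += perChunk) {
    chunks.push({ firstFeature: first, featureCount: Math.min(perChunk, totalFeatures - first) });
  }
  return chunks;
}
```

With the default 128 MiB binding limit, `planChunks(20000000, 128 * 1024 * 1024)` yields three chunks. If the chunks are bound as sub-ranges of one large buffer, the byte offsets must also respect `minStorageBufferOffsetAlignment`, and the kernel has to add `firstFeature` to the index it scatters.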

Implementation walkthrough

Step 1 — Lay out the data as Structure-of-Arrays

Serialize spatial data into Structure-of-Arrays (SoA) rather than Array-of-Structures so that one thread’s read of bounds[idx] and its neighbour’s read of bounds[idx+1] touch adjacent memory — the access pattern the GPU coalesces into a single transaction. Each primitive contributes one vec4<f32> of extents and one u32 of packed attribute flags, held in separate buffers. Python backend teams producing these payloads should emit the coordinate columns tightly packed so a Float32Array view maps straight onto the GPU layout with no client-side reshaping.

typescript

// Host-side: build the SoA payload and create storage buffers.
// boundsData holds [min_x, min_y, max_x, max_y] per primitive, contiguous.
const featureCount = boundsData.length / 4;

const boundsBuffer = device.createBuffer({
  size: boundsData.byteLength, // 16 bytes * featureCount, already 16-aligned
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(boundsBuffer, 0, boundsData);

const attrsBuffer = device.createBuffer({
  size: attrsData.byteLength, // 4 bytes * featureCount
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(attrsBuffer, 0, attrsData);

Keep vec4<f32> arrays on 16-byte strides — the buffer above is naturally aligned because each element is exactly 16 bytes. The instant you interleave a stray f32 into the same struct you invite implicit padding that inflates VRAM and breaks index arithmetic; that is precisely why bounds and attributes live in separate bindings here.
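
For payloads that arrive as per-feature objects rather than pre-packed columns, the SoA layout can be produced with a short packing pass. This is a sketch under assumptions: the `FeatureBox` shape and the `packSoA` name are hypothetical, since the article does not define a host-side feature type.

```typescript
// Hypothetical input shape: one bounding box plus a packed attribute word.
interface FeatureBox {
  minX: number;
  minY: number;
  maxX: number;
  maxY: number;
  attrs: number;
}

// Pack features into the two columns the filter binds separately.
function packSoA(features: FeatureBox[]): { boundsData: Float32Array; attrsData: Uint32Array } {
  const boundsData = new Float32Array(features.length * 4); // one vec4<f32> per feature
  const attrsData = new Uint32Array(features.length);       // one u32 per feature
  features.forEach((f, i) => {
    boundsData.set([f.minX, f.minY, f.maxX, f.maxY], i * 4);
    attrsData[i] = f.attrs >>> 0; // coerce to an unsigned 32-bit value
  });
  return { boundsData, attrsData };
}
```

The two typed arrays map directly onto `boundsBuffer` and `attrsBuffer` above: 16 bytes and 4 bytes per feature respectively, with no padding between elements.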

Step 2 — Allocate the compacted output and atomic counter

Because the survivor count is unknown until the pass runs, allocate the index buffer at the worst case (featureCount entries) and a separate four-byte buffer for the atomic counter. The counter buffer needs COPY_SRC so its value can be copied to a mappable staging buffer after the dispatch.

typescript

const validIndices = device.createBuffer({
  size: featureCount * 4, // u32 per potential survivor, worst case = all pass
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

const countBuffer = device.createBuffer({
  size: 4, // single atomic<u32>
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});

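const filterUniform = device.createBuffer({
  size: 16, // vec4<f32> filter bbox: min_x, min_y, max_x, max_y
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// Mappable staging buffer for the counter readback in Step 5.
// MAP_READ may only be combined with COPY_DST.
const countStaging = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});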

Step 3 — Write the WGSL filter kernel

The kernel runs one invocation per primitive. Every thread evaluates the predicate — there is no early branch around the expensive work, which keeps execution uniform across the wavefront — and only the write is conditional. Survivors claim a unique slot with atomicAdd, which returns the pre-increment value and so doubles as a scatter address. This is stream compaction: a sparse pass/fail mask collapses into a dense prefix of survivor indices.

wgsl

// Storage buffers use runtime-sized arrays; fixed lengths in a struct field
// would require compile-time constants, so each array is its own top-level binding.
@group(0) @binding(0) var<storage, read>       bounds: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read>       attrs: array<u32>;
@group(0) @binding(2) var<storage, read_write> valid_indices: array<u32>;
@group(0) @binding(3) var<storage, read_write> count: atomic<u32>;
@group(0) @binding(4) var<uniform>             filter_bbox: vec4<f32>;

// Attribute mask: keep only features whose class bit is set.
const CLASS_MASK: u32 = 0x0000000fu;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let idx = gid.x;
    if (idx >= arrayLength(&bounds)) { return; } // guard the tail workgroup

    let b = bounds[idx];                 // (min_x, min_y, max_x, max_y)
    let f = filter_bbox;                 // viewport extent

    // Axis-aligned overlap test: cheap, branch-free, evaluated by every thread.
    let intersects = (b.x <= f.z) && (b.z >= f.x) &&
                     (b.y <= f.w) && (b.w >= f.y);

    // Compound predicate: spatial bound AND an attribute bit, single pass.
    let attr_ok = (attrs[idx] & CLASS_MASK) != 0u;

    if (intersects && attr_ok) {
        let write_pos = atomicAdd(&count, 1u); // unique, monotonically-increasing slot
        valid_indices[write_pos] = idx;        // scatter the survivor's original index
    }
}

Combining the spatial and attribute predicates in one pass is the point: bit-packing categorical flags into a u32 and testing them with & costs a single ALU op and avoids a second dispatch over the data. For predicates heavier than a bbox test (exact point-in-polygon, radial distance), this kernel is the coarse first stage and the survivors feed a precise second kernel. Atomic operations follow the rules in the WGSL Specification, so every concurrent atomicAdd returns a unique slot even under maximum contention. What is not guaranteed is order: survivors land in whatever sequence the invocations happen to execute, so valid_indices is dense but unsorted and can differ from run to run.
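
A CPU mirror of the predicate is handy for checking the kernel's output on small datasets. The sketch below is not part of the pipeline above; `filterReference` and `sameSurvivors` are hypothetical names, and the comparison sorts the GPU result first because the scatter order depends on how invocations are scheduled.

```typescript
// CPU mirror of the kernel's compound predicate (bbox overlap AND attribute mask).
function filterReference(
  bounds: Float32Array,
  attrs: Uint32Array,
  bbox: [number, number, number, number], // min_x, min_y, max_x, max_y
  classMask = 0xf,
): number[] {
  const [fx0, fy0, fx1, fy1] = bbox;
  const survivors: number[] = [];
  for (let i = 0; i < attrs.length; i++) {
    const minX = bounds[4 * i], minY = bounds[4 * i + 1];
    const maxX = bounds[4 * i + 2], maxY = bounds[4 * i + 3];
    const intersects = minX <= fx1 && maxX >= fx0 && minY <= fy1 && maxY >= fy0;
    const attrOk = (attrs[i] & classMask) !== 0;
    if (intersects && attrOk) survivors.push(i); // ascending order by construction
  }
  return survivors;
}

// Compare a GPU result (any order) against the ascending CPU reference.
function sameSurvivors(gpu: Uint32Array, cpu: number[]): boolean {
  const sorted = Array.from(gpu).sort((a, b) => a - b);
  return sorted.length === cpu.length && sorted.every((v, i) => v === cpu[i]);
}
```

In a test, read back the first `count` entries of `valid_indices` and pass them to `sameSurvivors` together with the reference list.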

Step 4 — Build the pipeline and bind group

typescript

const module = device.createShaderModule({ code: WGSL_FILTER_SOURCE });
const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: { module, entryPoint: "main" },
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: boundsBuffer } },
    { binding: 1, resource: { buffer: attrsBuffer } },
    { binding: 2, resource: { buffer: validIndices } },
    { binding: 3, resource: { buffer: countBuffer } },
    { binding: 4, resource: { buffer: filterUniform } },
  ],
});

Step 5 — Dispatch, reset the counter, and read back

Reset the atomic counter to zero before every dispatch — a stale count corrupts the next frame’s compaction. Then size the dispatch by dividing the feature count by the workgroup size, rounding up, and let the in-kernel arrayLength guard absorb the ragged final workgroup.

typescript

function runFilter(filterBbox: Float32Array): void {
  device.queue.writeBuffer(filterUniform, 0, filterBbox);
  device.queue.writeBuffer(countBuffer, 0, new Uint32Array([0])); // reset per dispatch

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  const workgroups = Math.ceil(featureCount / 256); // matches @workgroup_size(256)
  pass.dispatchWorkgroups(workgroups);
  pass.end();

  // Stage the count for readback; never map a STORAGE buffer directly.
  encoder.copyBufferToBuffer(countBuffer, 0, countStaging, 0, 4);
  device.queue.submit([encoder.finish()]);
}

async function readSurvivorCount(): Promise<number> {
  await device.queue.onSubmittedWorkDone();   // await GPU completion, not a CPU timer
  await countStaging.mapAsync(GPUMapMode.READ);
  const n = new Uint32Array(countStaging.getMappedRange())[0];
  countStaging.unmap();
  return n;
}
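
The round-up in `runFilter` can be factored into a helper that also checks the per-dimension dispatch limit (`maxComputeWorkgroupsPerDimension`, 65535 by default). This is a sketch; `workgroupsFor` is a hypothetical name, and a real caller would pass the value from `device.limits` rather than relying on the default.

```typescript
// Hypothetical helper: whole workgroups needed for a 1-D dispatch, with a limit check.
function workgroupsFor(featureCount: number, workgroupSize = 256, maxPerDimension = 65535): number {
  const groups = Math.ceil(featureCount / workgroupSize);
  if (groups > maxPerDimension) {
    throw new RangeError(
      `${groups} workgroups exceeds the per-dimension limit of ${maxPerDimension}; chunk the dispatch`,
    );
  }
  return groups;
}
```

At a workgroup size of 256 the default limit is reached just under 16.8 million features, so the 4 M-feature example (15625 workgroups) sits comfortably inside it.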

Awaiting onSubmittedWorkDone() instead of polling on a timer means the staging buffer is mapped as soon as the queue drains, without spinning the main thread. (mapAsync also waits for pending GPU work on the buffer, so the explicit await is a belt-and-braces step rather than a requirement.) Whenever practical, skip the readback entirely: bind validIndices as a vertex or index buffer for a render pass (which requires adding VERTEX or INDEX to its usage flags at creation), or route it into spatial aggregation in GPU memory for density grids and clustered centroids, so the survivors are consumed without a CPU round-trip. If the count is needed only to size a draw, prefer drawIndirect or drawIndexedIndirect over a CPU readback; the indirect buffer needs INDIRECT usage and must hold the full argument block, not just the bare four-byte counter.
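
As a sketch of the indirect route, assume each survivor is drawn as one instance (an assumption; the pipeline above does not set this up). The drawIndirect argument block is four consecutive u32 values, and the kernel's counter can live in the instanceCount word. `DRAW_ARGS`, `initialDrawArgs`, and `INSTANCE_COUNT_OFFSET` are hypothetical names.

```typescript
// Word positions inside a drawIndirect argument block (four consecutive u32 values).
const DRAW_ARGS = { vertexCount: 0, instanceCount: 1, firstVertex: 2, firstInstance: 3 } as const;

// Hypothetical helper: initial contents of the indirect buffer.
// Assumption: each survivor is drawn as one instance of `verticesPerInstance` vertices.
function initialDrawArgs(verticesPerInstance: number): Uint32Array {
  const args = new Uint32Array(4); // all words start at zero
  args[DRAW_ARGS.vertexCount] = verticesPerInstance;
  // instanceCount stays 0: the filter kernel would atomicAdd into this word
  // instead of a standalone counter, so the draw sees the survivor count directly.
  return args;
}

// Byte offset of the word the kernel increments; also the word to re-zero each frame.
const INSTANCE_COUNT_OFFSET = DRAW_ARGS.instanceCount * Uint32Array.BYTES_PER_ELEMENT;
```

Under this design the buffer would be created with STORAGE | INDIRECT | COPY_DST usage, rewritten with `initialDrawArgs(...)` before each filter dispatch, and consumed by `pass.drawIndirect(indirectBuffer, 0)` in the render pass, so the count never crosses to the CPU.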

Step 6 — Stream incremental updates

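For datasets that grow or pan, avoid reallocating buffers mid-frame. Pre-allocate at the worst case and stream new geometry chunks into the existing storage with copyBufferToBuffer, recycling allocations across frames. Note that arrayLength(&bounds) reports the size of the bound range, not the number of live features, so once the buffer is over-allocated either bind only the live range (offset and size on the bind group entry) or pass the live count in a uniform and guard against that instead.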

typescript

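function appendChunk(srcChunk: GPUBuffer, dstOffsetBytes: number, byteLength: number): void {
  // Assumes boundsBuffer was created at worst-case capacity (not boundsData.byteLength as in Step 1)
  // and srcChunk has COPY_SRC usage. Offsets and size must be multiples of 4 bytes.
  if (dstOffsetBytes + byteLength > boundsBuffer.size) throw new RangeError("chunk exceeds boundsBuffer capacity");
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(srcChunk, 0, boundsBuffer, dstOffsetBytes, byteLength);
  device.queue.submit([encoder.finish()]);
  // attrsBuffer needs the matching 4-byte-per-feature append so indices stay in step.
}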
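
The offsets passed to `appendChunk` have to come from somewhere. One option is a small host-side cursor that tracks how many features are live in the pre-allocated buffer. This is a sketch; `createAppendCursor` is a hypothetical name and the 16-byte default matches the bounds element size used above.

```typescript
// Hypothetical bookkeeping for a buffer pre-allocated to hold `capacity` features.
function createAppendCursor(capacity: number, bytesPerFeature = 16) {
  let live = 0; // number of features already written
  return {
    // Reserve room for `count` features and return the byte range to copy into.
    reserve(count: number): { dstOffsetBytes: number; byteLength: number } {
      if (!Number.isInteger(count) || count < 0) {
        throw new RangeError("count must be a non-negative integer");
      }
      if (live + count > capacity) {
        throw new RangeError("append exceeds pre-allocated capacity");
      }
      const range = { dstOffsetBytes: live * bytesPerFeature, byteLength: count * bytesPerFeature };
      live += count;
      return range;
    },
    // Number of live features; this, not the buffer capacity, bounds the dispatch.
    liveCount(): number {
      return live;
    },
  };
}
```

A caller would do `const { dstOffsetBytes, byteLength } = cursor.reserve(n)` and hand both values to `appendChunk`, then size the next dispatch from `cursor.liveCount()`.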

Memory and performance implications

A vec4<f32> bounds buffer costs 16 bytes per feature: roughly 16 MiB per million features, 64 MiB at 4 M. The attribute buffer adds 4 bytes each, and the worst-case index buffer matches the bounds count in u32 slots, so budget roughly 24 bytes per feature for the full filter set, before any downstream aggregation buffers. At 4 M features that is ~96 MiB resident. Note that maxBufferSize and maxStorageBufferBindingSize apply per buffer and per binding, not to the total: at 4 M features every buffer here still fits under the defaults, and the bounds buffer first crosses the 128 MiB binding default at about 8.4 M features. Beyond that point, request higher limits at device creation or chunk the dataset into multiple dispatches.

@workgroup_size(256) is the portable default: it hits the maxComputeInvocationsPerWorkgroup floor that every conformant device guarantees, and it maps cleanly onto AMD wavefronts (64) and NVIDIA warps (32) without remainder. It is rarely optimal, though — atomic contention on the single counter rises with occupancy, and the right size depends on the survivor ratio and device class. The full occupancy and divergence analysis, including how to profile with timestamp-query, lives in optimizing workgroup sizes for vector geometry filtering.

Transfer cost is the other half of the budget. A writeBuffer of 64 MiB is not free; for incremental updates, copying only the changed chunk with copyBufferToBuffer keeps the per-frame transfer proportional to the delta, not the dataset. Keeping survivors on the GPU — bound directly as vertex input or aggregation source — removes the largest single cost, the readback of the index buffer over the PCIe bus.

A useful mental model for the atomic counter is the prefix it produces. With $N$ features and a survival probability $p$, the expected survivor count is $E[s] = pN$. The index buffer is densely packed in $[0, s)$ but allocated for $N$, so a low survival ratio means most of the allocation is dead space: VRAM you pay for but never write. That asymmetry is what makes coarse-then-precise staging worthwhile: the cheap bbox pass cuts the expensive pass's input from $N$ candidates to roughly $pN$.
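
To put rough numbers on the staging argument, assume a per-feature cost $c_b$ for the bbox test and $c_e$ for the exact test. These symbols are introduced here for illustration and ignore dispatch overhead and memory traffic; $p$ is the fraction of features that pass the bbox stage.

```latex
\begin{aligned}
C_{\text{single}} &= N\,c_e \\
C_{\text{staged}} &= N\,c_b + pN\,c_e \\
C_{\text{staged}} < C_{\text{single}} &\iff c_b < (1 - p)\,c_e
\end{aligned}
```

For example, with $p = 0.05$ and $c_e = 20\,c_b$, the staged cost is $N c_b + 0.05 \cdot 20\, N c_b = 2N c_b$ against $20 N c_b$ for the single exact pass, a tenfold reduction under these assumed costs.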

Failure modes and diagnostics

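GPUValidationError at bind group or buffer creation (binding too large). A single binding exceeded maxStorageBufferBindingSize, or a buffer exceeded maxBufferSize. Detection: WebGPU does not throw validation errors synchronously; wrap the call in pushErrorScope('validation') / popErrorScope(), or listen for the device's uncapturederror event. Fix: request higher limits at device creation, or split the dataset into chunked dispatches when one binding crosses the limit. An undersized index buffer does not raise a validation error; it shows up as the out-of-bounds scatter described next.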
Counter overflow / out-of-bounds scatter. If more features pass than the index buffer can hold, atomicAdd returns slots past the array end. WGSL does not crash on an out-of-bounds write: the implementation may discard it or redirect it to some in-bounds element, so the symptom is missing or overwritten survivors, not an error. Detection: read back the count and compare against the buffer capacity. Fix: allocate at worst case, or add an explicit if (write_pos < arrayLength(&valid_indices)) guard and treat a count above capacity as an overflow batch processed in a follow-up dispatch.
Stale survivors across frames. Forgetting to reset countBuffer to zero before a dispatch leaves the prior frame’s count, so new survivors scatter past the live region. Detection: survivor count grows monotonically frame over frame. Fix: zero the counter every dispatch (Step 5).
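OperationError on mapAsync. Mapping a buffer that lacks MAP_READ, or calling mapAsync on a buffer that is already mapped or has a map request pending (for example, a second readSurvivorCount() started before the first one unmapped). Detection: the returned promise rejects. Fix: read back through a dedicated MAP_READ | COPY_DST staging buffer, serialize readbacks or rotate several staging buffers, and unmap before the next submit that copies into it.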
Device lost mid-dispatch. A driver reset or a TDR timeout from an oversized dispatch invalidates every buffer and pipeline. Detection: device.lost resolves with a reason. Fix: recreate the device and re-upload, gated through the fallback routing strategies, and cap per-dispatch work so a single pass stays well inside the watchdog window.
Silently wrong predicate from misalignment. A host-side struct that interleaves an f32 with the vec4<f32> bounds shifts every element off its 16-byte stride, so the shader reads garbage extents and the filter returns nonsense without erroring. Detection: visually wrong survivor set with no validation message. Fix: keep bounds in their own tightly packed buffer and verify strides against memory alignment for spatial data buffers.
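
The overflow detection described in the list above (compare the read-back count against the index buffer's capacity) fits in a few lines. This is a sketch; `interpretCount` is a hypothetical name. It relies on the fact that atomicAdd still increments the counter for survivors whose write was guarded or dropped, so the excess over capacity is the number of lost survivors.

```typescript
// Hypothetical post-readback check: clamp the counter before trusting it as a length.
function interpretCount(
  rawCount: number,
  capacity: number,
): { usable: number; overflowed: boolean; dropped: number } {
  const overflowed = rawCount > capacity;
  return {
    usable: Math.min(rawCount, capacity),           // entries that are safe to read
    overflowed,                                      // true when survivors were lost
    dropped: overflowed ? rawCount - capacity : 0,   // how many need a follow-up dispatch
  };
}
```

A caller would pass the value returned by `readSurvivorCount()` and the number of u32 slots in `validIndices`, then schedule a follow-up dispatch when `overflowed` is true.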

Continue in this section

Optimizing Workgroup Sizes for Vector Geometry Filtering — profiling occupancy, atomic contention, and divergence to pick the dispatch dimensions for your device class and survival ratio.

Spatial Aggregation in GPU Memory — consume filtered survivors as input to density grids and clustered centroids without leaving VRAM.
Async Dispatch Patterns for Spatial Clustering — keep filter and clustering passes off the render-blocking path.
Optimization Flags for Compute Dispatches — pipeline and dispatch tuning that compounds with workgroup sizing.
Memory Alignment for Spatial Data Buffers — the alignment rules that keep SoA bounds buffers correct and coalesced.
WebGPU Compute vs Render Pipeline Fundamentals — where a filter pass sits relative to the rest of the GPU.
