Optimizing Workgroup Sizes for Vector Geometry Filtering

A vector geometry filter that evaluates a bounding-box or winding-number predicate over millions of primitives is almost always memory-bound, not ALU-bound. That means the single @workgroup_size constant you hard-code into the WGSL entry point decides whether the pass runs at memory bandwidth or stalls at a fraction of it. Pick a size that under-fills the hardware wavefront and lanes sit idle waiting on storage reads; pick one that over-subscribes the register file or workgroup memory and the scheduler keeps fewer workgroups resident, which collapses latency hiding. The exact sub-problem this page solves is choosing the x dimension of @workgroup_size (the WGSL counterpart of GLSL's local_size_x) and, for grid-indexed data, the 2D split, empirically per target GPU tier, rather than copying a number from a desktop NVIDIA tutorial that can quietly halve throughput on an Adreno or Apple GPU. The work happens inside the filter pass described in geometry filtering with WGSL compute shaders; here we tune its dispatch geometry.

Runnable reference: a timestamp-query sweep harness

The only defensible way to size a workgroup is to measure the actual pass on the actual silicon. The harness below builds one filter pipeline per candidate size, times every dispatch with a GPUQuerySet of type 'timestamp', and returns the median dispatch duration so you can pick the winner; taking the median keeps the first warm-up dispatches from skewing the result. It assumes the device was requested with the 'timestamp-query' feature, which timestamp query sets require. Because a compute pass is a distinct GPU program built on the compute versus render pipeline fundamentals, each size needs its own compiled pipeline: @workgroup_size is a pipeline-creation constant, not a dispatch argument.

typescript

// Sweep candidate workgroup sizes for a 1-D vector-geometry filter and
// return the median GPU-side duration (ns) for each. Spatial-data note:
// `bounds` is a tightly packed array of 16-byte records, one vec4<f32>
// AABB per primitive, so adjacent invocations read adjacent records (coalesced).
async function sweepWorkgroupSizes(
  device: GPUDevice,
  bounds: GPUBuffer,        // array<vec4<f32>>, primitiveCount entries
  primitiveCount: number,
  candidates: number[] = [32, 64, 128, 256],
  reps = 50,
): Promise<Map<number, number>> {
  const querySet = device.createQuerySet({ type: "timestamp", count: 2 });
  const resolve = device.createBuffer({
    size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
  });
  const readback = device.createBuffer({
    size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  // Output buffers the predicate writes into (sized once, reused per candidate).
  const survivors = device.createBuffer({
    size: primitiveCount * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const counter = device.createBuffer({
    size: 4, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });

  const results = new Map<number, number>();

  for (const size of candidates) {
    // Inject the candidate into WGSL as a compile-time literal.
    const module = device.createShaderModule({ code: filterWGSL(size) });
    const pipeline = device.createComputePipeline({
      layout: "auto",
      compute: { module, entryPoint: "filter_geometries" },
    });
    const bindGroup = device.createBindGroup({
      layout: pipeline.getBindGroupLayout(0),
      entries: [
        { binding: 0, resource: { buffer: bounds } },
        { binding: 1, resource: { buffer: survivors } },
        { binding: 2, resource: { buffer: counter } },
      ],
    });

    // Ceiling division pads the last workgroup; the shader bounds-checks it.
    const groups = Math.ceil(primitiveCount / size);
    const samples: number[] = [];

    for (let i = 0; i < reps; i++) {
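      const enc = device.createCommandEncoder();
      // Reset the survivor counter so every rep compacts from slot 0;
      // otherwise slots run past the end of `survivors` after the first rep.
      enc.clearBuffer(counter);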
      const pass = enc.beginComputePass({
        timestampWrites: {
          querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1,
        },
      });
      pass.setPipeline(pipeline);
      pass.setBindGroup(0, bindGroup);
      pass.dispatchWorkgroups(groups, 1, 1);
      pass.end();
      enc.resolveQuerySet(querySet, 0, 2, resolve, 0);
      enc.copyBufferToBuffer(resolve, 0, readback, 0, 16);
      device.queue.submit([enc.finish()]);

      await readback.mapAsync(GPUMapMode.READ);
      // Timestamp query results are unsigned 64-bit values.
      const ts = new BigUint64Array(readback.getMappedRange());
      samples.push(Number(ts[1] - ts[0])); // nanoseconds, GPU clock
      readback.unmap();
    }

    samples.sort((a, b) => a - b);
    results.set(size, samples[Math.floor(reps / 2)]); // median
  }
  return results;
}

// 1-D filter kernel templated on the workgroup size.
function filterWGSL(workgroupSize: number): string {
  return /* wgsl */ `
@group(0) @binding(0) var<storage, read>       bounds   : array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> survivors: array<u32>;
@group(0) @binding(2) var<storage, read_write> out_count: atomic<u32>;

@compute @workgroup_size(${workgroupSize}, 1, 1)
fn filter_geometries(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&bounds)) { return; } // padded-dispatch guard
  let b = bounds[i];                          // coalesced 16-byte read
  // Example viewport predicate (replace with your spatial test):
  let hit = b.z >= 0.0 && b.x <= 1.0 && b.w >= 0.0 && b.y <= 1.0;
  if (hit) {
    let slot = atomicAdd(&out_count, 1u);     // stream-compaction index
    survivors[slot] = i;
  }
}
`;
}

Run the sweep once at startup against a representative slice of the live dataset, cache the winning size in localStorage keyed by adapter.info.vendor and adapter.info.architecture, and compile the production pipeline with it. Re-run the sweep only when that key is missing from the cache.
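
The selection and caching step might look like the sketch below. It is not part of the harness above: `pickFastest`, `cacheKey`, `resolveWorkgroupSize`, the `KeyValueStore` interface, and the `wgsize:` key prefix are all hypothetical names chosen for illustration. The store interface mirrors the `getItem`/`setItem` subset of the Web Storage API, so `localStorage` can be passed in directly in a browser.

```typescript
// Hypothetical helpers: choose the fastest candidate from a sweep result and
// cache it per adapter.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface AdapterIdentity {
  vendor: string;
  architecture: string;
}

// Assumed key format; any stable string derived from the adapter info works.
function cacheKey(info: AdapterIdentity): string {
  return `wgsize:${info.vendor}/${info.architecture}`;
}

// Returns the candidate size with the smallest median duration.
function pickFastest(results: Map<number, number>): number {
  let best = -1;
  let bestNs = Infinity;
  results.forEach((ns, size) => {
    if (ns < bestNs) {
      best = size;
      bestNs = ns;
    }
  });
  if (best < 0) throw new Error("sweep returned no results");
  return best;
}

// Reads the cached winner, or runs the sweep and stores its winner on a miss.
async function resolveWorkgroupSize(
  store: KeyValueStore,
  info: AdapterIdentity,
  runSweep: () => Promise<Map<number, number>>,
): Promise<number> {
  const key = cacheKey(info);
  const cached = Number(store.getItem(key)); // Number(null) is 0, treated as a miss
  if (Number.isInteger(cached) && cached > 0) return cached;
  const winner = pickFastest(await runSweep());
  store.setItem(key, String(winner));
  return winner;
}
```

In a browser, `runSweep` would wrap a call to `sweepWorkgroupSizes` with the device and buffers already in hand.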

Parameter and configuration reference

Every tunable the harness and kernel expose, with guidance for spatial filtering workloads:

| Parameter | Typical value | Spatial-workload guidance |
| --- | --- | --- |
| `candidates` (sweep set) | `[32, 64, 128, 256]` | 32/64 favour mobile and integrated GPUs; 128/256 favour discrete desktop parts. Always include 64 as the safe floor: it matches one 64-lane AMD wavefront or two 32-lane NVIDIA warps. |
| `@workgroup_size(x)` | 128 (1-D default) | Multiples of 32 only; non-multiples waste lanes on the trailing partial warp. Start at 128 for contiguous geometry buffers. |
| 2-D split (`16×8`, `8×16`) | grid/tile data | Use only when primitives carry 2-D locality (rasterized tiles, grid-indexed partitions) so neighbour reads hit the same cache lines. |
| `reps` | 50 | The first 1–2 dispatches pay shader-warmup and allocation costs; the median over ≥50 samples discards them. |
| AABB record stride | 16 bytes (`vec4<f32>`) | A 128-byte cache line holds 8 records (a 64-byte line holds 4), so that many consecutive invocations share a fetch, which is the basis of coalescing. See memory alignment for spatial data buffers. |
| Workgroup count | `ceil(count / size)` | Ceiling division pads the dispatch; the in-shader `arrayLength` guard absorbs the overhang. |
| `var<workgroup>` budget | ≤ `maxComputeWorkgroupStorageSize` | The default limit is 16 KiB (16384 bytes); staging geometry into shared memory must leave driver headroom or workgroups spill and occupancy drops. |

The workgroup count is purely a function of primitive count and chosen size:

$$ \text{groups} = \left\lceil \frac{N_{\text{primitives}}}{\text{size}} \right\rceil $$
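
As a worked instance of the formula, the sketch below computes the group count and the number of padded invocations the in-shader guard has to reject. `planDispatch` and `DispatchPlan` are hypothetical names. The 65535 default is the WebGPU default for `maxComputeWorkgroupsPerDimension`; read the real value from `device.limits` rather than relying on it.

```typescript
// Hypothetical planning helper for a 1-D dispatch.
interface DispatchPlan {
  groups: number;            // workgroups to dispatch along x
  paddedLanes: number;       // invocations past the end that the guard rejects
  fitsOneDimension: boolean; // false when groups exceeds the per-dimension limit
}

function planDispatch(
  primitiveCount: number,
  size: number,
  maxGroupsPerDimension = 65535, // WebGPU default limit; check device.limits
): DispatchPlan {
  const groups = Math.ceil(primitiveCount / size);
  return {
    groups,
    paddedLanes: groups * size - primitiveCount,
    fitsOneDimension: groups <= maxGroupsPerDimension,
  };
}

// 1,000,003 primitives at size 64: 15,626 groups with 61 padded invocations.
console.log(planDispatch(1000003, 64));
// 10,000,000 primitives at size 64: 156,250 groups, over the default limit.
console.log(planDispatch(10000000, 64));
```

When `fitsOneDimension` is false, the 1-D `dispatchWorkgroups(groups, 1, 1)` call in the harness would fail validation at default limits, so a small workgroup size can be ruled out for very large datasets unless the dispatch is split across dimensions or batches.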

Pushing storage-buffer ceilings high enough to fit the AABB and survivor arrays is a device-limits concern — raise them as shown in configuring WebGPU adapter limits for large GeoJSON.

Failure modes specific to workgroup sizing

Atomic contention collapse. When most primitives pass the predicate, every invocation hits the same atomicAdd(&out_count, …), serializing wavefronts and cutting effective throughput by up to ~70%. Detection: the sweep shows duration barely improving (or worsening) as size grows, despite more parallelism. Fix: replace the global atomic with a workgroup-local counter plus a single atomicAdd per workgroup, or move to a prefix-sum (scan) compaction for high-pass-rate datasets.
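
The workgroup-local counter fix can be sketched as a variant of the kernel template. `filterWGSLGrouped` is a hypothetical name and the kernel has not been benchmarked here; it keeps the same three bindings as `filterWGSL`, so it should slot into the sweep harness unchanged. Survivor order within the output is not stable, as in the original kernel.

```typescript
// Hypothetical variant of the filter kernel: one global atomicAdd per
// workgroup instead of one per surviving primitive.
function filterWGSLGrouped(workgroupSize: number): string {
  return /* wgsl */ `
@group(0) @binding(0) var<storage, read>       bounds   : array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> survivors: array<u32>;
@group(0) @binding(2) var<storage, read_write> out_count: atomic<u32>;

var<workgroup> local_count: atomic<u32>; // zero-initialized per workgroup
var<workgroup> group_base : u32;         // this workgroup's first output slot

@compute @workgroup_size(${workgroupSize}, 1, 1)
fn filter_geometries(
  @builtin(global_invocation_id) gid: vec3<u32>,
  @builtin(local_invocation_index) lid: u32
) {
  let i = gid.x;
  // No early return: every invocation must reach both barriers.
  var hit = false;
  if (i < arrayLength(&bounds)) {
    let b = bounds[i];
    hit = b.z >= 0.0 && b.x <= 1.0 && b.w >= 0.0 && b.y <= 1.0;
  }
  var local_slot = 0u;
  if (hit) { local_slot = atomicAdd(&local_count, 1u); } // workgroup-memory atomic
  workgroupBarrier();
  if (lid == 0u) {
    // Reserve a contiguous run of output slots for the whole workgroup.
    group_base = atomicAdd(&out_count, atomicLoad(&local_count));
  }
  workgroupBarrier();
  if (hit) { survivors[group_base + local_slot] = i; }
}
`;
}
```

The padded-tail guard becomes an `if` around the read instead of an early `return`, because WGSL requires `workgroupBarrier()` to be reached in uniform control flow.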

Register-file exhaustion on mobile. A size of 256 that wins on desktop can run slower on an Adreno or Mali part because the predicate’s live registers exceed the per-lane file, forcing the scheduler to launch fewer workgroups. Detection: the per-vendor sweep shows the median climbing past 128 on mobile keys. Fix: never ship one global constant — honour the cached per-vendor winner; cap mobile candidates at 128.
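
Capping the candidate set before the sweep could look like the sketch below. `usableCandidates` and its `isMobile` flag are hypothetical; how an adapter gets classified as mobile is left to the caller. The two limit names are real `GPUSupportedLimits` fields, so `device.limits` can be passed in directly.

```typescript
// Subset of GPUSupportedLimits used here.
interface ComputeLimits {
  maxComputeWorkgroupSizeX: number;
  maxComputeInvocationsPerWorkgroup: number;
}

// Drops candidates the device cannot compile, then applies the mobile cap.
function usableCandidates(
  candidates: number[],
  limits: ComputeLimits,
  isMobile: boolean, // assumption: the caller decides how to classify the adapter
  mobileCap = 128,
): number[] {
  const hardCap = Math.min(
    limits.maxComputeWorkgroupSizeX,
    limits.maxComputeInvocationsPerWorkgroup,
  );
  const cap = isMobile ? Math.min(hardCap, mobileCap) : hardCap;
  return candidates.filter((size) => size <= cap);
}
```

Filtering against the hard limits first also avoids a pipeline-creation error for any candidate the device cannot accept.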

Uncoalesced reads from interleaved layout. Packing geometry as Array-of-Structures ({x,y,z,attr} interleaved) means lane i and lane i+1 read addresses a full struct apart, scattering the cache-line fetch. Detection: throughput stays far below the device’s memory bandwidth regardless of workgroup size. Fix: repack as Structure-of-Arrays so the coordinate each lane needs is contiguous.
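
A back-of-the-envelope model makes the cost of the interleaved layout concrete. The sketch below counts the distinct cache lines a group of consecutive lanes touches for a given record stride. `cacheLinesTouched` is a hypothetical helper, the 128-byte line size is an assumption that varies by GPU, and the 64-byte interleaved record is an invented example rather than a layout from this pipeline.

```typescript
// Counts distinct cache lines touched when `lanes` consecutive invocations
// each read `bytesRead` bytes from records spaced `strideBytes` apart.
function cacheLinesTouched(
  lanes: number,
  strideBytes: number,
  bytesRead: number,
  lineBytes = 128, // assumed cache-line size
): number {
  const lines = new Set<number>();
  for (let lane = 0; lane < lanes; lane++) {
    const first = lane * strideBytes;
    const last = first + bytesRead - 1;
    const firstLine = Math.floor(first / lineBytes);
    const lastLine = Math.floor(last / lineBytes);
    for (let line = firstLine; line <= lastLine; line++) lines.add(line);
  }
  return lines.size;
}

// Packed 16-byte AABBs: 32 lanes cover 512 contiguous bytes.
console.log(cacheLinesTouched(32, 16, 16)); // 4
// Same 16-byte AABB embedded in a 64-byte interleaved record.
console.log(cacheLinesTouched(32, 64, 16)); // 16
```

Under these assumptions the interleaved layout fetches four times as many lines for the same useful bytes, and no workgroup size recovers that bandwidth.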

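Out-of-bounds on the padded tail. Ceiling division over-dispatches the final workgroup whenever the primitive count is not a multiple of the workgroup size. Omitting the if (i >= arrayLength(&bounds)) return; guard does not raise a GPUValidationError: WebGPU's out-of-bounds rules let the read return zero or some other in-bounds element, so the overhang invocations test data that belongs to no primitive and can append garbage survivors. Detection: survivor count fluctuates run-to-run for a static dataset, or survivors contains indices at or beyond the primitive count. Fix: keep the bounds guard as the first statement of the entry point.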

Backend / Python interop note

The coalescing the chosen workgroup size relies on is only real if the buffer arrives packed correctly from the data tier. When AABBs are precomputed in a Python pipeline, emit them as a C-contiguous float32 array of shape (N, 4) so that the row stride matches the WGSL vec4<f32> 16-byte stride exactly:

python

import numpy as np
import geopandas as gpd

gdf = gpd.read_parquet("network.parquet").to_crs(3857)  # one planar CRS
b = gdf.geometry.bounds                                  # minx,miny,maxx,maxy
# Row-major (N, 4) float32: each row is one primitive's vec4<f32> AABB.
bounds = np.ascontiguousarray(
    b[["minx", "miny", "maxx", "maxy"]].to_numpy(dtype=np.float32)
)
bounds.tofile("bounds.f32")  # -> upload straight into the storage buffer

Reproject to a single projected CRS before computing bounds — mixing degrees and metres corrupts the predicate — and verify bounds.nbytes == primitiveCount * 16 so the host-side stride agrees with the shader’s view of the buffer. GeoParquet read via pyarrow/geopandas preserves this contiguity; a Python list-of-tuples does not, and will silently desync the alignment.
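
The same stride check can be repeated on the browser side before upload. The sketch below is a hypothetical helper (`checkBoundsBuffer` is not part of the harness) that assumes the file was written on a little-endian host and is read on one, which is what `tofile` and `Float32Array` give on typical hardware.

```typescript
// Validates a fetched bounds.f32 payload before it is written to the GPU.
function checkBoundsBuffer(data: ArrayBuffer, primitiveCount: number): Float32Array {
  const expected = primitiveCount * 16; // one vec4<f32> per primitive
  if (data.byteLength !== expected) {
    throw new Error(`expected ${expected} bytes, got ${data.byteLength}`);
  }
  const f = new Float32Array(data);
  for (let i = 0; i < primitiveCount; i++) {
    const o = i * 4;
    // Written as a negated <= so NaN bounds (e.g. from empty geometries) fail too.
    if (!(f[o] <= f[o + 2] && f[o + 1] <= f[o + 3])) {
      throw new Error(`record ${i} is not a valid minx,miny,maxx,maxy box`);
    }
  }
  return f;
}
```

A caller would then pass the returned array to `device.queue.writeBuffer(bounds, 0, f)`.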

Related

Geometry Filtering with WGSL Compute Shaders: the filter pass whose dispatch this page tunes
Using workgroup_id for Parallel Tile Processing: mapping the chosen workgroup geometry onto tiled spatial work
Reducing GPU Memory Fragmentation During Spatial Aggregation: where the compacted survivors flow next
Memory Alignment for Spatial Data Buffers: the 16-byte stride rules that make coalescing real
Browser Support & Fallback Routing Strategies: CPU/WebGL paths for devices without a usable timestamp-query or WebGPU
