Spatial Aggregation in GPU Memory

A choropleth or heatmap that re-bins on every pan is the canonical case where CPU aggregation falls apart. Take ten million GPS pings and a 1024×1024 density grid: the classic loop walks each point, hashes it to a cell, and increments a counter. That is an O(N) scatter the JavaScript thread cannot finish inside a 16 ms frame, and it competes with rendering for the same thread. Once the coordinate array outgrows the device's maxBufferSize (256 MiB by default), the buffer cannot even be created without a higher requested limit or chunking. The work this page covers is moving the whole reduction (the spatial hash, the atomic accumulation into bins, the normalization and kernel smoothing) onto the GPU as a chain of WGSL compute passes, so the binned grid is built in VRAM and bound straight to a render pass without ever materializing the per-point intermediate on the CPU.

This is one stage of Spatial Compute Shaders & Geometry Pipelines. Aggregation sits downstream of filtering: it ingests the compacted survivors emitted by geometry filtering with WGSL compute shaders and emits a dense grid or centroid table that the renderer draws directly. Because a compute pass is a distinct GPU program rather than a fragment shader pressed into service, the model here builds on the compute versus render pipeline fundamentals, and it assumes you have already negotiated a GPUDevice during WebGPU device initialization for GIS workloads.

Prerequisites

Before implementing the patterns below, you should have:

A valid GPUDevice with maxStorageBufferBindingSize and maxComputeWorkgroupStorageSize inspected against your grid dimensions — adapter negotiation is covered in initializing WebGPU devices for GIS workloads.
Working knowledge of WGSL atomic<T> types — only atomic<u32> and atomic<i32> exist; there is no atomic float, which dictates the fixed-point encoding shown later.
A Structure-of-Arrays input layout. Aggregation consumes a tightly packed array<vec2<f32>> of positions (plus an optional array<f32> of weights), not an interleaved struct; the memory alignment for spatial data buffers reference explains why interleaving breaks the packed 8-byte stride of the positions.
A browser that ships WebGPU — Chrome/Edge 113+, or Safari 18+ / Firefox behind their respective flags. Anything older must route through the browser support fallback strategies.
Coordinates already projected into a single linear space (Web Mercator metres or normalized tile space). Aggregation hashes in projected space; mixing geographic degrees with metres corrupts the cell index.

API and alignment reference

The fields and rules below govern every aggregation buffer. Atomic counters and the grid backing store are the constraints that bite first at GIS scale.

Field / rule	Value	Why it matters for aggregation
`GPUBufferUsage` for the grid	`STORAGE \| COPY_SRC \| COPY_DST`	`COPY_DST` clears the grid to zero each frame; `COPY_SRC` stages counts or centroids for optional readback.
WGSL `atomic<T>`	`u32` / `i32` only	No atomic float: accumulate weighted density as fixed-point `u32` (or `i32` when weights can be negative) and rescale on read.
`vec2<f32>` alignment	8-byte align, 8-byte size	A packed `array<vec2<f32>>` of positions stays coalesced; never pad it to `vec4`.
`f32` / `u32` alignment	4-byte align, 4-byte size	Per-cell counters in a flat `array<atomic<u32>>` pack with no implicit padding.
`var<workgroup>` budget	`maxComputeWorkgroupStorageSize` (≥ 16 KiB)	Bounds the size of a workgroup-local scratch grid used to absorb atomic contention.
`maxStorageBufferBindingSize`	≥ 128 MiB (default)	A 4096×4096 `u32` grid is 64 MiB — request a higher limit before exceeding it in one binding.
`@workgroup_size`	≤ `maxComputeInvocationsPerWorkgroup` (≥ 256)	256 is the portable default for the point-scatter pass; the grid passes tile by cell.
`maxBufferSize`	≥ 256 MiB (default)	Ten million `vec2<f32>` points is 80 MB (about 76 MiB) and fits the default; past roughly 33.5 million points the buffer exceeds it, so the limit must be requested or the data chunked.

For authoritative wording on storage access and atomic ordering, consult the W3C WebGPU specification and the WGSL specification.
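The limits in the table are defaults, so whether a given point count and grid size need a higher limit is simple arithmetic. The sketch below is a hypothetical helper (`requiredLimitsFor` is not part of WebGPU) that computes the `requiredLimits` dictionary you might pass to `adapter.requestDevice()`. It assumes `adapterLimits` is copied from `adapter.limits`, that positions and the grid each live in a single storage binding, and it only considers the two limits named here.

```typescript
// Hypothetical helper, not a WebGPU API. It only does the byte arithmetic
// behind the two buffer-size limits discussed in the table.
interface AggregationNeeds {
  pointCount: number; // number of vec2<f32> positions
  gridW: number;
  gridH: number;
}

interface LimitSubset {
  maxBufferSize: number;
  maxStorageBufferBindingSize: number;
}

// WebGPU default limits, in bytes.
const DEFAULT_LIMITS: LimitSubset = {
  maxBufferSize: 256 * 1024 * 1024,
  maxStorageBufferBindingSize: 128 * 1024 * 1024,
};

function requiredLimitsFor(
  needs: AggregationNeeds,
  adapterLimits: LimitSubset,
): Partial<LimitSubset> {
  const positionBytes = needs.pointCount * 8; // vec2<f32> = 8 bytes
  const gridBytes = needs.gridW * needs.gridH * 4; // one u32 per cell
  const largest = Math.max(positionBytes, gridBytes);

  const required: Partial<LimitSubset> = {};
  const keys = ["maxBufferSize", "maxStorageBufferBindingSize"] as const;
  for (const key of keys) {
    if (largest <= DEFAULT_LIMITS[key]) continue; // the default already covers it
    if (largest > adapterLimits[key]) {
      throw new RangeError(
        `${key}: need ${largest} bytes, adapter offers ${adapterLimits[key]}`,
      );
    }
    required[key] = largest;
  }
  return required;
}
```

With ten million points and a 1024×1024 grid the result is an empty object, meaning the defaults suffice. Twenty million points (160,000,000 bytes) stay under `maxBufferSize` but exceed the default storage binding size, so only that limit is requested.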

Implementation walkthrough

The pipeline is two passes over one shared grid buffer: pass one scatters points into per-cell accumulators with atomics; pass two reads the raw grid, normalizes it, and applies kernel smoothing into the output the renderer samples. Everything stays resident in VRAM.

Step 1 — Allocate the grid and clear it per frame

The grid is a flat array<atomic<u32>> of gridW * gridH cells. Allocate it once at the worst case and zero it each frame with clearBuffer rather than reallocating — mid-frame createBuffer calls are what fragment VRAM across zoom levels.

typescript

const gridW = 1024;
const gridH = 1024;
const cellCount = gridW * gridH;

// One u32 accumulator per cell; flat row-major layout.
const gridBuffer = device.createBuffer({
  size: cellCount * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});

// Grid extent in projected space, fed to the hash. vec4: min_x, min_y, max_x, max_y.
const extentUniform = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

function clearGrid(encoder: GPUCommandEncoder): void {
  encoder.clearBuffer(gridBuffer); // zero every cell before scatter; stale counts corrupt the frame
}

Step 2 — Hash points into cells and accumulate atomically

The scatter kernel runs one invocation per point. It maps a vec2<f32> position to a cell via a linear spatial hash, then claims the cell with atomicAdd. Because there is no atomic float, weighted density is encoded as fixed-point: multiply the weight by a fixed scale and add as u32 (or i32 when weights can be negative).

wgsl

@group(0) @binding(0) var<storage, read>       positions: array<vec2<f32>>;
@group(0) @binding(1) var<storage, read>       weights:   array<f32>;
@group(0) @binding(2) var<storage, read_write> grid:      array<atomic<u32>>;
@group(0) @binding(3) var<uniform>             extent:    vec4<f32>; // min_x,min_y,max_x,max_y

const GRID_W: u32 = 1024u;
const GRID_H: u32 = 1024u;
const WEIGHT_SCALE: f32 = 256.0; // fixed-point factor: encode f32 weight as u32

@compute @workgroup_size(256)
fn scatter(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&positions)) { return; } // guard the ragged tail workgroup

    let p = positions[i];
    let span = vec2<f32>(extent.z - extent.x, extent.w - extent.y);

    // Normalize to [0,1) then scale to cell coordinates. Points outside the
    // extent are dropped, not clamped, so the border bins are not inflated.
    let n = (p - extent.xy) / span;
    if (n.x < 0.0 || n.x >= 1.0 || n.y < 0.0 || n.y >= 1.0) { return; }

    let cx = u32(n.x * f32(GRID_W));
    let cy = u32(n.y * f32(GRID_H));
    let cell = cy * GRID_W + cx;                 // row-major flatten

    let amount = u32(weights[i] * WEIGHT_SCALE); // fixed-point weighted contribution
    atomicAdd(&grid[cell], amount);              // lock-free parallel accumulation
}

Dropping out-of-bounds points rather than clamping them is the spatial-data-specific choice: clamping piles every off-grid ping onto the edge cells and produces a bright false rim around the heatmap.
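When validating the scatter pass it helps to have a CPU reference for the same hash. The following is a minimal sketch, not the page's shipped code: `cellIndex` and `scatterReference` are hypothetical names, the arithmetic runs in JavaScript doubles rather than f32, so points sitting exactly on a cell boundary may land one cell away from the GPU result, and the final `Math.min` clamp is an extra rounding guard that the shader does not have.

```typescript
type Extent = readonly [number, number, number, number]; // min_x, min_y, max_x, max_y

// Row-major cell index for a point, or -1 when it lies outside the extent.
function cellIndex(
  x: number, y: number, extent: Extent, gridW: number, gridH: number,
): number {
  const nx = (x - extent[0]) / (extent[2] - extent[0]);
  const ny = (y - extent[1]) / (extent[3] - extent[1]);
  // Drop, never clamp. Written as a negated test so NaN is rejected too.
  if (!(nx >= 0 && nx < 1 && ny >= 0 && ny < 1)) return -1;
  // The min() guards against rounding up when the grid size is not a power of two.
  const cx = Math.min(gridW - 1, Math.floor(nx * gridW));
  const cy = Math.min(gridH - 1, Math.floor(ny * gridH));
  return cy * gridW + cx;
}

// CPU equivalent of the scatter kernel: fixed-point weighted sums per cell.
function scatterReference(
  positions: Float32Array, // packed x0, y0, x1, y1, ...
  weights: Float32Array,
  extent: Extent,
  gridW: number,
  gridH: number,
  weightScale = 256,
): Uint32Array {
  const grid = new Uint32Array(gridW * gridH);
  for (let i = 0; i < weights.length; i++) {
    const cell = cellIndex(positions[2 * i], positions[2 * i + 1], extent, gridW, gridH);
    if (cell < 0) continue;
    grid[cell] += Math.trunc(weights[i] * weightScale); // Uint32Array wraps like a u32
  }
  return grid;
}
```

Comparing `scatterReference` against a read-back of the GPU grid on a small fixture is a cheap way to catch a transposed axis or a wrong extent before scaling up.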

Step 3 — Cut atomic contention with workgroup-local scratch

At high density, thousands of points landing in the same cell serialize on one global atomic. For tiles small enough to fit maxComputeWorkgroupStorageSize, accumulate into a var<workgroup> scratch grid first, call workgroupBarrier(), then flush each non-empty scratch cell to global memory with one atomicAdd per workgroup. This trades many contended global atomics for cheap local ones plus a single global add per touched cell.

wgsl

const TILE: u32 = 64u; // 64*64 u32 scratch = 16 KiB, within the guaranteed budget
var<workgroup> scratch: array<atomic<u32>, TILE * TILE>;

@compute @workgroup_size(256)
fn scatter_tiled(@builtin(global_invocation_id) gid: vec3<u32>,
                 @builtin(local_invocation_index) lid: u32) {
    // Zero the local scratch cooperatively before any thread accumulates.
    for (var s = lid; s < TILE * TILE; s += 256u) {
        atomicStore(&scratch[s], 0u);
    }
    workgroupBarrier(); // all scratch cells zeroed before scatter begins

    let i = gid.x;
    if (i < arrayLength(&positions)) {
        // ... hash to a tile-local cell index `local_cell` (same hash as Step 2,
        //     restricted to this workgroup's spatial tile) ...
        let local_cell = hash_local(positions[i]);
        atomicAdd(&scratch[local_cell], u32(weights[i] * WEIGHT_SCALE));
    }
    workgroupBarrier(); // every local accumulation visible before the flush

    // Flush local sums to the global grid: one global atomic per non-empty cell.
    for (var s = lid; s < TILE * TILE; s += 256u) {
        let v = atomicLoad(&scratch[s]);
        if (v != 0u) {
            atomicAdd(&grid[global_index(s)], v);
        }
    }
}

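The two barriers are load-bearing: the first guarantees the scratch is zeroed before any thread accumulates, the second guarantees every local add is visible before the flush reads it. Skipping either yields nondeterministic undercounts. hash_local and global_index are left undefined here; the sketch assumes the points handled by one workgroup all fall inside that workgroup's 64×64 tile, which presumably means the input was sorted or bucketed by tile upstream. WGSL atomics use relaxed ordering, but each atomicAdd is indivisible and integer addition is commutative, so the flushed totals do not depend on the order in which invocations run, even under maximum contention.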
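The saving from the scratch grid can be illustrated on the CPU. The sketch below is a simulation, not GPU code: `scatterDirect` and `scatterTwoLevel` are hypothetical names, each "workgroup" is simply a run of 256 consecutive points, a `Map` stands in for the `var<workgroup>` scratch, and the point of the exercise is to count how many global adds each strategy would issue.

```typescript
interface ScatterStats {
  grid: Uint32Array;
  globalAtomics: number; // how many global atomicAdd calls the GPU would issue
}

// Direct scatter: one global add per point.
function scatterDirect(cells: Uint32Array, cellCount: number): ScatterStats {
  const grid = new Uint32Array(cellCount);
  for (let i = 0; i < cells.length; i++) grid[cells[i]] += 1;
  return { grid, globalAtomics: cells.length };
}

// Two-level scatter: sum locally per group, then one global add per
// non-empty local cell.
function scatterTwoLevel(
  cells: Uint32Array, cellCount: number, groupSize = 256,
): ScatterStats {
  const grid = new Uint32Array(cellCount);
  let globalAtomics = 0;
  for (let start = 0; start < cells.length; start += groupSize) {
    const local = new Map<number, number>(); // stands in for workgroup scratch
    const end = Math.min(start + groupSize, cells.length);
    for (let i = start; i < end; i++) {
      local.set(cells[i], (local.get(cells[i]) ?? 0) + 1);
    }
    local.forEach((sum, cell) => { // the flush
      grid[cell] += sum;
      globalAtomics++;
    });
  }
  return { grid, globalAtomics };
}
```

For 1,000 points that all hash to the same cell, the direct version issues 1,000 global adds and the two-level version issues 4, one per group, while both produce the same grid. When every point lands in a different cell the two counts are equal, which is why the gain tracks the skew of the data.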

Step 4 — Normalize and smooth in a second pass

The raw grid holds fixed-point sums. A second pass, dispatched one invocation per cell, divides out WEIGHT_SCALE, applies a Gaussian kernel to spread point mass into a continuous density field, and writes the result to the buffer or storage texture the renderer samples. A Gaussian kernel at radius $r$ weights neighbour cell $(dx, dy)$ by

$$ w(dx, dy) = \exp\!\left(-\frac{dx^2 + dy^2}{2\sigma^2}\right) $$

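normalized so the weights sum to one. Because the exponent splits into a $dx$ term and a $dy$ term, the kernel factors into a horizontal and a vertical 1D Gaussian; running it separably (horizontal then vertical) drops the cost from $O(r^2)$ to $O(r)$ per cell. The shader that follows uses the direct 2D form for clarity.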

wgsl

@group(0) @binding(0) var<storage, read>       grid_in:  array<u32>;
@group(0) @binding(1) var<storage, read_write> grid_out: array<f32>;

const GRID_W: u32 = 1024u;
const GRID_H: u32 = 1024u;
const WEIGHT_SCALE: f32 = 256.0;
const RADIUS: i32 = 4;
const SIGMA: f32 = 2.0;

@compute @workgroup_size(16, 16)
fn smooth(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x >= GRID_W || gid.y >= GRID_H) { return; } // guard the grid border

    var acc: f32 = 0.0;
    var norm: f32 = 0.0;
    let cx = i32(gid.x);
    let cy = i32(gid.y);

    for (var dy = -RADIUS; dy <= RADIUS; dy++) {
        for (var dx = -RADIUS; dx <= RADIUS; dx++) {
            let sx = cx + dx;
            let sy = cy + dy;
            if (sx < 0 || sx >= i32(GRID_W) || sy < 0 || sy >= i32(GRID_H)) { continue; }
            let w = exp(-f32(dx * dx + dy * dy) / (2.0 * SIGMA * SIGMA));
            let raw = f32(grid_in[u32(sy) * GRID_W + u32(sx)]) / WEIGHT_SCALE;
            acc += raw * w;
            norm += w;
        }
    }
    grid_out[gid.y * GRID_W + gid.x] = acc / max(norm, 1e-6); // renormalize by the in-bounds kernel mass so border cells are not darkened
}

A 2D dispatch (@workgroup_size(16, 16)) is the natural fit for a grid pass: each invocation owns one cell and reads a local neighbourhood, so neighbouring invocations in a workgroup read neighbouring addresses in grid_in. For production, split the symmetric kernel into two separable 1D passes over a scratch grid.
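The separable form can be checked on the CPU before it is ported to two WGSL passes. This is a minimal reference sketch under stated assumptions: `gaussianKernel1D`, `blur1D` and `smoothSeparable` are hypothetical names, the grid is a row-major `Float64Array` that has already been divided by WEIGHT_SCALE, and borders are handled by renormalizing over the in-bounds taps, as the 2D shader does.

```typescript
// Normalized 1D Gaussian taps for offsets -radius..radius.
function gaussianKernel1D(radius: number, sigma: number): Float64Array {
  const k = new Float64Array(2 * radius + 1);
  let sum = 0;
  for (let d = -radius; d <= radius; d++) {
    const w = Math.exp(-(d * d) / (2 * sigma * sigma));
    k[d + radius] = w;
    sum += w;
  }
  for (let i = 0; i < k.length; i++) k[i] /= sum; // weights sum to one
  return k;
}

// One 1D pass along rows (horizontal) or columns (vertical).
function blur1D(
  src: Float64Array, w: number, h: number, k: Float64Array, horizontal: boolean,
): Float64Array {
  const radius = (k.length - 1) / 2;
  const dst = new Float64Array(w * h);
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let acc = 0;
      let norm = 0;
      for (let d = -radius; d <= radius; d++) {
        const sx = horizontal ? x + d : x;
        const sy = horizontal ? y : y + d;
        if (sx < 0 || sx >= w || sy < 0 || sy >= h) continue; // skip off-grid taps
        acc += src[sy * w + sx] * k[d + radius];
        norm += k[d + radius];
      }
      dst[y * w + x] = acc / norm; // norm > 0: the centre tap is always in bounds
    }
  }
  return dst;
}

// Horizontal pass followed by vertical pass: 2(2r + 1) taps per cell
// instead of (2r + 1)^2.
function smoothSeparable(
  grid: Float64Array, w: number, h: number, radius: number, sigma: number,
): Float64Array {
  const k = gaussianKernel1D(radius, sigma);
  return blur1D(blur1D(grid, w, h, k, true), w, h, k, false);
}
```

With RADIUS = 4 that is 18 taps per cell against 81 for the direct kernel. Because the in-bounds region of a rectangular grid is itself a rectangle, renormalizing each 1D pass separately gives the same result as renormalizing the 2D kernel once.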

Step 5 — Reduce to clustered centroids (optional)

When the output is points rather than a field — clustered markers on a map — accumulate a weighted position sum and a count per cell, then divide. Because there is no atomic float, encode each coordinate sum as fixed-point i32 so it survives atomicAdd. The weighted centroid of cell $c$ is

$$ \bar{\mathbf{x}}_c = \frac{\sum_{i \in c} w_i \, \mathbf{x}_i}{\sum_{i \in c} w_i} $$

wgsl

@group(0) @binding(0) var<storage, read>       positions: array<vec2<f32>>;
@group(0) @binding(1) var<storage, read_write> sum_x:     array<atomic<i32>>;
@group(0) @binding(2) var<storage, read_write> sum_y:     array<atomic<i32>>;
@group(0) @binding(3) var<storage, read_write> counts:    array<atomic<u32>>;

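// Fixed-point scale. Assumes positions are small local values (normalized tile
// space, or metres relative to a nearby origin): absolute Web Mercator metres
// reach about 2e7, and 2e7 * 1000 already exceeds the i32 range for one point.
const COORD_SCALE: f32 = 1000.0;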

@compute @workgroup_size(256)
fn accumulate_centroids(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&positions)) { return; }
    let p = positions[i];
    let cell = hash_cell(p);                          // same hash as the density pass
    atomicAdd(&sum_x[cell], i32(p.x * COORD_SCALE));  // fixed-point coordinate sum
    atomicAdd(&sum_y[cell], i32(p.y * COORD_SCALE));
    atomicAdd(&counts[cell], 1u);
}

The kernel above is the unweighted case, $w_i = 1$, so the count buffer doubles as the weight total. A final divide pass turns the three buffers into a compact list of vec2<f32> centroids plus weights, which binds directly as instance attributes. The asynchronous orchestration that keeps this chain off the render-blocking path lives in async dispatch patterns for spatial clustering.
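For checking a read-back of the three accumulator buffers, the divide can be mirrored on the CPU. The sketch below is a hypothetical reference (`decodeCentroids` and the `Centroid` shape are illustrative names, not part of the pipeline) and assumes the unweighted accumulation, where the count is the denominator.

```typescript
interface Centroid {
  cell: number;  // row-major cell index
  x: number;     // centroid in the same units as the input positions
  y: number;
  count: number; // points that fell in the cell
}

// Turn fixed-point coordinate sums and counts into a compact centroid list.
function decodeCentroids(
  sumX: Int32Array,
  sumY: Int32Array,
  counts: Uint32Array,
  coordScale = 1000,
): Centroid[] {
  const out: Centroid[] = [];
  for (let cell = 0; cell < counts.length; cell++) {
    const n = counts[cell];
    if (n === 0) continue; // empty cell: nothing to emit, and avoids 0 / 0
    out.push({
      cell,
      x: sumX[cell] / coordScale / n, // undo the fixed-point scale, then average
      y: sumY[cell] / coordScale / n,
      count: n,
    });
  }
  return out;
}
```

Skipping empty cells is what makes the output compact: a 1024×1024 grid with a few thousand occupied cells yields a few thousand centroids, not a million.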

Step 6 — Hand off without a readback

The smoothed grid is already in VRAM, so bind it straight as a sampled texture or storage buffer for the render pass — never copy it to the CPU just to upload it again. Only read back when the CPU genuinely needs the values (cluster labels, a legend scale), and then through a dedicated staging buffer.

typescript

function aggregate(extent: Float32Array): void {
  device.queue.writeBuffer(extentUniform, 0, extent);

  const encoder = device.createCommandEncoder();
  clearGrid(encoder); // Step 1: zero before scatter

  const scatter = encoder.beginComputePass();
  scatter.setPipeline(scatterPipeline);
  scatter.setBindGroup(0, scatterBindGroup);
  scatter.dispatchWorkgroups(Math.ceil(pointCount / 256));
  scatter.end();

  const smooth = encoder.beginComputePass();
  smooth.setPipeline(smoothPipeline);
  smooth.setBindGroup(0, smoothBindGroup);
  smooth.dispatchWorkgroups(Math.ceil(gridW / 16), Math.ceil(gridH / 16));
  smooth.end();

  device.queue.submit([encoder.finish()]);
  // grid_out is now bindable as render input — no readback, no PCIe round-trip.
}

Recording both passes into one command encoder lets the driver schedule the scatter and the smooth back to back; the buffer dependency between them is honored without an explicit barrier in WebGPU’s submission model.
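One detail the 1D scatter dispatch depends on is the per-dimension workgroup cap, `maxComputeWorkgroupsPerDimension`, which defaults to 65535. The sketch below is a hypothetical helper (`dispatchShape` is an illustrative name) that computes the workgroup counts and folds the dispatch into two dimensions when one is not enough. It is only arithmetic; the scatter shader as written reads `gid.x` alone, so a folded dispatch would also need the shader to flatten its index, which is not shown here.

```typescript
// WebGPU default for maxComputeWorkgroupsPerDimension.
const MAX_WORKGROUPS_PER_DIMENSION = 65535;

// Returns [x, y] workgroup counts covering at least `pointCount` invocations.
function dispatchShape(
  pointCount: number,
  workgroupSize = 256,
  maxPerDim = MAX_WORKGROUPS_PER_DIMENSION,
): [number, number] {
  const groups = Math.ceil(pointCount / workgroupSize);
  if (groups <= maxPerDim) return [groups, 1]; // plain 1D dispatch

  // Fold into rows: y rows of x workgroups, with x * y >= groups.
  const y = Math.ceil(groups / maxPerDim);
  if (y > maxPerDim) {
    throw new RangeError(`${groups} workgroups do not fit a 2D dispatch`);
  }
  const x = Math.ceil(groups / y);
  return [x, y];
}
```

Ten million points need 39,063 workgroups of 256 and fit a 1D dispatch. At about 16.8 million points (65535 × 256) the single dimension runs out, and either a folded dispatch or several chunked dispatches becomes necessary.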

Memory and performance implications

A vec2<f32> position costs 8 bytes per point: 80 MB (about 76 MiB) at ten million points, before the optional 4-byte weight. The grid is independent of point count: a 1024×1024 u32 grid is 4 MiB and a 4096×4096 grid is 64 MiB. VRAM on the output side therefore scales with resolution, not data volume, which is what makes GPU aggregation cheap there even for huge inputs. Budget a second grid of equal size for the smoothed output and, for centroids, three accumulator buffers of cellCount entries each.
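The budget in the paragraph above reduces to a few multiplications. The following is a back-of-envelope sketch with a hypothetical helper name (`aggregationFootprint`); it counts only the buffers named in this section and ignores uniforms, staging buffers and driver overhead.

```typescript
interface Footprint {
  positions: number; // bytes
  weights: number;
  grid: number;
  smoothed: number;
  centroids: number;
  total: number;
}

function aggregationFootprint(
  pointCount: number,
  gridW: number,
  gridH: number,
  withWeights = true,
  withCentroids = false,
): Footprint {
  const cellCount = gridW * gridH;
  const positions = pointCount * 8;                 // vec2<f32>
  const weights = withWeights ? pointCount * 4 : 0; // f32
  const grid = cellCount * 4;                       // u32 accumulator
  const smoothed = cellCount * 4;                   // f32 output grid
  const centroids = withCentroids ? cellCount * 4 * 3 : 0; // sum_x, sum_y, counts
  return {
    positions, weights, grid, smoothed, centroids,
    total: positions + weights + grid + smoothed + centroids,
  };
}

// Convenience for logging.
function toMiB(bytes: number): number {
  return bytes / (1024 * 1024);
}
```

For ten million weighted points on a 1024×1024 grid the total is 128,388,608 bytes, about 122 MiB, of which the two grids account for only 8 MiB.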

The dominant cost is atomic contention, not arithmetic. With $N$ points over a grid of $C$ cells, the expected points per cell is $N/C$; when a few hot cells absorb a disproportionate share — a dense city centre against a sparse rural extent — those cells serialize and dictate the pass time. The workgroup-local scratch of Step 3 is the lever: it converts $k$ contended global atomics on a hot cell into $k$ local atomics plus one global flush, and the gain grows with the skew of the distribution. @workgroup_size(256) is the portable default for the scatter pass; the 16×16 two-dimensional shape fits the grid passes. Picking the size that actually maximizes occupancy for your survival ratio and device class is the subject of the dispatch-tuning work in optimization flags for compute dispatches.

Transfer cost is the other half. Uploading 80 MB of points with writeBuffer is not free; for streaming feeds, write only the changed chunk (writeBuffer at a byte offset, or copyBufferToBuffer from a staging buffer) and recycle the allocation across frames rather than reuploading the whole set. Keeping the grid on the GPU, sampled directly by the renderer, removes the single largest cost: the readback of the grid over the PCIe bus. When repeated reallocation across zoom levels does start fragmenting VRAM, the ring-buffer and compaction strategy in reducing GPU memory fragmentation during spatial aggregation keeps the footprint predictable.

Failure modes and diagnostics

GPUValidationError from an oversized grid binding. A 4096×4096 u32 grid is 64 MiB; two of them plus accumulators packed into one binding can cross maxStorageBufferBindingSize. Detection: WebGPU reports validation errors asynchronously, not at the call site, so wrap bind group creation in pushErrorScope("validation") / popErrorScope or listen for the device's uncapturederror event. Fix: request a higher limit at device creation, or tile the grid into multiple bindings and dispatch per tile.
Stale density across frames. Forgetting clearBuffer on the grid leaves the previous frame’s counts, so the heatmap only ever brightens. Detection: density grows monotonically and never decays on pan. Fix: zero the grid every frame before the scatter pass (Step 1).
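Atomic counter overflow. Fixed-point weights times a hot cell's point count can exceed the u32 range; the accumulator then wraps modulo 2^32 and the hottest cell reads as far colder than its neighbours. With WEIGHT_SCALE = 256 and unit weights the wrap arrives at roughly 16.7 million points in one cell. Detection: the densest region shows a dark hole. Fix: lower WEIGHT_SCALE, or add a saturating check around the atomicAdd (it returns the previous value, so a wrap is detectable). Pre-aggregating in workgroup scratch reduces contention but does not shrink the global total, so it does not prevent the wrap on its own.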
Bright false rim around the grid. Clamping out-of-bounds points to the nearest edge cell instead of dropping them piles every off-extent ping onto the border. Detection: a hot frame around an otherwise sparse map. Fix: drop points outside [0,1) in normalized space (Step 2), never clamp.
Nondeterministic undercounts. A missing workgroupBarrier() between scratch-zero and scatter, or between scatter and flush, lets some adds race the read. Detection: cell totals differ run to run for identical input. Fix: keep both barriers in the tiled scatter (Step 3).
Silently wrong bins from misalignment. Interleaving an f32 weight into the position buffer shifts every vec2<f32> off its 8-byte stride, so the hash reads garbage coordinates and bins points into the wrong cells without erroring. Detection: a plausible-looking but spatially scrambled heatmap. Fix: keep positions and weights in separate tightly packed buffers and verify strides against memory alignment for spatial data buffers.
Device lost mid-dispatch. An oversized grid pass can trip a driver TDR watchdog, invalidating every buffer and pipeline. Detection: device.lost resolves with a reason. Fix: recreate the device and re-upload through the fallback routing strategies, and cap per-dispatch work so a single pass stays inside the watchdog window.
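The overflow case in the list above can be budgeted ahead of time. This is a small arithmetic sketch with hypothetical helper names (`overflowHeadroom`, `safeWeightScale`), assuming a u32 accumulator and the truncating `u32(weight * WEIGHT_SCALE)` encoding used in the scatter kernel.

```typescript
const U32_MAX = 0xffffffff;

// How many points of weight `maxWeight` one cell can absorb before a u32
// accumulator wraps.
function overflowHeadroom(weightScale: number, maxWeight: number): number {
  const perPoint = Math.trunc(maxWeight * weightScale); // fixed-point contribution
  if (perPoint <= 0) return Infinity; // contributes nothing, can never wrap
  return Math.floor(U32_MAX / perPoint);
}

// Largest integer scale that keeps `maxPointsPerCell` points of weight
// `maxWeight` inside u32.
function safeWeightScale(maxWeight: number, maxPointsPerCell: number): number {
  return Math.floor(U32_MAX / maxPointsPerCell / maxWeight);
}
```

With the scale of 256 used on this page and unit weights, a cell holds 16,777,215 points before wrapping; with weights up to 100 that drops to 167,772. If the hottest cell is expected to exceed the headroom, `safeWeightScale` gives the precision that can be afforded instead.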

Continue in this section

Reducing GPU Memory Fragmentation During Spatial Aggregation — ring buffers, compaction passes, and GPUBuffer lifetime management that keep VRAM predictable as grids resize across zoom levels.

Geometry Filtering with WGSL Compute Shaders — the filtering stage that feeds compacted survivors into the aggregation passes here.
Async Dispatch Patterns for Spatial Clustering — keep the scatter, normalize, and centroid passes off the render-blocking path.
Optimization Flags for Compute Dispatches — dispatch and pipeline tuning that compounds with the workgroup sizing chosen here.
Memory Alignment for Spatial Data Buffers — the alignment rules that keep packed position and weight buffers correct and coalesced.
deck.gl Layer Integration with WebGPU — binding aggregated grids and centroids as layer data without leaving VRAM.