Reducing GPU Memory Fragmentation During Spatial Aggregation

The exact sub-problem here is allocator churn: a density-grid or centroid pipeline that calls createBuffer and destroy every time the map zooms, the viewport extent changes, or a new tile batch streams in. Each zoom level wants a differently sized grid (512×512 at zoom 6, 4096×4096 at zoom 12), and each variable-length point batch wants a differently sized input buffer. When those allocations are created and freed at the natural cadence of panning, the heap that the driver manages beneath WebGPU accumulates non-contiguous free regions between still-live buffers. The visible symptoms are a creeping rise in createBuffer latency, a fragmentation index (one minus requested bytes over reserved bytes) that climbs past 0.35, and eventually device loss mid-dispatch, reported when the GPUDevice.lost promise resolves, because a large grid can no longer find a contiguous span even though total free VRAM looks sufficient. This page is the buffer-lifetime companion to spatial aggregation in GPU memory: the aggregation math stays the same, but the buffers backing it are pooled, recycled, and sub-allocated instead of being created and thrown away each frame.

The fix is to stop letting the driver allocator see your churn at all. Allocate a small set of long-lived GPUBuffer objects once, size them to the worst case for each role, and sub-allocate aggregation regions out of them with explicit alignment. A density grid is independent of point count — it scales with resolution, not data volume — so the maximum grid for the deepest zoom can be reserved up front and reused at every shallower level. Variable-length point batches are absorbed by a ring of fixed-capacity staging buffers rather than a fresh allocation per batch.
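Sub-allocating with explicit alignment can be sketched as a bump allocator that only tracks byte offsets. This is a minimal illustration, not part of the pool shown later: `BumpSubAllocator`, `Region`, and the 256-byte default are assumptions made for the example, and the backing GPUBuffer is assumed to be created elsewhere.

```typescript
// Hypothetical helper: a bump sub-allocator over one long-lived buffer.
// It tracks byte offsets only; it never touches the GPU.
interface Region {
  offset: number; // aligned byte offset into the backing buffer
  size: number;   // requested byte length (unpadded)
}

class BumpSubAllocator {
  private cursor = 0;

  constructor(
    private readonly capacity: number, // byte size of the backing buffer
    private readonly align = 256,      // assumed offset alignment
  ) {}

  // Reserve `size` bytes at the next aligned offset, or return null if full.
  alloc(size: number): Region | null {
    const offset = Math.ceil(this.cursor / this.align) * this.align;
    if (offset + size > this.capacity) return null;
    this.cursor = offset + size;
    return { offset, size };
  }

  // Drop every region at once, for example at the start of a frame.
  reset(): void {
    this.cursor = 0;
  }
}

const subAllocator = new BumpSubAllocator(1024);
const regionA = subAllocator.alloc(100); // starts at offset 0
const regionB = subAllocator.alloc(300); // padded up to offset 256
const regionC = subAllocator.alloc(600); // next aligned offset is 768, so it does not fit
```

Because regions are released together by `reset()`, the backing buffer cannot fragment internally; the trade-off is that individual regions cannot be freed early.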

Runnable reference implementation

The pool below manages two buffer classes: a fixed set of grid buffers sized to power-of-two cell counts, and a ring of staging buffers for streamed point batches. Both recycle their backing GPUBuffer across frames, so the driver allocator is touched only at startup. Sub-allocation enforces 256-byte alignment because 256 is the default minStorageBufferOffsetAlignment, the most conservative value a device can report, and bind-group offsets must be a multiple of the device's value; the memory alignment for spatial data buffers reference covers the stride rules that sit on top of it.

typescript

// Round a byte length up to the next multiple of `align`. Plain arithmetic
// avoids the 32-bit truncation that bitwise operators apply to JS numbers.
const alignUp = (n: number, align: number): number => Math.ceil(n / align) * align;

interface GridSlot {
  buffer: GPUBuffer;   // long-lived backing buffer for this resolution bucket
  cells: number;       // capacity in u32 cells
  inUse: boolean;
}

class AggregationPool {
  private grids: GridSlot[] = [];
  private staging: GPUBuffer[] = [];     // ring of fixed-capacity point buffers
  private ringHead = 0;
  // 256-byte alignment satisfies minStorageBufferOffsetAlignment on all tiers.
  private readonly OFFSET_ALIGN = 256;

  constructor(
    private device: GPUDevice,
    gridBuckets: number[],               // e.g. [512*512, 1024*1024, 4096*4096]
    private stagingBytes: number,        // worst-case point batch, e.g. 96 MiB
    ringDepth: number,                   // in-flight batches (commonly 2-3)
  ) {
    // Reserve one grid buffer per resolution bucket, largest first so the
    // biggest contiguous span is claimed before the heap fragments.
    for (const cells of [...gridBuckets].sort((a, b) => b - a)) {
      this.grids.push({
        buffer: device.createBuffer({
          size: alignUp(cells * 4, this.OFFSET_ALIGN),
          usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
        }),
        cells,
        inUse: false,
      });
    }
    // A ring of identical staging buffers absorbs variable-length batches
    // without ever calling createBuffer at frame cadence.
    for (let i = 0; i < ringDepth; i++) {
      this.staging.push(device.createBuffer({
        size: alignUp(stagingBytes, this.OFFSET_ALIGN),
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
      }));
    }
  }

  // Claim the smallest grid that fits the requested cell count for this zoom.
  acquireGrid(cellCount: number): GridSlot {
    let best: GridSlot | undefined;
    for (const slot of this.grids) {
      if (!slot.inUse && slot.cells >= cellCount &&
          (!best || slot.cells < best.cells)) {
        best = slot;
      }
    }
    if (!best) throw new Error(`no grid bucket fits ${cellCount} cells`);
    best.inUse = true;
    return best;
  }

  releaseGrid(slot: GridSlot): void {
    slot.inUse = false;               // return to pool; never destroy()
  }

  // Hand out the next staging buffer in the ring. Depth must exceed the
  // number of batches the GPU can have in flight, or a batch overwrites
  // data the GPU is still reading.
  nextStaging(): GPUBuffer {
    const buf = this.staging[this.ringHead];
    this.ringHead = (this.ringHead + 1) % this.staging.length;
    return buf;
  }

  // Fragmentation index: requested bytes vs. reserved bytes. Stays flat
  // because nothing is ever freed mid-session.
  fragmentationIndex(requestedBytes: number): number {
    const reserved = this.grids.reduce((s, g) => s + g.buffer.size, 0) +
                     this.staging.reduce((s, b) => s + b.size, 0);
    return reserved > 0 ? 1 - requestedBytes / reserved : 0;
  }

  destroy(): void {
    for (const g of this.grids) g.buffer.destroy();
    for (const b of this.staging) b.destroy();
  }
}

Driving aggregation through the pool replaces every per-frame createBuffer/destroy pair with an acquire/release against pre-reserved memory. The grid is cleared with clearBuffer, not reallocated, between frames:

typescript

function aggregateFrame(
  device: GPUDevice,
  pool: AggregationPool,
  points: Float32Array,    // packed vec2<f32> positions for this batch
  gridW: number,
  gridH: number,
  pipeline: GPUComputePipeline,
  bindFor: (grid: GPUBuffer, src: GPUBuffer) => GPUBindGroup,
): GridSlot {
  const grid = pool.acquireGrid(gridW * gridH);
  const src = pool.nextStaging();

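  // Upload into the recycled staging buffer; no allocation occurs here.
  // Guard first: a batch larger than the buffer would fail validation.
  if (points.byteLength > src.size) throw new RangeError("batch exceeds staging capacity");
  device.queue.writeBuffer(src, 0, points);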

  const enc = device.createCommandEncoder();
  enc.clearBuffer(grid.buffer);                 // zero reused cells, do not realloc
  const pass = enc.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindFor(grid.buffer, src));
  pass.dispatchWorkgroups(Math.ceil(points.length / 2 / 256));
  pass.end();
  device.queue.submit([enc.finish()]);

  return grid; // caller binds it to a render pass, then calls pool.releaseGrid()
}
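The dispatch size in `aggregateFrame` assumes the workgroup count fits in one dimension. A small guard makes that assumption explicit. The helper name `workgroupsFor` is hypothetical; 65535 is the default `maxComputeWorkgroupsPerDimension` in WebGPU, and a real device may report more in `device.limits`.

```typescript
// Default WebGPU value of maxComputeWorkgroupsPerDimension; pass the value
// from device.limits instead when a device is available.
const DEFAULT_MAX_WORKGROUPS_PER_DIM = 65535;

// Number of workgroups needed to cover `pointCount` invocations in one dimension.
function workgroupsFor(
  pointCount: number,
  workgroupSize = 256,
  maxPerDim = DEFAULT_MAX_WORKGROUPS_PER_DIM,
): number {
  const groups = Math.ceil(pointCount / workgroupSize);
  if (groups > maxPerDim) {
    throw new RangeError(
      `${groups} workgroups exceeds the per-dimension limit of ${maxPerDim}; split the batch`,
    );
  }
  return groups;
}
```

With a workgroup size of 256 and the default limit, one dispatch covers at most 65535 × 256 = 16,776,960 points, so a 10 M point batch fits while a 20 M point batch would need to be split or dispatched in two dimensions.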

For transient scratch buffers whose initial contents you control, such as a prefix-sum table or a compaction counter array, skip the separate writeBuffer by mapping at creation. WebGPU already zero-fills new buffers, so this only pays off when the initial values are non-zero; the sentinel below is illustrative:

typescript

const counters = device.createBuffer({
  size: alignUp(cellCount * 4, 256),
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,            // write initial values without writeBuffer
});
// New buffers start zeroed; seed a non-zero "unclaimed" sentinel instead.
new Uint32Array(counters.getMappedRange()).fill(0xffffffff);
counters.unmap();

When fragmentation does build up inside a single long-lived buffer, for example when survivors from geometry filtering with WGSL compute shaders leave gaps as cells empty out on pan, run a compaction pass rather than reallocating. A WGSL kernel with an atomicAdd write pointer packs the live cells into a contiguous prefix of a second pooled buffer, so no allocation is created or freed. The cursor must be reset to zero before each pass, and the output order follows invocation scheduling rather than the original cell order:

wgsl

@group(0) @binding(0) var<storage, read>       cells:    array<u32>;        // sparse, with gaps
@group(0) @binding(1) var<storage, read_write> packed:   array<u32>;        // contiguous output
@group(0) @binding(2) var<storage, read_write> writePos: atomic<u32>;       // shared cursor

@compute @workgroup_size(256)
fn compact(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&cells)) { return; }   // guard the ragged tail
    let v = cells[i];
    if (v == 0u) { return; }                    // skip empty cells, no gap retained
    let dst = atomicAdd(&writePos, 1u);         // claim a contiguous slot
    packed[dst] = v;                            // pack survivors contiguously; order is not preserved
}

This is a render-independent compute program, so it stays off the draw path; the distinction matters when scheduling it, and the compute versus render pipeline fundamentals reference explains why a compute pass can run while the renderer keeps painting. Sequencing compaction so it never blocks a frame is the job of the async dispatch patterns for spatial clustering.
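When validating the compaction kernel against a readback, a CPU reference is handy. The sketch below assumes the packed output has been copied back into a `Uint32Array`; both function names are hypothetical. The comparison ignores order because the slot each invocation claims through atomicAdd depends on scheduling.

```typescript
// CPU reference for the compaction kernel: keep non-zero cells, drop zeros.
function compactReference(cells: Uint32Array): Uint32Array {
  return cells.filter((v) => v !== 0);
}

// Order-insensitive comparison of two packed results.
function sameMultiset(a: Uint32Array, b: Uint32Array): boolean {
  if (a.length !== b.length) return false;
  const sortedA = Array.from(a).sort((x, y) => x - y);
  const sortedB = Array.from(b).sort((x, y) => x - y);
  return sortedA.every((v, i) => v === sortedB[i]);
}
```

Only the first `writePos` entries of the GPU output are meaningful, so slice the readback to that length before comparing.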

Parameter and configuration reference

Every tunable value in the implementation above, with guidance for geospatial workloads.

Parameter	Typical value	Spatial-workload guidance
`gridBuckets`	`[512², 1024², 4096²]`	One power-of-two bucket per zoom band you render; the deepest zoom sets the largest reservation. A 4096² `u32` grid is 64 MiB.
`stagingBytes`	80–96 MiB	Worst-case point batch: 10 M `vec2<f32>` positions is 80 MB (about 76 MiB). Size to the densest tile, not the average.
`ringDepth`	2–3	Must exceed concurrent in-flight batches or a staging buffer is overwritten while the GPU still reads it.
`OFFSET_ALIGN`	256	Default `minStorageBufferOffsetAlignment`, and the largest value a device may report; bind-group offsets must be a multiple of the device's value. Query `device.limits` to confirm.
`@workgroup_size`	256	Multiple of warp/wavefront width (32 on NVIDIA, 32 or 64 on AMD). Cap at `maxComputeWorkgroupSizeX` and `maxComputeInvocationsPerWorkgroup`.
Fragmentation index target	< 0.30	`1 − requestedBytes / reservedBytes`. Over 0.35 means buckets are oversized or zoom bands too coarse.
`maxStorageBufferBindingSize`	≥ 128 MiB	The default limit is 128 MiB per binding. A 4096² grid and a smoothed copy bound as one range total exactly that; request a higher limit or tile the grid for anything larger.
Compaction trigger	gap ratio > 0.4	Run `compact` when more than ~40% of a buffer’s cells are empty, not every frame.

The 256-byte sub-allocation alignment is the constraint that bites first: a bind group whose buffer offset is not a multiple of minStorageBufferOffsetAlignment raises a GPUValidationError at bind time, so alignUp must wrap every offset, not just buffer sizes. For authoritative wording on offset alignment and buffer usage flags, consult the W3C WebGPU specification.
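The table's limits can be checked before the pool is built. The following is a sketch under stated assumptions: `LimitsLike` is a local stand-in for the two `device.limits` fields involved, `PoolConfig`, `checkConfig`, and `shouldCompact` are hypothetical names, and the live-cell count is assumed to come from a counter the application already maintains.

```typescript
// Local stand-in for the two GPUSupportedLimits fields used here, so the
// sketch compiles without WebGPU type definitions.
interface LimitsLike {
  minStorageBufferOffsetAlignment: number;
  maxStorageBufferBindingSize: number;
}

// Hypothetical config shape mirroring the pool's constructor arguments.
interface PoolConfig {
  gridBuckets: number[]; // capacity of each bucket in u32 cells
  stagingBytes: number;
  offsetAlign: number;
}

// Return human-readable problems; an empty list means the config fits.
function checkConfig(cfg: PoolConfig, limits: LimitsLike): string[] {
  const problems: string[] = [];
  if (cfg.offsetAlign % limits.minStorageBufferOffsetAlignment !== 0) {
    problems.push(
      `offsetAlign ${cfg.offsetAlign} is not a multiple of ${limits.minStorageBufferOffsetAlignment}`,
    );
  }
  for (const cells of cfg.gridBuckets) {
    const bytes = cells * 4;
    if (bytes > limits.maxStorageBufferBindingSize) {
      problems.push(`grid bucket of ${cells} cells needs ${bytes} bytes in one binding`);
    }
  }
  if (cfg.stagingBytes > limits.maxStorageBufferBindingSize) {
    problems.push(`stagingBytes ${cfg.stagingBytes} exceeds the binding limit`);
  }
  return problems;
}

// Compaction trigger from the table: run when the empty share passes the threshold.
function shouldCompact(liveCells: number, totalCells: number, threshold = 0.4): boolean {
  if (totalCells === 0) return false;
  return 1 - liveCells / totalCells > threshold;
}
```

Against the WebGPU defaults (256 and 134217728), the three buckets and a 96 MiB staging buffer pass, while an 8192² bucket would be flagged because it needs 256 MiB in a single binding.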

Failure modes specific to fragmentation control

Heap exhaustion despite free VRAM. Per-frame createBuffer/destroy across zoom levels leaves the heap riddled with gaps; a large grid request then fails to find a contiguous span even though total free memory exceeds it. Detection: createBuffer latency climbs frame over frame and the fragmentation index passes 0.35. Fix: route all grid and point allocations through the pool so nothing is freed mid-session.
Ring buffer overwrite (stale reads). A ringDepth smaller than the number of in-flight batches lets nextStaging hand back a buffer the GPU is still reading from the previous submission, corrupting the new frame’s input. Detection: intermittent garbage cells that vanish when panning slows. Fix: raise ringDepth above the in-flight batch count and gate new batches on queue.onSubmittedWorkDone().
Misaligned sub-allocation offset. Packing a region at an offset that is not a multiple of 256 bytes raises GPUValidationError: Offset ... is not a multiple of minStorageBufferOffsetAlignment. Detection: a validation error from createBindGroup for static offsets, or from setBindGroup for dynamic offsets, delivered through an error scope or the uncapturederror event rather than as a thrown exception. Fix: apply alignUp(offset, OFFSET_ALIGN) to every sub-allocated offset, padding the preceding region.
Oversized bucket inflating the fragmentation index. Reserving a 4096² grid for a session that never zooms past 1024² leaves the whole 64 MiB bucket idle and pushes the index up even though there is zero churn. Detection: high fragmentation index with flat createBuffer latency. Fix: trim gridBuckets to the zoom bands actually rendered, or reserve the largest bucket lazily on first deep-zoom.
Device lost on an oversized compaction. A compaction dispatch over a 64 MiB grid can trip the driver’s TDR watchdog, invalidating every pooled buffer at once. Detection: device.lost resolves mid-pan. Fix: cap per-dispatch work and re-acquire the device through the browser support fallback routing strategies, rebuilding the pool against the fresh device created during WebGPU device initialization for GIS workloads.
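The gating described for the ring-overwrite case can be sketched as a small bookkeeping class that refuses to hand out a slot until its previous submission has finished. `InFlightGate` is a hypothetical helper, not part of the pool above, and it tracks slot indices only.

```typescript
// Hypothetical gate that tracks which ring slots still have GPU work pending.
class InFlightGate {
  private busy: boolean[];
  private head = 0;

  constructor(depth: number) {
    this.busy = new Array<boolean>(depth).fill(false);
  }

  // Return the next slot index, or null if that slot is still in flight.
  tryAcquire(): number | null {
    if (this.busy[this.head]) return null;
    const slot = this.head;
    this.busy[slot] = true;
    this.head = (this.head + 1) % this.busy.length;
    return slot;
  }

  // Call once the submission that used `slot` has completed.
  release(slot: number): void {
    this.busy[slot] = false;
  }
}
```

In a real frame loop, one way to wire this up is to call `release(slot)` from the promise returned by `device.queue.onSubmittedWorkDone()` after the submit that consumed the slot, and to skip or defer a batch whenever `tryAcquire()` returns null.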

Backend and Python interop note

The pool only stays fragmentation-free if the Python side produces batches that fit the reserved stagingBytes without re-sizing it. Pre-tile the dataset server-side so no single batch exceeds the staging capacity, and emit positions as a contiguous, C-ordered float32 array — a packed vec2<f32> layout — so a writeBuffer lands directly in the recycled buffer with no client-side repacking:

python

import geopandas as gpd
import numpy as np

# Read pre-projected points (Web Mercator metres) from a GeoParquet tile.
gdf = gpd.read_parquet("tile_z12_x655_y1583.parquet")

# Pack as interleaved x,y float32 — contiguous vec2<f32> for the GPU staging buffer.
xy = np.empty((len(gdf), 2), dtype=np.float32)
xy[:, 0] = gdf.geometry.x.to_numpy()
xy[:, 1] = gdf.geometry.y.to_numpy()
xy = np.ascontiguousarray(xy)            # guarantee a gap-free row-major buffer

assert xy.nbytes <= 96 * 1024 * 1024, "batch exceeds reserved stagingBytes"
payload = xy.tobytes()                    # ships straight into nextStaging()

Two backend-side rules keep the GPU pool stable. First, project on the server: aggregation hashes in a single linear space, and mixing geographic degrees with metres corrupts cell indices regardless of how clean the buffer layout is. Second, cap each GeoParquet tile’s point count to the stagingBytes budget so the ring never needs a larger buffer; if a dense tile would overflow, split it into sub-tiles rather than growing the reservation. Tuning the dispatch that consumes these batches (workgroup occupancy, indirect dispatch, timestamp profiling) is covered in optimization flags for compute dispatches. For Python-hosted rendering servers, the wgpu-py bindings expose the same buffer operations under snake_case names (create_buffer, write_buffer), so this pool ports directly to a headless backend.
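On the receiving side, the same budget can be enforced before the upload. This sketch assumes the payload arrives as an `ArrayBuffer` (for example from a fetch response) and that both producer and host are little-endian; `viewPoints` is a hypothetical helper name.

```typescript
// Hypothetical client-side check for a payload produced by the Python packer.
function viewPoints(payload: ArrayBuffer, stagingBytes: number): Float32Array {
  if (payload.byteLength > stagingBytes) {
    throw new RangeError(
      `batch of ${payload.byteLength} bytes exceeds stagingBytes (${stagingBytes})`,
    );
  }
  // Each position is two f32 values, so a valid payload is a multiple of 8 bytes.
  if (payload.byteLength % 8 !== 0) {
    throw new RangeError("payload is not a whole number of vec2<f32> positions");
  }
  // Zero-copy view over the received bytes.
  return new Float32Array(payload);
}
```

The returned view can be handed to the upload path as the `points` argument, with `points.length / 2` giving the point count used for the dispatch size.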

Related

Spatial Aggregation in GPU Memory: the binning and centroid passes whose buffers this pool keeps fragmentation-free.
Geometry Filtering with WGSL Compute Shaders: the compaction pattern that leaves gaps this page reclaims.
Async Dispatch Patterns for Spatial Clustering: scheduling compaction and recycling off the render-blocking path.
Optimization Flags for Compute Dispatches: workgroup sizing that compounds with the pool’s reuse.
Memory Alignment for Spatial Data Buffers: the alignment rules behind sub-allocation offsets.
