Syncing Cesium 3D Tiles with WebGPU Compute Buffers

The specific sub-problem here is the hand-off boundary: a Cesium 3D Tile arrives as a compressed binary blob (b3dm, i3dm, or pnts), and a WGSL compute pass needs to read its vertices as a correctly strided storage buffer in the same animation frame, without a main-thread stall and without a read-after-write hazard against the previous frame’s results. Get the record stride wrong and every vertex after the first folds into garbage; submit the copy and the dispatch in the wrong order and the compute pass reads last frame’s bytes; skip the completion fence and you recycle a tile on the CPU while the GPU is still transforming it. This page is the end-to-end contract for that hand-off: the buffer usage flags, the copy ordering, the WGSL LOD kernel, the queue-submission fence, and the Python packer that produces bytes the GPU can copy straight into a storage buffer. It is one stage of CesiumJS mapping pipeline optimization, and it assumes you have already acquired a device per WebGPU device initialization for GIS workloads.

Runnable reference implementation

The flow below is the whole hand-off for a single tile within one frame: stage the decoded payload, copy it into device-local storage, dispatch the LOD-selection kernel against a compute pipeline, then fence completion before the CPU touches the tile again. Every byte-offset choice is driven by WebGPU’s memory alignment rules for spatial data buffers.

typescript

// Cesium tile payload, already decoded to interleaved vec4<f32> positions.
// Positions are promoted to vec4 (xyz + 1.0 in w) so each record is exactly
// 16 bytes — the vec3<f32> 12-vs-16 trap is removed before it reaches the GPU.
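// 256 bytes is the default minStorageBufferOffsetAlignment, so buffers rounded
// to it can be sub-allocated and bound at any slot boundary. copyBufferToBuffer
// itself only requires offsets and sizes to be multiples of 4.
const ALIGN = 256;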
const align256 = (n: number): number => Math.ceil(n / ALIGN) * ALIGN;

interface TileBuffers {
  storage: GPUBuffer;   // device-local compute input
  lodA: GPUBuffer;      // double-buffered output (even frames)
  lodB: GPUBuffer;      // double-buffered output (odd frames)
  vertexCount: number;
}

function uploadTile(device: GPUDevice, payload: ArrayBuffer): GPUBuffer {
  // Staging buffer: CPU-writable, copy source only. mappedAtCreation and
  // copyBufferToBuffer both need a size that is a multiple of 4; rounding to
  // 256 satisfies that and keeps the slot size uniform.
  const staging = device.createBuffer({
    size: align256(payload.byteLength),
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Uint8Array(staging.getMappedRange()).set(new Uint8Array(payload));
  staging.unmap();

  // Device-local storage: the compute pass reads this, never the staging buffer.
  const storage = device.createBuffer({
    size: align256(payload.byteLength),
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });

  const enc = device.createCommandEncoder();
  enc.copyBufferToBuffer(staging, 0, storage, 0, storage.size);
  device.queue.submit([enc.finish()]);
  staging.destroy(); // safe after submit: the already-submitted copy still completes
  return storage;
}

// Per-frame: write transform -> dispatch LOD pass -> fence before eviction.
async function syncFrame(
  device: GPUDevice,
  pipeline: GPUComputePipeline,
  tiles: TileBuffers,
  transformUBO: GPUBuffer, // 64-byte buffer created with UNIFORM | COPY_DST usage
  transform: Float32Array, // 16 f32, column-major mat4x4
  frameParity: number,
): Promise<GPUBuffer> {
  const lodOut = frameParity % 2 === 0 ? tiles.lodA : tiles.lodB;

  // writeBuffer is a queue operation, so it is ordered ahead of any command
  // buffer submitted after it. It MUST come before the dispatch's submit.
  device.queue.writeBuffer(transformUBO, 0, transform);

  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
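      // Bind only the real records (16 bytes each), not the 256-byte padding,
      // so arrayLength() in the kernel equals the true vertex count.
      { binding: 0, resource: { buffer: tiles.storage, size: tiles.vertexCount * 16 } },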
      { binding: 1, resource: { buffer: lodOut } },
      { binding: 2, resource: { buffer: transformUBO } },
    ],
  });

  const enc = device.createCommandEncoder();
  const pass = enc.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(tiles.vertexCount / 64));
  pass.end();
  device.queue.submit([enc.finish()]);

  // Fence: resolve only after the GPU has finished this submission, so the
  // caller can safely evict the source tile or recycle the staging slot.
  await device.queue.onSubmittedWorkDone();
  return lodOut;
}
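
One reason to round every tile to 256 bytes is that several tiles can then share a single storage buffer and be bound at offsets that satisfy the default minStorageBufferOffsetAlignment. The sketch below shows that offset planning as plain arithmetic with no GPU calls. `planSlots`, `TileSlot`, and `SLOT_ALIGN` are hypothetical names that do not appear in the reference flow above, and the 256-byte figure assumes the device reports the default limit.

```typescript
// Hypothetical helper: lay tiles out back to back in one shared buffer.
// Assumes the device's minStorageBufferOffsetAlignment is the default 256.
const SLOT_ALIGN = 256;

interface TileSlot {
  offset: number;     // byte offset of the tile inside the shared buffer
  byteLength: number; // real payload size, usable as the binding size
  reserved: number;   // payload size rounded up to SLOT_ALIGN
}

function alignUp(n: number, align: number): number {
  return Math.ceil(n / align) * align;
}

function planSlots(payloadSizes: number[]): { slots: TileSlot[]; total: number } {
  const slots: TileSlot[] = [];
  let cursor = 0;
  for (const byteLength of payloadSizes) {
    // Each record is one vec4<f32>, so a payload must be a whole number of them.
    if (byteLength % 16 !== 0) {
      throw new RangeError(`payload of ${byteLength} bytes is not whole vec4 records`);
    }
    const reserved = alignUp(byteLength, SLOT_ALIGN);
    slots.push({ offset: cursor, byteLength, reserved });
    cursor += reserved;
  }
  return { slots, total: cursor };
}

// Three tiles of 10, 16 and 100 vertices (16 bytes each).
const plan = planSlots([160, 256, 1600]);
console.log(plan.slots.map((s) => s.offset), plan.total);
```

For these three tiles the slots start at byte offsets 0, 256, and 512, and the shared buffer needs 2304 bytes. Binding each slot with its real `byteLength` rather than `reserved` keeps the padding out of the kernel's view.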

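The companion WGSL kernel reads the storage buffer as array<vec4<f32>>, transforms each position into the working frame, and writes a per-vertex LOD index. Keep the explicit arrayLength guard: a dispatch rounded up to a multiple of 64 over-runs the real vertex count whenever that count is not itself a multiple of 64, and WebGPU leaves the result of an out-of-bounds storage access implementation-defined (the index may be clamped, so a stray write can land on a valid element). Note that arrayLength reports the length of the bound range, so the guard is exact only when the binding covers the real records and not the 256-byte padding.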

wgsl

@group(0) @binding(0) var<storage, read>       tile_vertices: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> lod_indices:   array<u32>;
@group(0) @binding(2) var<uniform>             transform:     mat4x4<f32>;

fn select_lod(view_pos: vec4<f32>) -> u32 {
    let dist = length(view_pos.xyz);
    if (dist < 100.0)  { return 0u; } // near: full-resolution tile
    if (dist < 500.0)  { return 1u; } // mid:  decimated mesh
    return 2u;                         // far:  imposter / billboard
}

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&tile_vertices)) { return; } // guard the tail workgroup
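    // Rebuild the homogeneous coordinate: the packed w lane may carry an
    // attribute (intensity / class) rather than 1.0.
    let p = tile_vertices[i];
    lod_indices[i] = select_lod(transform * vec4<f32>(p.xyz, 1.0));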
}

LOD distance bands are evaluated against the view-space distance $\lVert (M \cdot p)_{xyz} \rVert_2$, and the dispatch count is $\lceil V / 64 \rceil$ workgroups for $V$ vertices at a @workgroup_size of 64.
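
To sanity-check kernel output from the host, the same arithmetic can be mirrored on the CPU. The sketch below is a reference under stated assumptions and not part of the pipeline itself: `transformPoint`, `selectLodCpu`, and `workgroupCount` are hypothetical names, the matrix is column-major as in syncFrame, and the 100 / 500 thresholds are copied from select_lod.

```typescript
type Vec3 = [number, number, number];

// Multiply a column-major 4x4 matrix by (x, y, z, 1) and return the xyz part.
function transformPoint(m: readonly number[], p: Vec3): Vec3 {
  const [x, y, z] = p;
  return [
    m[0] * x + m[4] * y + m[8] * z + m[12],
    m[1] * x + m[5] * y + m[9] * z + m[13],
    m[2] * x + m[6] * y + m[10] * z + m[14],
  ];
}

// Same distance bands as the WGSL select_lod function.
function selectLodCpu(viewPos: Vec3): number {
  const dist = Math.hypot(viewPos[0], viewPos[1], viewPos[2]);
  if (dist < 100) return 0;
  if (dist < 500) return 1;
  return 2;
}

// Number of workgroups needed to cover every vertex, including the tail.
function workgroupCount(vertexCount: number, workgroupSize = 64): number {
  return Math.ceil(vertexCount / workgroupSize);
}

// Identity rotation with a translation of -50 along z (column-major).
const viewMatrix = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, -50, 1];
const points: Vec3[] = [[0, 0, 0], [0, 0, 400], [0, 0, 900]];
const lods = points.map((p) => selectLodCpu(transformPoint(viewMatrix, p)));
console.log(lods, workgroupCount(1000));
```

The three sample points end up 50, 350, and 850 units from the origin of the view frame, so they fall in bands 0, 1, and 2. A tile of 1000 vertices needs 16 workgroups, which launches 1024 invocations; the last 24 are the tail that the arrayLength guard turns away.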

Parameter and configuration reference

Value	Where	Default / used	Spatial-workload guidance
`ALIGN` (buffer rounding)	`align256`	256 bytes	Matches the default `minStorageBufferOffsetAlignment`, so rounded buffers can be sub-allocated and bound at slot boundaries. `copyBufferToBuffer` itself only needs offsets and sizes in multiples of 4.
Vertex record stride	host packer + WGSL	16 bytes (`vec4<f32>`)	Promote `vec3` positions to `vec4`; carry intensity/classification in `w` instead of wasting the pad.
`@workgroup_size`	WGSL kernel	64	Safe portable value; a multiple of subgroup width on most adapters. Profile 64/128/256 for your device class.
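Dispatch count	`dispatchWorkgroups`	`ceil(vertexCount / 64)`	Must cover the tail; pair with the `arrayLength` guard or tail invocations can overwrite valid LOD slots. Stay under `maxComputeWorkgroupsPerDimension` (65535 by default, about 4.19 million vertices at size 64).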
LOD bands	`select_lod`	100 / 500 m	Tie to camera velocity and tile geometric error, not fixed metres, for steady frame pacing.
Output buffers	`lodA` / `lodB`	2 (double-buffer)	Prevents read-after-write between the frame producing indices and the frame consuming them.
`maxStorageBufferBindingSize`	device limits	128 MiB	A single tile’s vertex buffer must fit one binding; chunk oversize tiles into multiple dispatches.
`maxBufferSize`	device limits	256 MiB	Caps the size of any single buffer, not total residency. Total residency is bounded by device memory and drives the eviction policy fenced by `onSubmittedWorkDone`.
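
The `maxStorageBufferBindingSize` row implies a chunking step for tiles larger than one binding. A minimal sketch of that split as pure arithmetic follows. `chunkTile` and `BindingRange` are hypothetical names, and the defaults assume 16-byte records and a 256-byte binding offset alignment (the default limit); read the real values from `device.limits` in practice.

```typescript
interface BindingRange {
  offset: number;      // byte offset into the tile's storage buffer
  size: number;        // byte size of this binding
  vertexCount: number; // records covered by this binding
}

function chunkTile(
  byteLength: number,
  maxBindingSize: number,
  stride = 16,
  offsetAlign = 256,
): BindingRange[] {
  if (byteLength % stride !== 0) throw new RangeError("payload is not whole records");
  if (offsetAlign % stride !== 0) throw new RangeError("alignment must be a multiple of the stride");
  // Largest chunk that fits the limit and keeps the next offset aligned.
  const chunkBytes = Math.floor(maxBindingSize / offsetAlign) * offsetAlign;
  if (chunkBytes === 0) throw new RangeError("binding limit is below the offset alignment");
  const ranges: BindingRange[] = [];
  for (let offset = 0; offset < byteLength; offset += chunkBytes) {
    const size = Math.min(chunkBytes, byteLength - offset);
    ranges.push({ offset, size, vertexCount: size / stride });
  }
  return ranges;
}

// 5000 vertices against a deliberately small 32 KiB limit, for illustration.
console.log(chunkTile(5000 * 16, 32 * 1024));
```

With the illustrative 32 KiB limit, the 80000-byte tile splits into two bindings of 2048 vertices and a final one of 904. Each range can serve as the offset and size of a bind group entry and be dispatched on its own; the matching range in the LOD output buffer needs the same bookkeeping, which this sketch leaves out.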

Failure modes specific to this sub-topic

Second-vertex corruption from a vec3 stride. Packing positions as 12-byte vec3<f32> while the shader reads array<vec4<f32>> shifts every record after the first; coordinates fold together with no validation error. Detection: geometry looks sheared, first vertex correct. Fix: pack vec4 records (16 bytes) on both sides, per the alignment reference.
Stale LOD indices (read-after-write hazard). Reusing a single output buffer means the render pass can sample indices while the next compute pass is overwriting them. Detection: LOD popping or flicker correlated with camera motion. Fix: the lodA/lodB double-buffer keyed on frameParity, so the consumer reads last frame’s stable buffer.
Tile evicted while in flight (use-after-free). Destroying or recycling the source tile or staging slot before every submission that references it has completed invalidates that work: a submit that references a destroyed buffer fails validation, and a slot rewritten too early feeds the next dispatch the wrong tile. Detection: intermittent GPUValidationError or zeroed output under load. Fix: gate eviction on await device.queue.onSubmittedWorkDone() as in syncFrame.
Tail-workgroup over-read. A dispatch rounded up to ceil(V/64) workgroups launches invocations past the real vertex count; without the arrayLength guard those invocations index out of bounds, where WebGPU’s behavior is implementation-defined and a clamped write can overwrite a valid LOD slot. Detection: a handful of vertices at the buffer end carry wrong LOD. Fix: keep the if (i >= arrayLength(&tile_vertices)) { return; } early-out.
Device lost mid-stream. A driver reset or TDR timeout on an oversized tile dispatch invalidates every buffer and pipeline. Detection: device.lost resolves. Fix: re-acquire the device, re-upload resident tiles, and degrade through browser support and fallback routing strategies when re-acquisition fails.
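
The eviction fix above can be kept off the hot path by retiring tiles into a queue and releasing them only once the fence for the last frame that used them has resolved. A minimal sketch of that bookkeeping, with no GPU calls, follows. `EvictionQueue`, `retire`, and `frameDone` are hypothetical names, and the caller is assumed to invoke `frameDone` from the continuation of device.queue.onSubmittedWorkDone() for that frame.

```typescript
class EvictionQueue {
  // tile id -> index of the last frame whose submission referenced the tile
  private pending = new Map<string, number>();

  // Mark a tile as no longer wanted; it stays resident until its frame is done.
  retire(tileId: string, lastUsedFrame: number): void {
    const prev = this.pending.get(tileId);
    this.pending.set(
      tileId,
      prev === undefined ? lastUsedFrame : Math.max(prev, lastUsedFrame),
    );
  }

  // Call once the fence for `frame` has resolved; returns ids now safe to free.
  frameDone(frame: number): string[] {
    const safe: string[] = [];
    this.pending.forEach((last, id) => {
      if (last <= frame) safe.push(id);
    });
    for (const id of safe) this.pending.delete(id);
    return safe;
  }
}

const queue = new EvictionQueue();
queue.retire("tile-a", 3);
queue.retire("tile-b", 5);
console.log(queue.frameDone(3)); // only tile-a was last used at or before frame 3
console.log(queue.frameDone(5)); // tile-b becomes safe once frame 5 is fenced
```

Because fences on one queue resolve in submission order, a tile retired at frame N is never released before the GPU has finished frame N, and the frame loop does not need to await anything extra for tiles it is not evicting.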

Backend / Python interop note

The Python tile packer and the WGSL struct must agree byte-for-byte, or every failure above reappears on the wire. Pack positions as 16-byte vec4<f32> records (xyz plus a 1.0 homogeneous lane, or a useful attribute), keep the same field order the shader binds, and pin little-endian so heterogeneous pre-serialization workers stay consistent. Compress with Zstd for transport, but the decompressed layout is what must match the storage buffer.

python

import struct
import zstandard as zstd

def pack_tile_payload(positions, w_lane):
    """Pack Cesium tile vertices as tightly strided vec4<f32> records.

    positions: flat list of floats, 3 per vertex (x, y, z in ECEF).
    w_lane:    per-vertex float carried in the 4th lane (intensity / class).
    Output decompresses to a byte-exact match for array<vec4<f32>> in WGSL.
    """
    assert len(positions) % 3 == 0, "positions must hold whole xyz triples"
    n = len(positions) // 3
    assert len(w_lane) == n, "one w-lane value per vertex"
    flat = []
    for v in range(n):
        flat.extend(positions[v * 3 : v * 3 + 3])  # x, y, z
        flat.append(w_lane[v])                      # w -> vec4 lane
    raw = struct.pack(f"<{len(flat)}f", *flat)      # little-endian f32
    return zstd.ZstdCompressor(level=6).compress(raw)

When the source is GeoParquet or an Arrow RecordBatch, store the four lanes as one fixed-size-list column of four float32 values, so the child values buffer is already interleaved as x, y, z, w and can be taken as a zero-copy view. Four separate float32 columns are laid out column by column and would need a host-side interleave first. It is the same byte-parity discipline that lets queue.writeBuffer or the staging copy land the upload without a host-side rebuild.
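
On the browser side, the decompressed bytes can be checked against the same contract before upload. The sketch below assumes the payload has already been decompressed by whatever Zstd decoder the page uses (not shown). `checkPayload`, `readVertex`, and `RECORD_BYTES` are hypothetical names. DataView is used with an explicit little-endian flag so the check does not depend on host byte order.

```typescript
const RECORD_BYTES = 16; // one vec4<f32>

// Reject payloads whose size does not match the expected vec4 record count,
// which is how a 12-byte vec3 stride shows up before it reaches the GPU.
function checkPayload(bytes: ArrayBuffer, expectedVertices: number): void {
  const expected = expectedVertices * RECORD_BYTES;
  if (bytes.byteLength !== expected) {
    throw new RangeError(`expected ${expected} bytes, got ${bytes.byteLength}`);
  }
}

// Read one record as little-endian f32 lanes: x, y, z, w.
function readVertex(bytes: ArrayBuffer, index: number): [number, number, number, number] {
  const view = new DataView(bytes, index * RECORD_BYTES, RECORD_BYTES);
  return [
    view.getFloat32(0, true),
    view.getFloat32(4, true),
    view.getFloat32(8, true),
    view.getFloat32(12, true),
  ];
}

// Build a two-vertex payload laid out the way the Python packer emits it.
const demoPayload = new ArrayBuffer(2 * RECORD_BYTES);
const writer = new DataView(demoPayload);
[1, 2, 3, 0.5, 4, 5, 6, 0.25].forEach((v, i) => writer.setFloat32(i * 4, v, true));

checkPayload(demoPayload, 2);
console.log(readVertex(demoPayload, 1));
```

The second record reads back as 4, 5, 6 with 0.25 in the w lane. A payload packed with a 12-byte stride would be 24 bytes for the same two vertices and would fail `checkPayload` before any buffer is created.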

Related

CesiumJS Mapping Pipeline Optimization: the full tile pipeline this hand-off plugs into, covering compute LOD culling, streaming, and telemetry.
Memory Alignment for Spatial Data Buffers: the WGSL stride and padding rules the tile struct depends on.
WebGPU Compute vs Render Pipeline Fundamentals: why the LOD pass is a compute program and how its output reaches the render pass.
deck.gl Layer Integration with WebGPU: converging several binary streams into one dispatch without framework mediation.
Browser Support and Fallback Routing Strategies: degrading to WebGL when a device is lost or unavailable.
