Memory Alignment for Spatial Data Buffers

A point cloud of forty million LiDAR returns, packed three floats to a point, looks like 480 MB of geometry. Upload it to a WebGPU storage buffer under the naive assumption that vec3<f32> occupies twelve contiguous bytes and the rasterizer reads garbage: every record after the first is shifted, coordinates fold into each other, and the validation layer may never complain because nothing about the buffer is technically illegal — it is just laid out for a memory model the GPU does not use. WebGPU replaces WebGL’s driver-dependent padding heuristics with a deterministic, specification-defined layout derived from the WGSL type system, and that determinism is the whole point: get the alignment math right once and the same bytes are read identically across every conformant adapter. This reference covers the alignment and stride rules that govern uniform and storage buffers of spatial data, the structure-of-arrays layouts that keep dense geographic records dense, the layout parity required between a compute pipeline and the render pass that consumes its output, and the Python-side serialization that must produce byte-identical records before upload. It sits within the broader WebGPU Architecture for Spatial Visualization section, which frames where buffer layout fits in the full device-to-canvas pipeline.

Prerequisites

This reference assumes you are comfortable with the explicit WebGPU resource model and are debugging or designing a real spatial layout, not learning the API from scratch. Before working through it you should have:

A WebGPU device successfully acquired and validated — adapter selection, limit probing, and optional-feature negotiation are covered in initializing WebGPU devices for GIS workloads. The alignment rules here are independent of which adapter you got, but the maxStorageBufferBindingSize and maxBufferSize limits that bound your buffers come from device negotiation.
A current implementation: Chrome/Edge 113+ on Windows/macOS/ChromeOS, or Chrome 121+ on Linux behind a flag; Safari 26+ (macOS Tahoe, iOS 26), with earlier Safari releases exposing WebGPU only behind a feature flag; Firefox 141+ on Windows. Behaviour described here follows the stable W3C WebGPU and WGSL specifications, not vendor extensions. Where a target platform cannot supply a device at all, layout is moot; route through fallback routing strategies instead.
Working knowledge of WGSL scalar and vector types (f32, u32, i32, vec2/vec3/vec4, mat4x4) and the difference between var<uniform> and var<storage> address spaces, which carry different layout constraints.
Spatial data already in typed arrays or convertible to them: Float32Array/Uint32Array views over an ArrayBuffer, or a Python pipeline emitting numpy structured arrays, GeoParquet, or Arrow record batches that you control the schema of.

WGSL alignment and stride reference

Every layout decision reduces to four rules from the WGSL specification: each type has an AlignOf and a SizeOf; a member's byte offset is rounded up to its alignment; a struct's own alignment is the largest alignment of its members; and a struct's size is its last member's end offset rounded up to the struct alignment. The uniform address space adds a stricter constraint: the element stride of every array must be a multiple of 16 bytes, and a struct nested as a member must start on a 16-byte boundary. WGSL does not insert that padding for you; a type that violates the constraint is rejected when the shader is compiled, so the padding has to be written into the type. That is why uniform buffers waste more space than storage buffers for the same data. The table below is the working subset you need for spatial records.

WGSL type	Align (bytes)	Size (bytes)	Spatial-layout note
`f32`, `u32`, `i32`	4	4	A single elevation, feature id, or LOD index.
`vec2<f32>`	8	8	A planar `(lon, lat)` or `(x, y)` pair — packs tightly.
`vec3<f32>`	16	12	A 3D position; array stride is 16, so 4 bytes of pad per element unless a 4-byte scalar follows it in a struct.
`vec4<f32>`	16	16	A position+1, colour, or padded 3-vector — no waste.
`mat4x4<f32>`	16	64	One projection/view/model transform.
`array<T, N>` (storage)	align(T)	N × roundUp(align(T), size(T))	Stride = element size rounded to its own align.
`array<T, N>` (uniform)	roundUp(16, align(T))	N × roundUp(align(T), size(T))	Stride must already be a multiple of 16; `array<f32, N>` is rejected, so use `array<vec4<f32>, N>`.
`struct`	max member align	last offset + size, rounded to struct align	Stride padded; `array<Struct>` inherits the padded stride.
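
The four rules above can be sketched as a small offset calculator. This is a minimal illustration, not a WGSL parser: the `WGSL_TYPES` table, `round_up`, and `layout_struct` are hypothetical helper names, and only the types listed in the table are covered.

```python
from dataclasses import dataclass

# (AlignOf, SizeOf) in bytes for the WGSL types in the table above.
WGSL_TYPES = {
    "f32": (4, 4),
    "u32": (4, 4),
    "i32": (4, 4),
    "vec2<f32>": (8, 8),
    "vec3<f32>": (16, 12),
    "vec4<f32>": (16, 16),
    "mat4x4<f32>": (16, 64),
}


def round_up(k: int, n: int) -> int:
    """Smallest multiple of k that is >= n (the spec's roundUp(k, n))."""
    return -(-n // k) * k


@dataclass(frozen=True)
class StructLayout:
    offsets: dict[str, int]
    align: int
    size: int


def layout_struct(members: list[tuple[str, str]]) -> StructLayout:
    """Apply the offset, alignment, and size rules to ordered (name, type) members."""
    offsets: dict[str, int] = {}
    cursor = 0
    struct_align = 1
    for name, wgsl_type in members:
        align, size = WGSL_TYPES[wgsl_type]
        cursor = round_up(align, cursor)         # offset rounded up to member alignment
        offsets[name] = cursor
        cursor += size
        struct_align = max(struct_align, align)  # struct alignment = largest member alignment
    # Struct size = end of last member, rounded up to the struct alignment.
    return StructLayout(offsets, struct_align, round_up(struct_align, cursor))


# A position followed by a scalar: the f32 lands in the vec3's trailing pad.
point = layout_struct([("position", "vec3<f32>"), ("intensity", "f32")])
print(point)
```

Running the same function over two consecutive `vec3<f32>` members places the second at offset 16 and yields a 32-byte struct, which is the padding the host side has to reproduce.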

The single rule that causes the most silent corruption is the vec3<f32> one: it is the only common spatial type whose size (12) is smaller than its alignment (16), so an array<vec3<f32>> has a 16-byte stride. Any host code that assumes a three-float position is twelve packed bytes will desynchronize from the GPU on the second element. The standard fixes are to promote it to vec4<f32> and carry a useful value (intensity, classification, timestamp) in the fourth lane, to follow the vec3<f32> with a 4-byte scalar inside a struct so the scalar occupies the pad, or to split the components across a structure-of-arrays layout where each axis is its own tightly packed array<f32>.
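
As a sketch of the vec4 promotion, the host-side packer below writes each point into a 16-byte record using only the standard library. The function name `pack_points_vec4` and the `fourth_lane` parameter are illustrative, not part of any API.

```python
import struct

POINT_STRIDE = 16  # bytes: SizeOf(vec4<f32>), also the array stride of vec3<f32>


def pack_points_vec4(points, fourth_lane=None):
    """Pack (x, y, z) tuples into 16-byte little-endian records.

    fourth_lane optionally supplies one float per point (for example an
    intensity); when omitted the lane is zero-filled padding.
    """
    if fourth_lane is None:
        fourth_lane = [0.0] * len(points)
    if len(fourth_lane) != len(points):
        raise ValueError("fourth_lane must have one value per point")
    buf = bytearray(POINT_STRIDE * len(points))
    for i, ((x, y, z), w) in enumerate(zip(points, fourth_lane)):
        struct.pack_into("<4f", buf, i * POINT_STRIDE, x, y, z, w)
    return bytes(buf)


points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
packed = pack_points_vec4(points, fourth_lane=[0.5, 0.25])

# The second record starts at byte 16; reading at byte 12 picks up the
# fourth lane of the first record instead of the second position.
second = struct.unpack_from("<3f", packed, POINT_STRIDE)
misread = struct.unpack_from("<3f", packed, 12)
```

Here `misread` begins with the first record's fourth lane, which is the shape the second-element desynchronization takes when a host assumes twelve-byte records.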

Implementation walkthrough

Step 1 — Lay out the struct with explicit offsets

Start from the data, not the shader. A tile-index record needs two corners, an elevation range, and a level. Written as a WGSL struct with the byte offsets computed by hand, the padding requirement becomes visible rather than implicit.

wgsl

struct SpatialTile {
    // Three vec2<f32> pairs pack contiguously at 8-byte alignment.
    min_coord:       vec2<f32>, // offset  0, size 8
    max_coord:       vec2<f32>, // offset  8, size 8
    elevation_range: vec2<f32>, // offset 16, size 8
    tile_level:      u32,       // offset 24, size 4
    // Explicit pad makes the 32-byte stride a deliberate choice, not an
    // accident the compiler inserts where you cannot see it.
    _padding:        u32,       // offset 28, size 4
};                              // AlignOf = 8, SizeOf = 32

@group(0) @binding(0) var<storage, read> tiles: array<SpatialTile>;

The struct’s alignment is 8 (the largest member alignment, from the vec2<f32>s), and its size rounds to 32. Because every member offset is already a multiple of its own alignment, the compiler inserts nothing implicitly — what you see is the exact stride. That property is what lets the Python and TypeScript sides reproduce the layout without guessing.

Step 2 — Size and create the buffer from the stride, not a literal

Never hard-code a byte count. Derive the buffer size from the element stride and the record count so a later field change cannot silently desynchronize the allocation.

typescript

const TILE_STRIDE = 32;             // bytes per SpatialTile, matches WGSL SizeOf
const tileCount = tiles.length;

const tileBuffer = device.createBuffer({
  // Storage stride need only respect the struct's own alignment (8 here),
  // so 32 is already valid without rounding to 16.
  size: TILE_STRIDE * tileCount,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

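For a var<uniform> block the calculation differs: array element stride must be a multiple of 16. The 32-byte SpatialTile already conforms, but a 24-byte struct with 8-byte alignment would have a 24-byte stride and be rejected at shader compilation, because WGSL does not pad it for you. Pad such a struct to 32 bytes with an explicit member or an @size/@align attribute. When in doubt, compute the stride as roundUp(alignof, sizeof) and, for uniform arrays, confirm that the result is a multiple of 16.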
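
The stride rule for the two address spaces can be sketched as a helper that a build script might call before allocating. The names `array_stride` and `buffer_size` are hypothetical; the check mirrors the WGSL requirement that a uniform-space array stride be a multiple of 16 rather than silently padding.

```python
def round_up(k: int, n: int) -> int:
    """Smallest multiple of k that is >= n."""
    return -(-n // k) * k


def array_stride(elem_align: int, elem_size: int, address_space: str) -> int:
    """Element stride of array<T> in bytes.

    Raises ValueError where WGSL would reject the type: a uniform-space
    array whose stride is not a multiple of 16.
    """
    if address_space not in ("storage", "uniform"):
        raise ValueError(f"unknown address space: {address_space!r}")
    stride = round_up(elem_align, elem_size)
    if address_space == "uniform" and stride % 16 != 0:
        raise ValueError(
            f"stride {stride} is not a multiple of 16; pad the element to "
            f"{round_up(16, stride)} bytes"
        )
    return stride


def buffer_size(record_count: int, stride: int) -> int:
    """Allocation size derived from the stride, never from a literal."""
    return record_count * stride


TILE_STRIDE = array_stride(8, 32, "storage")  # SpatialTile: AlignOf 8, SizeOf 32
tile_bytes = buffer_size(1_000_000, TILE_STRIDE)
```

Calling `array_stride(8, 24, "uniform")` raises with a message suggesting a 32-byte element, which is the manual padding the shader type then has to declare.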

Step 3 — Pack the typed array against the same offsets

On the JavaScript side, write into a single ArrayBuffer using a DataView (or strided typed-array views) at exactly the offsets the WGSL struct declares. A DataView makes the offsets explicit and lets you pin little-endian order defensively.

typescript

function packTiles(records: TileRecord[]): ArrayBuffer {
  const buffer = new ArrayBuffer(TILE_STRIDE * records.length);
  const view = new DataView(buffer);
  records.forEach((r, i) => {
    const base = i * TILE_STRIDE;
    view.setFloat32(base + 0,  r.minX,   true); // little-endian
    view.setFloat32(base + 4,  r.minY,   true);
    view.setFloat32(base + 8,  r.maxX,   true);
    view.setFloat32(base + 12, r.maxY,   true);
    view.setFloat32(base + 16, r.minZ,   true);
    view.setFloat32(base + 20, r.maxZ,   true);
    view.setUint32 (base + 24, r.level,  true);
    // bytes 28-31 left zero — the explicit _padding lane.
  });
  return buffer;
}

device.queue.writeBuffer(tileBuffer, 0, packTiles(records));

Step 4 — Choose structure-of-arrays for large point clouds

For small, fixed-size records that are always read whole, the array-of-structures layout above is fine. For clouds of millions of points where memory bandwidth is the bottleneck, a structure-of-arrays (SoA) layout is often faster when a pass needs only some of the attributes: store every X in one tightly packed array<f32>, every Y in another, and so on. Each array is dense (no per-element padding, because f32 size equals its alignment), and a compute or vertex shader reading only the coordinates it needs touches fewer cache lines.

wgsl

// SoA: each attribute is its own dense buffer, indexed by point id.
@group(0) @binding(0) var<storage, read> xs:         array<f32>;
@group(0) @binding(1) var<storage, read> ys:         array<f32>;
@group(0) @binding(2) var<storage, read> zs:         array<f32>;
@group(0) @binding(3) var<storage, read> intensity:  array<u32>;

@vertex
fn vs_main(@builtin(vertex_index) i: u32) -> @builtin(position) vec4<f32> {
    let p = vec3<f32>(xs[i], ys[i], zs[i]);
    // ...apply transform; intensity sampled only in the fragment stage.
    return vec4<f32>(p, 1.0);
}

The trade-off is binding-slot pressure (each attribute consumes a @binding, and the default maxStorageBuffersPerShaderStage limit is 8) against bandwidth: a heatmap pass that needs only xs/ys never fetches zs or intensity, which array-of-structures cannot avoid because the fields are interleaved within each cache line.
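
On the host, producing the SoA buffers is a column split. The sketch below uses the standard-library `array` module and compares the resulting footprint with a padded `array<vec3<f32>>`; `split_soa` is a hypothetical helper, and `array("f")` serializes in the host's native byte order, which is assumed here to be little-endian.

```python
from array import array

VEC3_ARRAY_STRIDE = 16  # bytes per element of array<vec3<f32>>


def split_soa(points):
    """Split (x, y, z) tuples into three dense float32 arrays."""
    xs, ys, zs = array("f"), array("f"), array("f")
    for x, y, z in points:
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]
xs, ys, zs = split_soa(points)

# Each array uploads as-is: len(a) * 4 bytes, no per-element padding.
soa_bytes = sum(len(a) * a.itemsize for a in (xs, ys, zs))
aos_bytes = len(points) * VEC3_ARRAY_STRIDE
print(f"SoA: {soa_bytes} bytes, padded vec3 array: {aos_bytes} bytes")
```

Each of `xs.tobytes()`, `ys.tobytes()`, and `zs.tobytes()` is then the payload for one binding.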

Step 5 — Keep compute and render layouts identical

When a compute pass transforms geometry that a later render pass draws, both stages must agree on the struct byte-for-byte. WGSL has no import statement, so keep the struct definition in one shared source string that the host prepends to both shader modules, and create the buffer once with the usage flags both stages need. Detail on when to split work across the two pipeline kinds lives in the compute vs render pipeline fundamentals reference; the alignment-specific rule is simply that a divergent field order or pad between the two modules produces a GPUValidationError only when it changes the minimum binding size, and otherwise produces silent misreads.

typescript

const transformed = device.createBuffer({
  size: TILE_STRIDE * tileCount,
  // One buffer, both stages: the compute pass writes, the render pass reads.
  usage:
    GPUBufferUsage.STORAGE |
    GPUBufferUsage.VERTEX  |
    GPUBufferUsage.COPY_SRC,
});

Memory and performance implications

Alignment is not only a correctness rule; it sets the VRAM cost and the bandwidth ceiling of a spatial application. The decisive number is how many useful bytes survive per record after padding.

The vec3 tax. Storing (lon, lat, elevation) as a vec3<f32> in an array consumes 16 bytes per point but carries only 12 of payload — a 33% inflation. Across forty million points that is 160 MB of VRAM spent on padding alone. Promoting to vec4<f32> and filling the fourth lane with intensity or classification recovers the waste as data; SoA eliminates it entirely.
Uniform vs storage cost. The uniform address space requires array element stride to be a multiple of 16 bytes, so array<f32, N> is not valid in a uniform block at all. Scalars must be declared as array<vec4<f32>, N> (four per element) or wrapped one per 16-byte element, and the one-per-element form costs four times the storage-buffer footprint. Per-frame transform blocks belong in uniforms (small, frequently updated); bulk geometry belongs in storage buffers where the dense stride applies. The narrower task of laying out that per-frame transform block is covered in structuring uniform buffers for coordinate alignment.
Cache-line coalescing. GPUs fetch memory in fixed-size lines or sectors, commonly 32 to 128 bytes. A 16-byte-aligned, densely packed struct lets adjacent invocations read adjacent records without straddling lines; a sparsely padded layout spends part of every fetch on padding and lowers effective bandwidth. In point-cloud streaming, packing attributes into aligned structs (or dense SoA arrays) can cut fetch latency by roughly 15–30% compared with tightly packed but unaligned layouts, though the gain depends on hardware and access pattern and is worth confirming by profiling.
Transfer cost. queue.writeBuffer() copies host bytes verbatim; neither the API nor the GPU repacks them. If the host array is already in GPU layout (the goal of the serialization section below), upload is a straight memcpy; if it is not, you pay a CPU pass to repack before every upload, which dominates the cost for streaming workloads.
Workgroup sizing. A compute pass that initializes or transforms these buffers should size its workgroup to a multiple of the hardware subgroup width (commonly 32 or 64) so that aligned, coalesced reads map onto full waves. Dispatch ceil(tileCount / workgroupSize) workgroups and bounds-check the tail invocation against the record count.
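
The dispatch arithmetic from the workgroup-sizing point can be sketched as follows. It assumes the default maxComputeWorkgroupsPerDimension limit of 65535 and folds the overflow into a second dispatch dimension; `dispatch_size` is a hypothetical helper name, and the real limit should be read from the device.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, exact for arbitrarily large counts."""
    return -(-a // b)


def dispatch_size(record_count: int, workgroup_size: int = 64,
                  max_per_dim: int = 65535) -> tuple[int, int]:
    """Return (x, y) workgroup counts whose product covers every record."""
    if record_count <= 0 or workgroup_size <= 0:
        raise ValueError("record_count and workgroup_size must be positive")
    groups = ceil_div(record_count, workgroup_size)
    if groups <= max_per_dim:
        return groups, 1
    # Too many workgroups for one dimension: spread them over rows.
    rows = ceil_div(groups, max_per_dim)
    if rows > max_per_dim:
        raise ValueError("record count exceeds a two-dimensional dispatch")
    return ceil_div(groups, rows), rows


x, y = dispatch_size(40_000_000)  # the forty-million-point cloud
print(f"dispatchWorkgroups({x}, {y})")
```

With a two-dimensional dispatch the shader has to rebuild a linear index from the workgroup id and the x count before the bounds check, since `x * y * workgroup_size` may exceed the record count.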

Failure modes and diagnostics

GPUValidationError at createBindGroup / createComputePipeline. The resources disagree with what the shader declares: a buffer binding smaller than the minimum size of the declared type, a missing usage flag, or a binding type that does not match the pipeline layout. A wrong field order or a missing pad that leaves the total size unchanged does not trigger this error; that case is the silent drift described next. Wrap resource creation in device.pushErrorScope('validation') … await device.popErrorScope() during development so the message names the offending binding. This is the good failure: it catches size and binding bugs before any draw.
Silent coordinate drift (no error). The most dangerous mode. Host offsets and WGSL offsets diverge — typically the vec3 pad omitted on the host, or a u32 written where the struct expects padding. Nothing is illegal, so no error fires; the symptom is geometry that shifts progressively along the array. Detect it by reading the buffer back: stage into a MAP_READ | COPY_DST buffer with copyBufferToBuffer, await buffer.mapAsync(GPUMapMode.READ), and compare the mapped bytes against the expected offsets of a known record.
GPUOutOfMemoryError or limit violation on allocation. A padded layout can push a large tile or point buffer past maxBufferSize (a validation error at createBuffer), past maxStorageBufferBindingSize (a validation error at createBindGroup), or past the memory actually available (a GPUOutOfMemoryError, captured with an 'out-of-memory' error scope). Recover by switching to a denser layout (SoA or vec4 packing), splitting the buffer across multiple bindings, or shedding resolution. Probe the limits during device initialization rather than discovering them at allocation time.
Device lost during a long upload. A driver reset or GPU reclamation resolves the device.lost promise mid-stream (the promise never rejects), and the device's resources become unusable. Treat it as recoverable: request a new device and re-upload from the host arrays, which is cheap precisely because they are already in GPU layout.
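
Once the buffer has been read back, locating the drift is a byte comparison against a known-good serialization. The sketch below assumes the mapped bytes have been saved for offline inspection and that records follow the 32-byte SpatialTile layout; `first_mismatch` and the flattened field names are illustrative.

```python
import struct

TILE = struct.Struct("<2f2f2fII")  # SpatialTile: 32 bytes, little-endian
FIELDS = ("min_x", "min_y", "max_x", "max_y",
          "min_z", "max_z", "tile_level", "_padding")


def first_mismatch(expected: bytes, actual: bytes):
    """Return (record_index, field, expected_value, actual_value) or None."""
    if len(expected) != len(actual) or len(expected) % TILE.size:
        raise ValueError("buffers must be equal length and a whole number of records")
    for base in range(0, len(expected), TILE.size):
        if expected[base:base + TILE.size] == actual[base:base + TILE.size]:
            continue
        want = TILE.unpack_from(expected, base)
        got = TILE.unpack_from(actual, base)
        # Every lane is 4 bytes, so lane i sits at base + 4 * i.
        for lane, name in enumerate(FIELDS):
            lo = base + 4 * lane
            if expected[lo:lo + 4] != actual[lo:lo + 4]:
                return base // TILE.size, name, want[lane], got[lane]
    return None


r0 = (0.0, 0.0, 1.0, 1.0, 5.0, 9.0, 3)
r1 = (10.0, 10.0, 11.0, 11.0, 2.0, 4.0, 4)
expected = b"".join(TILE.pack(*r, 0) for r in (r0, r1))

# Simulate a host packer that forgot the pad: 28-byte records, zero-filled tail.
drifted = b"".join(struct.pack("<6fI", *r) for r in (r0, r1)) + bytes(8)

print(first_mismatch(expected, drifted))
```

Comparing raw lanes rather than decoded floats keeps the check exact even when a record contains NaN. In the simulated case the first divergence is the pad lane of record 0, which now holds the start of the next record.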

A durable guard against the silent-drift mode is a single canonical schema shared across WGSL, TypeScript, and the backend, plus a CI check that parses the WGSL source for declared offsets and compares them against the backend dtype and the TypeScript packer. An assertion as small as the one below catches the common stride mistakes before they reach the rasterizer.

typescript

console.assert(
  TILE_STRIDE % 8 === 0,                 // struct AlignOf for SpatialTile
  'Tile stride violates struct alignment',
);
console.assert(
  tileBuffer.size === TILE_STRIDE * tileCount,
  'Buffer size desynchronized from record count',
);
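
A CI check along the lines described above might look like the sketch below. It assumes the WGSL source annotates each member with an `// offset N, size M` comment, as the SpatialTile listing does, and it compares those annotations with a host-side field list; it therefore verifies the annotations, not the compiler's actual layout. `HOST_FIELDS`, `declared_layout`, `host_layout`, and `check_parity` are hypothetical names.

```python
import re
import struct

WGSL_SOURCE = """
struct SpatialTile {
    min_coord:       vec2<f32>, // offset  0, size 8
    max_coord:       vec2<f32>, // offset  8, size 8
    elevation_range: vec2<f32>, // offset 16, size 8
    tile_level:      u32,       // offset 24, size 4
    _padding:        u32,       // offset 28, size 4
};
"""

MEMBER_RE = re.compile(
    r"^\s*(\w+):\s*[\w<>]+,\s*//\s*offset\s+(\d+),\s*size\s+(\d+)",
    re.MULTILINE,
)

# Host-side mirror: (field name, struct-module format), explicit pad included.
HOST_FIELDS = [
    ("min_coord", "2f"),
    ("max_coord", "2f"),
    ("elevation_range", "2f"),
    ("tile_level", "I"),
    ("_padding", "I"),
]


def declared_layout(wgsl: str):
    """Extract (name, offset, size) from the annotated WGSL members."""
    return [(m[1], int(m[2]), int(m[3])) for m in MEMBER_RE.finditer(wgsl)]


def host_layout(fields):
    """Compute (name, offset, size) for tightly packed little-endian fields."""
    layout, offset = [], 0
    for name, fmt in fields:
        size = struct.calcsize("<" + fmt)
        layout.append((name, offset, size))
        offset += size
    return layout


def check_parity(wgsl: str, fields) -> None:
    declared, host = declared_layout(wgsl), host_layout(fields)
    if not declared:
        raise ValueError("no annotated struct members found")
    if declared != host:
        raise AssertionError(f"layout mismatch: WGSL {declared} vs host {host}")


check_parity(WGSL_SOURCE, HOST_FIELDS)
```

Dropping the `_padding` entry from `HOST_FIELDS` makes `check_parity` raise, which is the failure the CI job is meant to surface.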

Backend serialization parity

Python backends emitting spatial arrays must produce bytes that match the WGSL layout exactly, because queue.writeBuffer() performs no translation. NumPy’s default contiguous layout does not insert the WGSL pad, so declare a structured dtype whose fields mirror the struct offset-for-offset and verify itemsize equals the WGSL SizeOf before upload.

python

import numpy as np

# Mirror of the WGSL SpatialTile: vec2f, vec2f, vec2f, u32, u32(pad) = 32 bytes.
tile_dtype = np.dtype([
    ("min_coord",       "<f4", (2,)),  # offset  0  (little-endian f32)
    ("max_coord",       "<f4", (2,)),  # offset  8
    ("elevation_range", "<f4", (2,)),  # offset 16
    ("tile_level",      "<u4"),        # offset 24
    ("_padding",        "<u4"),        # offset 28 -> stride 32
])
assert tile_dtype.itemsize == 32, "dtype stride must equal WGSL SizeOf"

tiles = np.zeros(1_000_000, dtype=tile_dtype)
# ...populate from GeoParquet / Arrow record batches...
gpu_bytes = tiles.tobytes()  # straight memcpy target for queue.writeBuffer()

The explicit <f4/<u4 little-endian codes are defensive: every current WebGPU target is little-endian, but pinning byte order removes one variable from cross-platform GIS pipelines that may pre-serialize on heterogeneous workers. GeoParquet and Arrow RecordBatch data are columnar, so null-free float32 and uint32 columns can map onto the SoA layout as zero-copy views, while the interleaved SpatialTile dtype needs one column-by-column copy into the structured array. Match the Arrow schema's field types to this dtype so that copy involves no casts, and do it once on the backend so the upload itself needs no host-side rebuild.

Continue in this section

Structuring uniform buffers for coordinate alignment — the per-frame transform block: mat4x4<f32> layout, dynamic-offset ring buffers, f32 precision and relative-to-center offsets, and the binding-overhead budget for interactive pan/zoom.

WebGPU compute vs render pipeline fundamentals — why compute and render stages must share a struct definition, and when to transform geometry in each.
Initializing WebGPU devices for GIS workloads — probing the buffer-size and binding limits that bound these layouts.
Browser support and fallback routing strategies — what to do where no conformant device exists to upload these buffers to.
Spatial compute shaders and geometry pipelines — the kernels that read and write the aligned buffers described here.
WebGPU framework integration and backend synchronization — streaming aligned GeoParquet/Arrow data from a Python backend into deck.gl and Cesium.
