Writing a WGSL Kernel for Point-in-Polygon Clustering

When the bins you cluster into are irregular administrative boundaries (census tracts, watershed polygons, delivery zones), a regular grid is no longer enough: each point must be tested for containment against a polygon set, and at viewport scale that test runs millions of times per frame. This page takes one stage of the asynchronous clustering pipeline to copy-pasteable depth: a compute pipeline kernel written in WGSL that runs a parallel crossing-number (ray-casting) test, assigns each point to its containing polygon, and accumulates per-polygon counts with atomics. The hard part is not the algorithm; it is making the algorithm fast and correct on SIMD hardware while respecting the buffer memory alignment rules WebGPU enforces.

The Kernel: Parallel Ray-Casting in WGSL

Each invocation evaluates a single point against the polygon set, walks the polygons in order, and stops at the first containing polygon. Containment uses the standard crossing-number test: cast a ray from the point along +x and count edge crossings; an odd count means inside.

wgsl

struct Point {
    x: f32,
    y: f32,
};

struct PolygonVertex {
    x: f32,
    y: f32,
};

@group(0) @binding(0) var<storage, read> points: array<Point>;
@group(0) @binding(1) var<storage, read> polygon_vertices: array<PolygonVertex>;
@group(0) @binding(2) var<storage, read> polygon_offsets: array<u32>;
@group(0) @binding(3) var<storage, read_write> cluster_assignments: array<u32>;
@group(0) @binding(4) var<storage, read_write> cluster_counts: array<atomic<u32>>;

fn point_in_polygon(px: f32, py: f32, poly_start: u32, poly_len: u32) -> bool {
    var inside = false;
    var j = poly_start + poly_len - 1u;            // previous vertex, wraps to last
    for (var i = poly_start; i < poly_start + poly_len; i = i + 1u) {
        let xi = polygon_vertices[i].x;
        let yi = polygon_vertices[i].y;
        let xj = polygon_vertices[j].x;
        let yj = polygon_vertices[j].y;

        // Ray-casting: does the horizontal ray at py cross edge (j -> i)?
        let intersect = ((yi > py) != (yj > py)) &&
                        (px < (xj - xi) * (py - yi) / (yj - yi) + xi);
        if (intersect) {
            inside = !inside;
        }
        j = i;
    }
    return inside;
}

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let idx = gid.x;
    if (idx >= arrayLength(&points)) { return; }     // guard the tail workgroup

    let p = points[idx];
    var assigned_cluster: u32 = 0xFFFFFFFFu;          // sentinel: unassigned

    // polygon_offsets stores [start, len] pairs; polygon count is half the length.
    let num_polys = arrayLength(&polygon_offsets) / 2u;
    for (var poly_idx: u32 = 0u; poly_idx < num_polys; poly_idx = poly_idx + 1u) {
        let start = polygon_offsets[poly_idx * 2u];
        let len = polygon_offsets[poly_idx * 2u + 1u];
        if (point_in_polygon(p.x, p.y, start, len)) {
            assigned_cluster = poly_idx;
            break;                                     // first-match priority
        }
    }

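    // Always write the result: WebGPU zero-initializes buffers, so a skipped
    // write would leave an unmatched point looking like polygon 0.
    cluster_assignments[idx] = assigned_cluster;
    if (assigned_cluster != 0xFFFFFFFFu) {
        atomicAdd(&cluster_counts[assigned_cluster], 1u);
    }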
}

A few choices are load-bearing for spatial data specifically:

Thread mapping. global_invocation_id.x maps one invocation per input point, so there is no shared-memory synchronization during the evaluation pass. The tail guard (idx >= arrayLength(&points)) is mandatory because the dispatch rounds point count up to a whole number of 256-wide workgroups.
Offset indirection. Polygons vary in vertex count, so polygon_offsets carries [start, len] pairs that resolve a polygon’s vertex span in O(1) without a separate index buffer.
First-match break. The break assumes mutually exclusive bins. For overlapping polygons, drop the break and accumulate into a bitmask or run a second pass — see the failure modes below.
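
A CPU reference makes the kernel easier to validate. The sketch below mirrors the WGSL loop in plain Python under a few assumptions: rings are lists of (x, y) tuples rather than the packed buffers, arithmetic is double precision rather than f32 (so results can differ for points very close to an edge), and the names assign_points and UNASSIGNED are illustrative rather than part of the pipeline.

```python
UNASSIGNED = 0xFFFFFFFF  # same sentinel value the kernel uses


def point_in_polygon(px, py, ring):
    """Crossing-number test mirroring the WGSL loop; ring is a list of (x, y)."""
    inside = False
    j = len(ring) - 1                      # previous vertex, wraps to last
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the +x ray from (px, py) cross edge (j -> i)?
        if (yi > py) != (yj > py) and px < (xj - xi) * (py - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def assign_points(points, rings):
    """First-match assignment: index of the first containing ring, else UNASSIGNED."""
    assignments = []
    for px, py in points:
        hit = UNASSIGNED
        for idx, ring in enumerate(rings):
            if point_in_polygon(px, py, ring):
                hit = idx
                break                      # first-match priority, as in the kernel
        assignments.append(hit)
    return assignments
```

Running assign_points over a small sample of the uploaded points and comparing against the readback is usually enough to catch packing or offset mistakes before profiling.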

Reference Implementation: Dispatch and Readback

The kernel is only useful once it is wired into an asynchronous submission so the main thread never blocks on the GPU. Record the compute pass, copy the result into a mappable buffer, submit, and resolve completion through onSubmittedWorkDone(). (Device acquisition itself — adapter request, polling, and limit negotiation for large feature sets — is covered in setting up WebGPU device polling for GIS apps.)

javascript

// `module` is the GPUShaderModule created from the WGSL source above.
const computePipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' }
});

const bindGroup = device.createBindGroup({
  layout: computePipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: pointsBuffer } },
    { binding: 1, resource: { buffer: verticesBuffer } },
    { binding: 2, resource: { buffer: offsetsBuffer } },
    { binding: 3, resource: { buffer: assignmentsBuffer } },
    { binding: 4, resource: { buffer: countsBuffer } }
  ]
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(computePipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(pointCount / 256));   // one invocation per point
pass.end();

// Copy results into a MAP_READ buffer in the same submission.
const readBuffer = device.createBuffer({
  size: assignmentsBuffer.size,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
});
encoder.copyBufferToBuffer(assignmentsBuffer, 0, readBuffer, 0, assignmentsBuffer.size);
device.queue.submit([encoder.finish()]);

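await device.queue.onSubmittedWorkDone();               // resolves asynchronously; the main thread stays free
await readBuffer.mapAsync(GPUMapMode.READ);
// Copy out of the mapped range: unmap() detaches the ArrayBuffer behind any view.
const results = new Uint32Array(readBuffer.getMappedRange()).slice();
readBuffer.unmap();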

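Note that assignmentsBuffer must be created with GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC, or the copyBufferToBuffer() call invalidates the encoder and a validation error is raised when it is finished. countsBuffer needs COPY_SRC as well if you read the counts back with a second copy, and because the kernel only ever adds to it, it must be zeroed before each dispatch (for example with encoder.clearBuffer(), which requires COPY_DST). If a viewport’s feature or vertex count pushes past the default storage-binding ceiling, raise it during device init as described in configuring WebGPU adapter limits for large GeoJSON.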

Parameter and Configuration Reference

Every tunable in the kernel and dispatch, with spatial-workload guidance:

Parameter	Value here	Guidance
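`@workgroup_size`	`256`	Multiple of the warp/wavefront width (32 on NVIDIA, 32 or 64 on AMD) and equal to WebGPU’s default per-workgroup invocation limit; 256 balances occupancy against register pressure for the scalar predicate work. Profile 64/128/256 per device.
Dispatch count	`ceil(pointCount / 256)`	One invocation per point; always pair with the `idx >= arrayLength` tail guard. The default limit of 65,535 workgroups per dimension caps a 1-D dispatch at about 16.7 million points.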
Unassigned sentinel	`0xFFFFFFFFu`	Reserve the max `u32` so a real polygon index can never collide with “not contained”.
`polygon_offsets` stride	2 × `u32` per polygon	`[start, len]` pairs; total polygon count is `arrayLength / 2`.
Vertex struct stride	8 bytes (`2 × f32`)	Meets WebGPU’s 4-byte storage-scalar alignment; pad to 16 bytes if you also store per-vertex extents as `vec4<f32>`.
Boundary epsilon	`1e-6` (optional)	Nudge points off exact edges to make ray-casting deterministic; size relative to coordinate units (degrees vs. meters) and magnitude, since f32 cannot resolve `1e-6` at large coordinate values.
AABB pre-filter	off by default	Append `min_x,min_y,max_x,max_y` per polygon and test before the edge loop — eliminates 60-80% of edge tests on sparse data.

Optimization: Branch Divergence and Bounding-Box Pre-Filtering

The crossing-number test is branch-heavy, and divergent branches stall SIMD lanes: within a wavefront, lanes that take the intersect branch wait for lanes that do not. Three mitigations matter most for polygon clustering:

Bounding-box pre-filter. Store each polygon’s axis-aligned bounding box alongside its offset, and reject points outside the box before entering the edge loop. For sparse point distributions this removes the large majority of expensive per-edge work and is the single highest-leverage optimization. Tuning the workgroup shape around this kind of early-out predicate is the same problem covered in optimizing workgroup sizes for vector geometry filtering.
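Shared-memory vertex caching. For large polygon sets (>10k vertices) repeatedly read by a workgroup, stage hot vertices into var<workgroup> shared memory and synchronize with workgroupBarrier(). The barrier must be reached in uniform control flow, so replace the early-return tail guard with a flag that keeps out-of-range invocations alive until after the barrier. This cuts global-memory bandwidth at the cost of occupancy, so measure before committing.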
Tile-parallel evaluation. When polygons partition cleanly by tile, assign one workgroup per tile via workgroup_id so each group only tests the polygons that overlap its tile — see using workgroup_id for parallel tile processing.
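
The bounding boxes for the pre-filter can be produced on the backend while packing. The sketch below assumes the flat vertex and [start, count] offset layout used by this page; the function names compute_aabbs and outside_aabb, and the choice to emit four floats per polygon in a separate flat list, are illustrative assumptions rather than a fixed buffer format.

```python
def compute_aabbs(vertices, offsets):
    """Return a flat [min_x, min_y, max_x, max_y] list, four floats per polygon.

    vertices: flat [x0, y0, x1, y1, ...]; offsets: flat [start, count, ...]
    with start and count measured in vertices, matching the packed layout.
    """
    aabbs = []
    for k in range(0, len(offsets), 2):
        start, count = offsets[k], offsets[k + 1]
        xs = vertices[2 * start : 2 * (start + count) : 2]       # every x in the span
        ys = vertices[2 * start + 1 : 2 * (start + count) : 2]   # every y in the span
        aabbs.extend([min(xs), min(ys), max(xs), max(ys)])
    return aabbs


def outside_aabb(px, py, aabbs, poly_idx):
    """Cheap reject: True when the point cannot be inside polygon poly_idx."""
    min_x, min_y, max_x, max_y = aabbs[4 * poly_idx : 4 * poly_idx + 4]
    return px < min_x or px > max_x or py < min_y or py > max_y
```

The same four comparisons, placed before the call to point_in_polygon in the kernel, skip the edge loop for any polygon whose box excludes the point.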

Failure Modes Specific to PIP Clustering

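Boundary precision: points exactly on an edge. The half-open comparison (yi > py) != (yj > py) handles a ray that passes through a vertex consistently, but a point lying exactly on an edge is classified by the rounding of the interpolated x, and two polygons sharing that edge traverse it in opposite directions. The result is repeatable on a given device but not consistent between neighbors, so a point on a shared edge may test inside neither polygon or both. Detect it by diffing GPU assignments against a CPU reference and looking for points whose coordinates coincide with polygon vertices or edges. Fix by nudging with a small epsilon scaled to the coordinate magnitude (in f32, px += 1e-6 is a no-op once |px| reaches 32, which covers most longitudes and typical projected meter coordinates) or, for production GIS accuracy, switching to a winding-number test that handles boundaries consistently.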
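
Whether a given epsilon does anything in f32 can be checked offline. The sketch below emulates f32 rounding with the standard struct module; the helper names to_f32, f32_spacing, and nudge_survives are hypothetical, and the check covers only the addition, not the rest of the ray-casting arithmetic.

```python
import math
import struct


def to_f32(value):
    """Round a Python float (f64) to the nearest f32, as a GPU f32 would store it."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def f32_spacing(coord):
    """Gap between adjacent f32 values near coord (normal, nonzero values only)."""
    _, exponent = math.frexp(to_f32(coord))
    return 2.0 ** (exponent - 24)


def nudge_survives(coord, eps):
    """True if adding eps in f32 arithmetic actually changes coord."""
    base = to_f32(coord)
    return to_f32(base + to_f32(eps)) != base
```

A reasonable starting point is an epsilon of at least f32_spacing at the largest coordinate magnitude in the viewport, then tightening it against the CPU reference.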

Atomic counter overflow / contention. atomicAdd serializes when many invocations target the same dense polygon, throttling throughput; and if a count exceeds the u32 counter’s capacity (4,294,967,295) it wraps silently. Detect contention with timestampWrites (available when the device is created with the timestamp-query feature) showing the reduce stage dominating. Mitigate by accumulating into per-workgroup partial counts in shared memory and adding once per workgroup, or by deferring counts to a prefix-sum pass over cluster_assignments.
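
The effect of per-workgroup partial counts can be estimated on the CPU before rewriting the kernel. The sketch below is a model, not GPU code: a dict stands in for workgroup memory, the function name count_with_partials is hypothetical, and it assumes assignments are ordered the way invocations are grouped into workgroups.

```python
def count_with_partials(assignments, num_polys, workgroup_size=256, unassigned=0xFFFFFFFF):
    """Model per-workgroup partial counts merged into a global histogram.

    Returns (counts, global_adds), where global_adds is the number of
    global atomicAdd calls the partial-count scheme would issue.
    """
    counts = [0] * num_polys
    global_adds = 0
    for base in range(0, len(assignments), workgroup_size):
        partial = {}                                   # stands in for workgroup memory
        for poly in assignments[base : base + workgroup_size]:
            if poly != unassigned:
                partial[poly] = partial.get(poly, 0) + 1
        for poly, n in partial.items():                # one global add per touched polygon
            counts[poly] += n
            global_adds += 1
    return counts, global_adds
```

Comparing global_adds with the number of assigned points (what the one-add-per-point kernel issues) gives a rough sense of how much contention the change could remove for a given dataset; the returned counts also serve as a check on the GPU counters.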

Overlapping polygons mis-binned. The first-match break silently assigns a point to whichever polygon appears first in the buffer when bins overlap (e.g. nested zones, or topology errors in the source data). Detect by checking for known-overlapping features in the dataset. Fix by removing the break and writing a per-point bitmask of all containing polygons, or by ordering polygons so the intended priority wins.
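
One inexpensive way to look for overlapping features is to compare bounding boxes first. The sketch below assumes a flat list of [min_x, min_y, max_x, max_y] per polygon (an illustrative layout) and a hypothetical function name; it returns candidate pairs only, since overlapping boxes do not prove that the polygons themselves overlap.

```python
def overlap_candidates(aabbs):
    """Return (i, j) polygon index pairs whose bounding boxes overlap with positive area."""
    n = len(aabbs) // 4
    pairs = []
    for i in range(n):
        ax0, ay0, ax1, ay1 = aabbs[4 * i : 4 * i + 4]
        for j in range(i + 1, n):
            bx0, by0, bx1, by1 = aabbs[4 * j : 4 * j + 4]
            # Strict inequalities: boxes that merely touch along an edge are not reported.
            if ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic in the polygon count, so for large sets a spatial index on the backend is the more realistic choice; the candidates it returns still need an exact geometry test.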

Self-intersecting or unclosed rings. GeoJSON rings that are not explicitly closed, or that self-intersect, break the even-odd assumption and produce holes or inverted containment. Detect during packing by verifying the first and last vertex coincide. Fix on the backend by closing rings and running a validity repair before upload.
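
A packing-time check along these lines can flag bad rings before they reach the GPU. This is a minimal sketch with hypothetical names: it reports unclosed rings and proper edge crossings using a quadratic pairwise test, and it does not catch collinear overlaps or repair anything.

```python
def _orient(a, b, c):
    """Twice the signed area of triangle abc; the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _cross_properly(p1, p2, p3, p4):
    """True if segments p1p2 and p3p4 cross at a point interior to both."""
    d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def ring_issues(ring):
    """Return a list of problems found in one ring of (x, y) pairs."""
    issues = []
    closed = len(ring) > 1 and tuple(ring[0]) == tuple(ring[-1])
    if not closed:
        issues.append("unclosed")
    # Drop the duplicate closing vertex so every edge appears exactly once.
    pts = [tuple(p[:2]) for p in (ring[:-1] if closed else ring)]
    if len(pts) < 3:
        issues.append("too_few_vertices")
        return issues
    n = len(pts)
    edges = [(pts[i], pts[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if _cross_properly(*edges[i], *edges[j]):
                issues.append("self_intersecting")
                return issues
    return issues
```

Edges that only share an endpoint produce a zero orientation and are not reported, so adjacent edges need no special casing.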

Backend / Python Interop

GPU buffers must be contiguous and aligned, so flatten hierarchical GeoJSON or Shapefile geometry on the Python side before upload. Interleave x, y into one f32 array and emit matching [start, len] offset pairs; the same little-endian, tightly packed layout that satisfies the kernel is the layout described in memory alignment for spatial data buffers.

python

# Flatten GeoJSON polygon rings for WebGPU upload.
import numpy as np

def pack_polygons(geojson_features):
    vertices = []
    offsets = []
    current_idx = 0
    for feat in geojson_features:
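        # Assumes Polygon geometry; only the exterior ring is packed (holes are ignored).
        ring = feat["geometry"]["coordinates"][0]
        if ring[0] != ring[-1]:                      # close the ring if open
            ring = ring + [ring[0]]
        for x, y, *_ in ring:                        # tolerate an optional z coordinate
            vertices.extend([x, y])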
        offsets.extend([current_idx, len(ring)])     # [start, vertex_count]
        current_idx += len(ring)
    return (
        np.array(vertices, dtype=np.float32),        # -> verticesBuffer
        np.array(offsets, dtype=np.uint32),          # -> offsetsBuffer
    )
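
If the packed data travels to the browser as raw bytes, the byte order and stride can be made explicit at serialization time. The sketch below uses only the standard struct module; the function name to_le_bytes is hypothetical, and how the bytes are transported (HTTP response, file, WebSocket) is left open.

```python
import struct


def to_le_bytes(vertices, offsets):
    """Serialize packed polygon data as little-endian f32 / u32 byte strings."""
    vertex_bytes = struct.pack("<%df" % len(vertices), *vertices)   # 4 bytes per float
    offset_bytes = struct.pack("<%dI" % len(offsets), *offsets)     # 4 bytes per u32
    return vertex_bytes, offset_bytes
```

Each vertex occupies 8 bytes and each [start, count] pair 8 bytes, which matches the strides the kernel expects. With the NumPy arrays returned by pack_polygons, array.astype("<f4").tobytes() should produce the same vertex bytes without the per-element unpacking.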

When the data originates from GeoPandas, Dask, or a GeoParquet store, reproject all features to a single planar CRS before packing — mixing geographic (degrees) and projected (meters) coordinates corrupts both the ray-casting math and the epsilon you chose for boundary handling. The per-polygon counts this kernel produces typically feed a downstream reduction; partitioning and fragmentation concerns for that stage are covered in reducing GPU memory fragmentation during spatial aggregation.

Async Dispatch Patterns for Spatial Clustering — the submission graph this kernel slots into
Geometry Filtering with WGSL Compute Shaders — the upstream stage that compacts features before containment testing
Memory Alignment for Spatial Data Buffers — the packing and stride rules the vertex buffer must follow
Configuring WebGPU Adapter Limits for Large GeoJSON — raising storage-binding ceilings for dense polygon sets
Browser Support & Fallback Routing Strategies — CPU/WebGL paths when WebGPU is unavailable or the device is lost
