WebGL Performance Power-Up: Three.js, WASM, SIMD, and Lock-Free Concurrency
Building a 50K particle physics system that progressively layers WASM, SIMD, and Web Workers on top of Three.js. Real benchmarks, real code, real trade-offs.
PALLAV // 22 MIN READ
I built a 50,000-particle physics simulation that runs at 60fps in the browser. It started as plain JavaScript and Three.js. Then I hit a wall at 20K particles. This post walks through every optimization I applied — WASM, SIMD, Web Workers — showing the exact code, the benchmarks, and when each technique actually matters.
The live demo lets you switch between JS, WASM, WASM+SIMD, and Web Worker modes in real-time and see the FPS difference.
What We're Building
A particle system where 50K particles fall under gravity, bounce off walls, and render as points via Three.js. Same simulation, four different physics backends — each one faster than the last.
The Baseline: Three.js + JavaScript
Three.js handles the GPU side — scene graph, camera, and WebGL draw calls. The physics runs on the CPU: each frame we update 50K particles (position, velocity, gravity, boundary collisions) and push the results to a BufferGeometry.
Each particle is 6 floats: [x, y, z, vx, vy, vz]. That's 300K floats, about 1.2MB of data we're touching every frame.
const PARTICLE_COUNT = 50_000
const FLOATS_PER_PARTICLE = 6 // x, y, z, vx, vy, vz
const BOUNDS = 50.0
// Flat Float32Array — no objects, no GC pressure
const particleData = new Float32Array(PARTICLE_COUNT * FLOATS_PER_PARTICLE)
function initParticles(data) {
for (let i = 0; i < PARTICLE_COUNT; i++) {
const base = i * FLOATS_PER_PARTICLE
data[base + 0] = (Math.random() - 0.5) * BOUNDS * 1.5 // x
data[base + 1] = Math.random() * BOUNDS // y
data[base + 2] = (Math.random() - 0.5) * BOUNDS * 1.5 // z
data[base + 3] = (Math.random() - 0.5) * 20 // vx
data[base + 4] = (Math.random() - 0.5) * 10 // vy
data[base + 5] = (Math.random() - 0.5) * 20 // vz
}
}

The physics loop is straightforward — gravity, Euler integration, boundary reflection:
function stepParticlesJS(data, dt, gravity, bounds, damping) {
for (let i = 0; i < PARTICLE_COUNT; i++) {
const base = i * FLOATS_PER_PARTICLE
// Gravity
data[base + 4] += gravity * dt
// Integrate velocity -> position
data[base + 0] += data[base + 3] * dt // x += vx * dt
data[base + 1] += data[base + 4] * dt // y += vy * dt
data[base + 2] += data[base + 5] * dt // z += vz * dt
// Boundary reflection
for (let axis = 0; axis < 3; axis++) {
const posIdx = base + axis
const velIdx = base + 3 + axis
if (data[posIdx] > bounds) {
data[posIdx] = bounds
data[velIdx] = -data[velIdx] * damping
} else if (data[posIdx] < -bounds) {
data[posIdx] = -bounds
data[velIdx] = -data[velIdx] * damping
}
}
}
}

The Three.js rendering side copies positions from our flat array into a BufferGeometry each frame:
const geometry = new THREE.BufferGeometry()
const positionBuffer = new Float32Array(PARTICLE_COUNT * 3)
geometry.setAttribute('position', new THREE.BufferAttribute(positionBuffer, 3))
const material = new THREE.PointsMaterial({
size: 0.4,
color: 0x4a7fd4,
transparent: true,
opacity: 0.8,
sizeAttenuation: true,
})
const points = new THREE.Points(geometry, material)
scene.add(points)
// Each frame: extract x,y,z from interleaved data
function syncPositions(particleData, positionBuffer) {
for (let i = 0; i < PARTICLE_COUNT; i++) {
const src = i * FLOATS_PER_PARTICLE
const dst = i * 3
positionBuffer[dst] = particleData[src] // x
positionBuffer[dst + 1] = particleData[src + 1] // y
positionBuffer[dst + 2] = particleData[src + 2] // z
}
geometry.attributes.position.needsUpdate = true
}

At 50K particles, the JS physics step takes about 0.23ms per frame on an M1 MacBook Pro (Chrome) — well within budget. But this scales linearly: at 500K particles it's 2.3ms, at 2M it's 9ms, and you start eating into your 16ms frame budget. The optimization techniques below become essential as scene complexity grows.
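Before reaching for any of them, a cheap mitigation is to degrade gracefully: measure the physics step each frame and shrink the workload before frames start dropping. A minimal sketch — the budget constants and adaptParticleCount are hypothetical, not part of the demo:

```javascript
// Hypothetical budget guard: if the measured physics step starts eating
// the frame budget, shrink the active particle count instead of dropping frames.
const PHYSICS_BUDGET_MS = 8 // leave the other half of a 60fps frame for rendering
const MIN_COUNT = 10_000

let activeCount = 50_000

function adaptParticleCount(physicsTimeMs) {
  if (physicsTimeMs > PHYSICS_BUDGET_MS) {
    // back off by 10% per over-budget frame, but never below the floor
    activeCount = Math.max(MIN_COUNT, Math.floor(activeCount * 0.9))
  }
  return activeCount
}
```

Feeding the `performance.now()` delta from the physics step into a guard like this keeps the frame rate stable at the cost of particle count.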
Level 1: Rust + WASM
The physics loop is pure arithmetic on a flat array — exactly the kind of work WASM was designed for. No DOM, no async, no string manipulation. Just math on contiguous memory.
The Rust Implementation
The WASM module takes a mutable slice of the same flat Float32Array and does the physics in Rust:
use wasm_bindgen::prelude::*;
/// Update particle positions and velocities in-place.
/// Layout per particle: [x, y, z, vx, vy, vz] (6 floats)
#[wasm_bindgen]
pub fn step_particles(
data: &mut [f32],
dt: f32,
gravity: f32,
bounds: f32,
damping: f32,
) {
let floats_per_particle = 6;
let count = data.len() / floats_per_particle;
for i in 0..count {
let base = i * floats_per_particle;
// Gravity
data[base + 4] += gravity * dt;
// Integrate velocity -> position
data[base + 0] += data[base + 3] * dt;
data[base + 1] += data[base + 4] * dt;
data[base + 2] += data[base + 5] * dt;
// Boundary reflection with damping
for axis in 0..3 {
let pos_idx = base + axis;
let vel_idx = base + 3 + axis;
if data[pos_idx] > bounds {
data[pos_idx] = bounds;
data[vel_idx] = -data[vel_idx] * damping;
} else if data[pos_idx] < -bounds {
data[pos_idx] = -bounds;
data[vel_idx] = -data[vel_idx] * damping;
}
}
}
}

The Rust code is structurally identical to the JavaScript version. The performance difference comes from: (1) no JIT warmup — WASM executes at near-native speed immediately, (2) no type checks — f32 is always f32, and (3) the Rust compiler's optimizer (LLVM) is more aggressive than V8's TurboFan for tight numeric loops.
Build and Integration
[package]
name = "webgl-particle-demo"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["cdylib"]
[dependencies]
wasm-bindgen = "0.2"
[profile.release]
opt-level = 3
lto = true

# Build with wasm-pack
wasm-pack build --target web --release

Calling it from JavaScript is seamless thanks to wasm-bindgen:
import init, { step_particles } from '../pkg/webgl_particle_demo.js'
await init()
// Same data, same parameters — just call the WASM function
function animate() {
const start = performance.now()
step_particles(particleData, 1/60, -9.81, BOUNDS, 0.7)
const physicsTime = performance.now() - start
syncPositions(particleData, positionBuffer)
renderer.render(scene, camera)
requestAnimationFrame(animate)
}

animate()

Here's the honest result: for this particular loop (AoS layout, 50K particles), scalar WASM was not faster than JavaScript on Chrome/M1. V8's TurboFan JIT is extremely good at optimizing tight numeric loops on typed arrays. The real payoff comes later — when we combine WASM with SIMD and a data layout designed for vectorization.
When NOT to Use WASM
- Your code is I/O bound (waiting on network, disk) rather than CPU bound
- Operations already run within budget in JavaScript (< 16ms per frame)
- Code runs rarely (one-time initialization, event handlers)
- The overhead of copying data between JS and WASM exceeds the gains — for small arrays (< 1000 elements), JS is often faster due to marshalling cost
Level 2: SIMD — Processing 4 Floats Per Instruction
SIMD (Single Instruction, Multiple Data) lets the CPU operate on 4 floats simultaneously via 128-bit registers. Instead of adding positions one at a time, we load 4 values, add 4 values, store 4 values — all in a single instruction.
WASM SIMD maps directly to hardware SIMD on both x86 (SSE4.1) and ARM (NEON). The browser handles the translation.
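To make the lane model concrete, here is the semantics of an f32x4 operation sketched in plain JavaScript. This is a model only — the real intrinsic is a single hardware instruction, not a per-lane loop:

```javascript
// Plain-JS model of 128-bit f32x4 lanes: one SIMD instruction applies the
// same operation to all four floats at once.
const f32x4 = {
  splat: (x) => [x, x, x, x],                 // broadcast one scalar to 4 lanes
  add: (a, b) => a.map((v, i) => v + b[i]),   // lane-wise add
  mul: (a, b) => a.map((v, i) => v * b[i]),   // lane-wise multiply
}

// position += velocity * dt for four values at once
const pos = [0, 1, 2, 3]
const vel = [10, 10, 10, 10]
const dt = f32x4.splat(0.5)
const updated = f32x4.add(pos, f32x4.mul(vel, dt)) // [5, 6, 7, 8]
```

The splat/mul/add shape here is exactly the pattern the Rust intrinsics below use (f32x4_splat, f32x4_mul, f32x4_add).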
SIMD Particle Physics
The velocity integration (position += velocity * dt) vectorizes well. Boundary checks don't — they involve branching per-particle. So we split the work: SIMD for integration, scalar for boundaries.
#[cfg(target_arch = "wasm32")]
use std::arch::wasm32::*;
#[wasm_bindgen]
pub fn step_particles_simd(
data: &mut [f32],
dt: f32,
gravity: f32,
bounds: f32,
damping: f32,
) {
let floats_per_particle = 6;
let count = data.len() / floats_per_particle;
// SIMD pass: velocity integration
// Processes particles in pairs (2 particles = 12 floats = 3 x v128)
let vdt = f32x4_splat(dt);
let pairs = count / 2;
for pair in 0..pairs {
let base = pair * 12;
// Apply gravity to both particles' vy
data[base + 4] += gravity * dt;
data[base + 10] += gravity * dt;
// SIMD: update particle 0's x/y/z in one shot.
// pos = [x0, y0, z0, vx0], vel = [vx0, vy0, vz0, x1]
unsafe {
let pos = v128_load(data.as_ptr().add(base) as *const v128);
let vel = v128_load(data.as_ptr().add(base + 3) as *const v128);
let updated = f32x4_add(pos, f32x4_mul(vel, vdt));
// Lane 3 computed vx0 + x1*dt, which would corrupt vx0 —
// restore the original vx0 before storing.
let fixed = f32x4_replace_lane::<3>(updated, f32x4_extract_lane::<3>(pos));
v128_store(data.as_mut_ptr().add(base) as *mut v128, fixed);
}
// Scalar: particle 1's position components
data[base + 6] += data[base + 9] * dt; // x1 += vx1 * dt
data[base + 7] += data[base + 10] * dt; // y1 += vy1 * dt
data[base + 8] += data[base + 11] * dt; // z1 += vz1 * dt
}
// Tail: if the count is odd, integrate the last particle with scalar code
if count % 2 == 1 {
let base = (count - 1) * floats_per_particle;
data[base + 4] += gravity * dt;
data[base + 0] += data[base + 3] * dt;
data[base + 1] += data[base + 4] * dt;
data[base + 2] += data[base + 5] * dt;
}
// Scalar pass: boundary reflection (branching doesn't vectorize)
for i in 0..count {
let base = i * floats_per_particle;
for axis in 0..3 {
let pos_idx = base + axis;
let vel_idx = base + 3 + axis;
if data[pos_idx] > bounds {
data[pos_idx] = bounds;
data[vel_idx] = -data[vel_idx] * damping;
} else if data[pos_idx] < -bounds {
data[pos_idx] = -bounds;
data[vel_idx] = -data[vel_idx] * damping;
}
}
}
}

SIMD Matrix Multiply
Where SIMD really shines is linear algebra. A 4x4 matrix multiply in scalar code is 64 multiplications and 48 additions. With SIMD, each output column is computed with 4 multiply+add operations on full 128-bit vectors:
/// SIMD 4x4 matrix multiply (column-major, matches WebGL/Three.js convention).
#[wasm_bindgen]
pub fn simd_mat4_multiply(a: &[f32], b: &[f32]) -> Vec<f32> {
assert!(a.len() >= 16 && b.len() >= 16);
let mut out = vec![0.0f32; 16];
unsafe {
let a_col0 = v128_load(a.as_ptr().add(0) as *const v128);
let a_col1 = v128_load(a.as_ptr().add(4) as *const v128);
let a_col2 = v128_load(a.as_ptr().add(8) as *const v128);
let a_col3 = v128_load(a.as_ptr().add(12) as *const v128);
for col in 0..4 {
let b_base = col * 4;
let b0 = f32x4_splat(b[b_base]);
let b1 = f32x4_splat(b[b_base + 1]);
let b2 = f32x4_splat(b[b_base + 2]);
let b3 = f32x4_splat(b[b_base + 3]);
// out_col = a_col0*b0 + a_col1*b1 + a_col2*b2 + a_col3*b3
let r = f32x4_add(
f32x4_add(f32x4_mul(a_col0, b0), f32x4_mul(a_col1, b1)),
f32x4_add(f32x4_mul(a_col2, b2), f32x4_mul(a_col3, b3)),
);
let tmp: [f32; 4] = std::mem::transmute(r);
out[col * 4..col * 4 + 4].copy_from_slice(&tmp);
}
}
out
}

This matters when you're transforming thousands of vertices per frame. Batch-multiplying 10K matrices with SIMD runs 3-5x faster than scalar WASM.
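For reference, the scalar baseline this gets measured against looks like the following in JS. The function name and the pre-allocated output parameter are my additions; the column-major convention matches the Rust version and WebGL:

```javascript
// Scalar column-major 4x4 multiply: out[col][row] = sum_k a[k][row] * b[col][k].
// Accepts a pre-allocated `out` so benchmarks measure math, not allocation.
function mat4Multiply(a, b, out = new Float32Array(16)) {
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0
      for (let k = 0; k < 4; k++) {
        sum += a[k * 4 + row] * b[col * 4 + k]
      }
      out[col * 4 + row] = sum
    }
  }
  return out
}
```

Note the 64 multiplies and 48 adds in the triple loop — exactly the work the SIMD version collapses into 16 vector multiplies and 12 vector adds.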
Building with SIMD
# Enable SIMD in the compiled WASM output
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web --release

Memory Alignment Matters
SIMD loads (v128_load) work best on 16-byte-aligned memory. Misaligned loads still work on modern hardware but may be slower on some platforms. For hot data structures, enforce alignment:
#[repr(align(16))]
struct AlignedParticles {
data: Vec<f32>,
}
// For the particle system, Float32Array is already 4-byte aligned,
// and WASM v128_load handles unaligned access correctly.
// Explicit alignment matters more for custom SIMD data structures.

When SIMD Doesn't Help
SIMD is a poor fit when: data sets are small (< 1000 elements, overhead dominates), operations involve heavy branching (if/else per element), or memory access is unpredictable (random indices, pointer chasing). Our boundary checks are a good example -- they branch per particle, so they stay scalar.
Browser Support
| Browser | WASM SIMD Support | Minimum Version |
|---|---|---|
| Chrome | Yes | 91+ (May 2021) |
| Firefox | Yes | 89+ (June 2021) |
| Safari | Yes | 16.4+ (March 2023) |
| Edge | Yes | 91+ (May 2021) |
Feature Detection
import { simd } from 'wasm-feature-detect'
const simdSupported = await simd()
if (simdSupported) {
// Load SIMD-enabled WASM build
const { step_particles_simd } = await import('./pkg/simd/webgl_particle_demo.js')
} else {
// Fallback to scalar WASM
const { step_particles } = await import('./pkg/scalar/webgl_particle_demo.js')
}

Level 3: Web Workers + SharedArrayBuffer
WASM+SIMD makes the physics faster per-core, but we're still single-threaded. Modern devices have 4-8+ cores sitting idle. Web Workers let us split the particle array across cores — each worker updates its range, and the main thread just reads the shared memory for rendering.
The Architecture
The key insight: each particle's physics is independent. Worker 1 updates particles 0-12,499, Worker 2 updates 12,500-24,999, and so on. No locks needed because the memory ranges don't overlap.
// Allocate shared memory (accessible by all workers simultaneously)
const sharedBuffer = new SharedArrayBuffer(
PARTICLE_COUNT * FLOATS_PER_PARTICLE * Float32Array.BYTES_PER_ELEMENT
)
const particleData = new Float32Array(sharedBuffer)
// Spawn workers, each owns a non-overlapping range
const WORKER_COUNT = Math.min(navigator.hardwareConcurrency || 2, 4)
const particlesPerWorker = Math.floor(PARTICLE_COUNT / WORKER_COUNT)
for (let i = 0; i < WORKER_COUNT; i++) {
const worker = new Worker(new URL('./particle-worker.js', import.meta.url))
const start = i * particlesPerWorker
const end = i === WORKER_COUNT - 1 ? PARTICLE_COUNT : (i + 1) * particlesPerWorker
// Pass the shared buffer — no copying, workers see the same memory
worker.postMessage({ type: 'init', buffer: sharedBuffer, start, end })
}

The worker script (particle-worker.js) wraps the same shared buffer and steps only its assigned range:

const FLOATS_PER_PARTICLE = 6
const GRAVITY = -9.81
const BOUNDS = 50.0
const DAMPING = 0.7
const DT = 1 / 60
let particles = null
let startIdx = 0
let endIdx = 0
self.onmessage = (e) => {
if (e.data.type === 'init') {
// Wrap the shared buffer — same memory, no copy
particles = new Float32Array(e.data.buffer)
startIdx = e.data.start
endIdx = e.data.end
tick()
}
}
function tick() {
for (let i = startIdx; i < endIdx; i++) {
const base = i * FLOATS_PER_PARTICLE
particles[base + 4] += GRAVITY * DT
particles[base + 0] += particles[base + 3] * DT
particles[base + 1] += particles[base + 4] * DT
particles[base + 2] += particles[base + 5] * DT
for (let axis = 0; axis < 3; axis++) {
const posIdx = base + axis
const velIdx = base + 3 + axis
if (particles[posIdx] > BOUNDS) {
particles[posIdx] = BOUNDS
particles[velIdx] = -particles[velIdx] * DAMPING
} else if (particles[posIdx] < -BOUNDS) {
particles[posIdx] = -BOUNDS
particles[velIdx] = -particles[velIdx] * DAMPING
}
}
}
setTimeout(tick, 16) // ~60fps
}

The render loop on the main thread just reads from the shared buffer — the workers are updating it continuously in the background:
function animate() {
// Workers are writing positions in the background.
// We just read from the same SharedArrayBuffer — zero copy.
syncPositions(particleData, positionBuffer)
geometry.attributes.position.needsUpdate = true
renderer.render(scene, camera)
requestAnimationFrame(animate)
}

SharedArrayBuffer Requires Cross-Origin Isolation
Your server must set these headers, otherwise SharedArrayBuffer will be undefined:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
This can break third-party iframes and scripts that don't support CORP. Test thoroughly.
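During development the headers can be attached by almost any server. A minimal middleware-style sketch (the wrapper name is mine, and the actual file serving is elided — the point is only that every response carries the two headers):

```javascript
// The two headers required for cross-origin isolation (and thus
// SharedArrayBuffer). Attach them to the document AND its scripts/workers.
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
}

// Wrap any Node/Express-style (req, res) handler so every response
// is served with the isolation headers set.
function withIsolationHeaders(handler) {
  return (req, res) => {
    for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
      res.setHeader(name, value)
    }
    return handler(req, res)
  }
}
```

After deploying, check `crossOriginIsolated` in the page console — it must be `true` before SharedArrayBuffer will exist.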
When Workers Need the Same Data: Atomics
Our particle system doesn't need synchronization because workers own non-overlapping ranges. But what about shared state — a collision counter, a global force accumulator, a spatial hash grid? That's where Atomics comes in.
// Shared state array (separate from particle data)
const stateBuffer = new SharedArrayBuffer(16 * Int32Array.BYTES_PER_ELEMENT)
const state = new Int32Array(stateBuffer)
// Worker: atomically increment collision counter
Atomics.add(state, 0, 1) // index 0 = collision count
// Worker: CAS (Compare-And-Swap) for more complex updates
let old, updated
do {
old = Atomics.load(state, 1)
updated = old + computeForce()
} while (Atomics.compareExchange(state, 1, old, updated) !== old)
// A coordinating worker: block until index 2 != 0. (Browsers disallow
// blocking Atomics.wait on the main thread — use Atomics.waitAsync there.)
Atomics.wait(state, 2, 0)
// Worker: signal main thread
Atomics.store(state, 2, 1)
Atomics.notify(state, 2) // wake up the waiting thread

Start Here
JavaScript's built-in Atomics API covers 90% of multithreading use cases. It's simple, safe, and doesn't require WASM. Only reach for advanced lock-free techniques when profiling shows contention is the bottleneck.
Browser Support for SharedArrayBuffer
| Browser | Support | Requirements |
|---|---|---|
| Chrome | 68+ | Cross-Origin Isolation headers |
| Firefox | 79+ | Cross-Origin Isolation headers |
| Safari | 15.2+ | Cross-Origin Isolation headers |
| Edge | 79+ | Cross-Origin Isolation headers |
import { simd, threads, bulkMemory } from 'wasm-feature-detect'
async function detectCapabilities() {
const features = {
simd: await simd(),
threads: await threads(),
bulkMemory: await bulkMemory(),
sharedArrayBuffer: typeof SharedArrayBuffer !== 'undefined',
crossOriginIsolated: crossOriginIsolated ?? false,
}
console.table(features)
if (!features.sharedArrayBuffer || !features.crossOriginIsolated) {
console.warn('Multi-threading unavailable. Falling back to single-thread WASM.')
}
return features
}

Level 4: Lock-Free Concurrency (When You Actually Need It)
You Probably Don't Need This
Lock Striping is an advanced technique for high-contention scenarios -- many workers frequently competing for the same resources. For particle systems, non-overlapping ranges (Level 3) are almost always sufficient. Only proceed if profiling shows lock contention is your bottleneck.
But sometimes workers do need shared mutable state — a spatial hash grid for collision detection, a shared resource pool, a concurrent command buffer. When Atomics.add isn't enough and a single global lock creates a bottleneck, Lock Striping divides the lock into N independent stripes.
How Lock Striping Works
Instead of one lock protecting all shared resources, you create a table of locks. Each resource maps to a stripe via resource_index % stripe_count. Workers only block each other when they happen to access the same stripe — which, with enough stripes, is rare.
use std::sync::atomic::{AtomicI32, Ordering};
#[wasm_bindgen]
pub struct StripedLockTable {
locks: Vec<AtomicI32>,
stripe_count: usize,
}
#[wasm_bindgen]
impl StripedLockTable {
#[wasm_bindgen(constructor)]
pub fn new(stripe_count: usize) -> Self {
let mut locks = Vec::with_capacity(stripe_count);
for _ in 0..stripe_count {
locks.push(AtomicI32::new(0)); // 0 = unlocked
}
Self { locks, stripe_count }
}
/// Map a resource index to a stripe.
#[wasm_bindgen(js_name = stripeFor)]
pub fn stripe_for(&self, resource_index: usize) -> usize {
resource_index % self.stripe_count
}
/// Try to acquire using CAS. Returns true if lock acquired.
#[wasm_bindgen(js_name = tryAcquire)]
pub fn try_acquire(&self, stripe: usize) -> bool {
if stripe >= self.stripe_count { return false }
// Atomic CAS: swap 0 (unlocked) -> 1 (locked)
self.locks[stripe]
.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
.is_ok()
}
/// Release the stripe lock.
#[wasm_bindgen]
pub fn release(&self, stripe: usize) {
if stripe < self.stripe_count {
self.locks[stripe].store(0, Ordering::Release);
}
}
}

Using it from JavaScript, with retries and exponential backoff:

import init, { StripedLockTable } from './pkg/webgl_particle_demo.js'
await init()
const locks = new StripedLockTable(16) // 16 stripes
async function accessSharedResource(resourceIndex) {
const stripe = locks.stripeFor(resourceIndex)
const MAX_RETRIES = 100
for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
if (locks.tryAcquire(stripe)) {
try {
// Safe to read/write the shared resource
updateSpatialHashCell(resourceIndex)
} finally {
locks.release(stripe)
}
return
}
// Exponential backoff: 1ms, 2ms, 4ms... max 50ms
await new Promise(r => setTimeout(r, Math.min(2 ** attempt, 50)))
}
console.error(`Failed to acquire stripe ${stripe} after ${MAX_RETRIES} retries`)
}

Benchmarks
Benchmark Methodology
All benchmarks run in-browser using performance.now(). Each test does 100 warmup iterations (to stabilize JIT), then 1000 timed iterations (500 for particle step), repeated 5 times. We report the median of those 5 runs. Hardware: Apple M1 MacBook Pro, 16GB RAM, Chrome. Benchmarks use in-place operations (pre-allocated output buffers) to isolate computation from allocation overhead. The particle step uses Structure-of-Arrays (SoA) layout for full SIMD lane utilization. Your numbers will differ -- the ratios are what matter, not the absolute times. Run window.runBenchmarks() in the demo to see your own results.
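The methodology just described can be sketched in a few lines. This is an illustrative harness, not the demo's actual code — the function name and options are mine:

```javascript
// Warm up the JIT, time a batch of iterations, repeat 5 times,
// and report the median run (robust against one-off GC pauses).
function benchmark(fn, { warmup = 100, iterations = 1000, repeats = 5 } = {}) {
  const runs = []
  for (let r = 0; r < repeats; r++) {
    for (let i = 0; i < warmup; i++) fn() // let TurboFan stabilize
    const start = performance.now()
    for (let i = 0; i < iterations; i++) fn()
    runs.push(performance.now() - start)
  }
  runs.sort((a, b) => a - b)
  return runs[Math.floor(runs.length / 2)] // median of the repeats
}
```

Taking the median rather than the mean is deliberate: a single run hit by GC or thermal throttling would otherwise skew the result.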
| Test | JavaScript | WASM Scalar | WASM+SIMD | WASM vs JS | SIMD vs JS |
|---|---|---|---|---|---|
| Vector Add (100K floats, 1000 ops) | 413ms | 75ms | 44ms | 5.5x | 9.5x |
| 4x4 Matrix Multiply (10K ops) | 0.66ms | 1.88ms | 1.89ms | 0.35x | 0.35x |
| Particle Step (50K particles, 500 frames) | 116ms | 161ms | 32ms | 0.72x | 3.6x |
The results reveal something important: WASM isn't universally faster than JavaScript. For the 4x4 matrix multiply (only 16 floats), V8's TurboFan JIT optimizes the tight loop so aggressively that the WASM function-call overhead actually makes it slower. The scalar particle step shows a similar pattern -- V8 is competitive on the AoS loop. But SIMD changes the equation dramatically: the SoA particle step with full SIMD is 3.6x faster than JS, and vector addition hits 9.5x. The takeaway: WASM's advantage scales with data size and SIMD utilization, not with code complexity.
Web Workers add another dimension. With 4 workers on an M1 (4 performance cores), physics runs concurrently on all cores while the main thread focuses on rendering. For embarrassingly parallel workloads like our particle system (non-overlapping ranges), you get near-linear scaling up to the core count.
Typical Gains by Technique
| Technique | Measured Speedup | Best For | Complexity |
|---|---|---|---|
| WASM (scalar) | 0.3-5.5x vs JS (depends on data size) | Large-array numeric compute (> 10K elements) | Medium -- Rust toolchain, wasm-pack |
| WASM + SIMD | 3.6-9.5x vs JS | Vectorizable math on large datasets (SoA layout) | High -- SIMD intrinsics, data layout matters |
| Web Workers (4 cores) | ~3x vs single thread | Embarrassingly parallel workloads | Low -- SharedArrayBuffer, message passing |
| Lock Striping | 20-40% vs global lock | High-contention shared state | Very high -- CAS, backoff, stripe tuning |
Key insight from benchmarking: data layout matters more than the language. Scalar WASM with AoS (Array of Structs) layout was actually slower than JS for our particle step. But switching to SoA (Structure of Arrays) -- where all x values are contiguous, all y values are contiguous, etc. -- let SIMD process 4 particles per instruction with 100% lane utilization. That single change turned a 0.72x regression into a 3.6x speedup.
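The two layouts side by side, as an illustrative sketch (the SoA step below is the shape the SIMD build vectorizes, four particles per instruction):

```javascript
const N = 4

// AoS: one interleaved array. A 4-wide SIMD load starting at a particle
// grabs mixed fields ([x, y, z, vx]) — only partly useful per instruction.
const aos = new Float32Array(N * 6) // [x0,y0,z0,vx0,vy0,vz0, x1,...]

// SoA: one contiguous array per field. A 4-wide load grabs four particles'
// x values, so every SIMD lane does identical work (100% utilization).
const soa = {
  x: new Float32Array(N), y: new Float32Array(N), z: new Float32Array(N),
  vx: new Float32Array(N), vy: new Float32Array(N), vz: new Float32Array(N),
}

// The SoA step: each loop is a straight run over contiguous same-type
// floats — exactly what f32x4 loads want.
function stepSoA(p, dt, gravity) {
  for (let i = 0; i < p.x.length; i++) p.vy[i] += gravity * dt
  for (let i = 0; i < p.x.length; i++) p.x[i] += p.vx[i] * dt
  for (let i = 0; i < p.x.length; i++) p.y[i] += p.vy[i] * dt
  for (let i = 0; i < p.x.length; i++) p.z[i] += p.vz[i] * dt
}
```

The trade-off: SoA makes the render sync slightly less convenient (positions live in three arrays instead of one), but that cost is trivial next to the SIMD win.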
When NOT to Use These Optimizations
Every technique adds complexity. Here's an honest assessment of when each one isn't worth it.
Skip WASM if
- Your bottleneck is I/O (network, disk), not CPU
- Operations fit within budget in JS (< 8ms per frame leaves room)
- Code runs rarely (init, event handlers, not per-frame)
- Data is small — marshalling overhead between JS and WASM can negate gains for arrays under ~1000 elements
Skip SIMD if
- Data sets are small (< 1000 elements) — setup overhead dominates
- Operations involve heavy branching (if/else per element doesn't vectorize)
- Memory access is random (pointer chasing, hash lookups) — SIMD needs contiguous data
- Safari support matters and you can't ship two WASM builds (SIMD + fallback)
Skip Workers if
- You're already at 60fps without them
- The workload isn't parallelizable (sequential dependencies between steps)
- Communication overhead exceeds gains — if workers need to sync every frame, message passing can cost more than the parallelism saves
- Cross-Origin Isolation headers break your third-party integrations
Skip Lock Striping if
- Your workers can operate on non-overlapping data (like our particle ranges)
- You have fewer than 4 workers contending on shared state
- Simple Atomics.add / Atomics.compareExchange handles your coordination needs
- The debugging complexity isn't justified — lock-free bugs are notoriously hard to reproduce
The Golden Rule
Profile first, optimize later. Don't add complexity unless measurements prove it's necessary. Use Chrome DevTools Performance tab to identify where your frame time actually goes.
Decision Framework
Implementation Path
- Start with JavaScript. Use flat TypedArrays, avoid object allocation in hot loops. Often this is enough.
- Profile. If physics/compute takes > 8ms per frame, continue.
- Add Web Workers if the work is parallelizable. SharedArrayBuffer + non-overlapping ranges is simple and effective.
- Move hot paths to WASM if single-thread performance is still the bottleneck. Rust + wasm-pack makes this straightforward.
- Add SIMD to WASM code for vectorizable operations (matrix math, batch transforms). Ship with a scalar fallback.
- Lock Striping only if profiling shows workers are contending on shared mutable state.
Technology Decision Matrix
| Technique | Use When | Complexity | Typical Gain |
|---|---|---|---|
| Web Workers + SharedArrayBuffer | Parallelizable compute, > 2 cores available | Low | ~3x (4 cores) |
| Rust + WASM (scalar) | Large arrays (100K+ elements), CPU-bound | Medium | 0.3-5.5x (depends on data size) |
| WASM SIMD + SoA layout | Vectorizable math, contiguous same-type data | High | 3.6-9.5x (vs JS) |
| Lock Striping | 4+ workers, high contention on shared state | Very High | 20-40% (vs global lock) |
Who Actually Needs Each Level
Workers + Atomics (90% of cases): Particle systems, async physics, procedural generation, image processing pipelines.
Rust + WASM (10% of cases): Complex algorithms (pathfinding, fluid simulation, spatial indexing), large dataset processing, cryptography.
SIMD (5% of cases): Batch matrix/vector operations at scale, real-time audio DSP, image convolution filters.
Lock Striping (< 1% of cases): 8+ workers sharing a concurrent spatial hash, real-time multiplayer game engines, high-frequency data visualization.
Key Takeaways
- Data layout matters more than language. Switching from AoS to SoA turned a 0.72x WASM regression into a 3.6x SIMD speedup. Think about your memory layout before reaching for WASM.
- WASM isn't automatically faster. V8's TurboFan is extremely competitive on small, tight loops. WASM's advantage shows at scale (100K+ elements) and with SIMD.
- SIMD is the real multiplier. Vector addition: 9.5x faster than JS. Particle physics (SoA): 3.6x. But it requires contiguous same-type data — SoA, not AoS.
- Web Workers scale with cores for embarrassingly parallel work. Non-overlapping data ranges mean zero synchronization overhead.
- Lock Striping is a last resort. Atomics.add and Atomics.compareExchange handle most concurrency needs without custom lock tables.
- Profile before optimizing. Our benchmarks proved that scalar WASM can be slower than JS. Measure, don't assume.
The full source code for the particle demo — including the Rust WASM module, Three.js renderer, Web Workers, and benchmark harness — is available in the live demo.