WASM // 2025-10-27

WebGL Performance Power-Up: Three.js, WASM, SIMD, and Lock-Free Concurrency

Building a 50K particle physics system that progressively layers WASM, SIMD, and Web Workers on top of Three.js. Real benchmarks, real code, real trade-offs.

PALLAV // 22 MIN READ

I built a 50,000-particle physics simulation that runs at 60fps in the browser. It started as plain JavaScript and Three.js. Then I hit a wall at 20K particles. This post walks through every optimization I applied — WASM, SIMD, Web Workers — showing the exact code, the benchmarks, and when each technique actually matters.

The live demo lets you switch between JS, WASM, WASM+SIMD, and Web Worker modes in real-time and see the FPS difference.

What We're Building

A particle system where 50K particles fall under gravity, bounce off walls, and render as points via Three.js. Same simulation, four different physics backends — each one faster than the last.


The Baseline: Three.js + JavaScript

Three.js handles the GPU side — scene graph, camera, and WebGL draw calls. The physics runs on the CPU: each frame we update 50K particles (position, velocity, gravity, boundary collisions) and push the results to a BufferGeometry.

Each particle is 6 floats: [x, y, z, vx, vy, vz]. That's 300K floats, about 1.2MB of data we're touching every frame.
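The arithmetic, spelled out (constants match the snippet that follows):

```javascript
// Footprint of the flat particle buffer: 50K particles x 6 floats x 4 bytes
const PARTICLE_COUNT = 50_000
const FLOATS_PER_PARTICLE = 6 // x, y, z, vx, vy, vz

const totalFloats = PARTICLE_COUNT * FLOATS_PER_PARTICLE
const bytes = totalFloats * Float32Array.BYTES_PER_ELEMENT // 4 bytes per f32

console.log(`${totalFloats} floats, ${bytes / 1e6} MB`) // 300000 floats, 1.2 MB
```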

js/main.js
const PARTICLE_COUNT = 50_000
const FLOATS_PER_PARTICLE = 6 // x, y, z, vx, vy, vz
const BOUNDS = 50.0

// Flat Float32Array — no objects, no GC pressure
const particleData = new Float32Array(PARTICLE_COUNT * FLOATS_PER_PARTICLE)

function initParticles(data) {
  for (let i = 0; i < PARTICLE_COUNT; i++) {
    const base = i * FLOATS_PER_PARTICLE
    data[base + 0] = (Math.random() - 0.5) * BOUNDS * 1.5 // x
    data[base + 1] = Math.random() * BOUNDS               // y
    data[base + 2] = (Math.random() - 0.5) * BOUNDS * 1.5 // z
    data[base + 3] = (Math.random() - 0.5) * 20           // vx
    data[base + 4] = (Math.random() - 0.5) * 10           // vy
    data[base + 5] = (Math.random() - 0.5) * 20           // vz
  }
}

The physics loop is straightforward — gravity, Euler integration, boundary reflection:

js/main.js
function stepParticlesJS(data, dt, gravity, bounds, damping) {
  for (let i = 0; i < PARTICLE_COUNT; i++) {
    const base = i * FLOATS_PER_PARTICLE

    // Gravity
    data[base + 4] += gravity * dt

    // Integrate velocity -> position
    data[base + 0] += data[base + 3] * dt // x += vx * dt
    data[base + 1] += data[base + 4] * dt // y += vy * dt
    data[base + 2] += data[base + 5] * dt // z += vz * dt

    // Boundary reflection
    for (let axis = 0; axis < 3; axis++) {
      const posIdx = base + axis
      const velIdx = base + 3 + axis
      if (data[posIdx] > bounds) {
        data[posIdx] = bounds
        data[velIdx] = -data[velIdx] * damping
      } else if (data[posIdx] < -bounds) {
        data[posIdx] = -bounds
        data[velIdx] = -data[velIdx] * damping
      }
    }
  }
}

The Three.js rendering side copies positions from our flat array into a BufferGeometry each frame:

js/main.js
const geometry = new THREE.BufferGeometry()
const positionBuffer = new Float32Array(PARTICLE_COUNT * 3)
geometry.setAttribute('position', new THREE.BufferAttribute(positionBuffer, 3))

const material = new THREE.PointsMaterial({
  size: 0.4,
  color: 0x4a7fd4,
  transparent: true,
  opacity: 0.8,
  sizeAttenuation: true,
})

const points = new THREE.Points(geometry, material)
scene.add(points)

// Each frame: extract x,y,z from interleaved data
function syncPositions(particleData, positionBuffer) {
  for (let i = 0; i < PARTICLE_COUNT; i++) {
    const src = i * FLOATS_PER_PARTICLE
    const dst = i * 3
    positionBuffer[dst] = particleData[src]         // x
    positionBuffer[dst + 1] = particleData[src + 1] // y
    positionBuffer[dst + 2] = particleData[src + 2] // z
  }
  geometry.attributes.position.needsUpdate = true
}

At 50K particles, the JS physics step takes about 0.23ms per frame on an M1 MacBook Pro (Chrome) — well within budget. But this scales linearly: at 500K particles it's 2.3ms, at 2M it's 9ms, and you start eating into your 16ms frame budget. The optimization techniques below become essential as scene complexity grows.


Level 1: Rust + WASM

The physics loop is pure arithmetic on a flat array — exactly the kind of work WASM was designed for. No DOM, no async, no string manipulation. Just math on contiguous memory.

The Rust Implementation

The WASM module takes a mutable slice of the same flat Float32Array and does the physics in Rust:

src/lib.rs
use wasm_bindgen::prelude::*;

/// Update particle positions and velocities in-place.
/// Layout per particle: [x, y, z, vx, vy, vz] (6 floats)
#[wasm_bindgen]
pub fn step_particles(
    data: &mut [f32],
    dt: f32,
    gravity: f32,
    bounds: f32,
    damping: f32,
) {
    let floats_per_particle = 6;
    let count = data.len() / floats_per_particle;

    for i in 0..count {
        let base = i * floats_per_particle;

        // Gravity
        data[base + 4] += gravity * dt;

        // Integrate velocity -> position
        data[base + 0] += data[base + 3] * dt;
        data[base + 1] += data[base + 4] * dt;
        data[base + 2] += data[base + 5] * dt;

        // Boundary reflection with damping
        for axis in 0..3 {
            let pos_idx = base + axis;
            let vel_idx = base + 3 + axis;
            if data[pos_idx] > bounds {
                data[pos_idx] = bounds;
                data[vel_idx] = -data[vel_idx] * damping;
            } else if data[pos_idx] < -bounds {
                data[pos_idx] = -bounds;
                data[vel_idx] = -data[vel_idx] * damping;
            }
        }
    }
}

The Rust code is structurally identical to the JavaScript version. The performance difference comes from: (1) no JIT warmup — WASM executes at near-native speed immediately, (2) no type checks — f32 is always f32, and (3) the Rust compiler's optimizer (LLVM) is more aggressive than V8's TurboFan for tight numeric loops.

Build and Integration

Cargo.toml
[package]
name = "webgl-particle-demo"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
wasm-bindgen = "0.2"

[profile.release]
opt-level = 3
lto = true
bash
# Build with wasm-pack
wasm-pack build --target web --release

Calling it from JavaScript is seamless thanks to wasm-bindgen:

js/main.js
import init, { step_particles } from '../pkg/webgl_particle_demo.js'

await init()

// Same data, same parameters — just call the WASM function
function animate() {
  const start = performance.now()
  step_particles(particleData, 1/60, -9.81, BOUNDS, 0.7)
  const physicsTime = performance.now() - start

  syncPositions(particleData, positionBuffer)
  renderer.render(scene, camera)
  requestAnimationFrame(animate)
}

Here's the honest result: for this particular loop (AoS layout, 50K particles), scalar WASM was not faster than JavaScript on Chrome/M1. V8's TurboFan JIT is extremely good at optimizing tight numeric loops on typed arrays. The real payoff comes later — when we combine WASM with SIMD and a data layout designed for vectorization.

When NOT to Use WASM

  • Your code is I/O bound (waiting on network, disk) rather than CPU bound
  • Operations already run within budget in JavaScript (< 16ms per frame)
  • Code runs rarely (one-time initialization, event handlers)
  • The overhead of copying data between JS and WASM exceeds the gains — for small arrays (< 1000 elements), JS is often faster due to marshalling cost
[Diagram: WASM integration pipeline -- JavaScript (Three.js + UI) calls step_particles() in the WASM module, which runs as LLVM-optimized native code on the CPU and hands the mutated Float32Array back to JS via wasm-bindgen.]

Level 2: SIMD — Processing 4 Floats Per Instruction

SIMD (Single Instruction, Multiple Data) lets the CPU operate on 4 floats simultaneously via 128-bit registers. Instead of adding positions one at a time, we load 4 values, add 4 values, store 4 values — all in a single instruction.

WASM SIMD maps directly to hardware SIMD on both x86 (SSE4.1) and ARM (NEON). The browser handles the translation.

SIMD Particle Physics

The velocity integration (position += velocity * dt) vectorizes well. Boundary checks don't — they involve branching per-particle. So we split the work: SIMD for integration, scalar for boundaries.

src/lib.rs
#[cfg(target_arch = "wasm32")]
use std::arch::wasm32::*;

#[wasm_bindgen]
pub fn step_particles_simd(
    data: &mut [f32],
    dt: f32,
    gravity: f32,
    bounds: f32,
    damping: f32,
) {
    let floats_per_particle = 6;
    let count = data.len() / floats_per_particle;

    // SIMD pass: velocity integration.
    // Per particle, load [x, y, z, vx] and the overlapping [vx, vy, vz, next.x],
    // add vel * dt to pos, then restore lane 3 so vx isn't corrupted.
    let vdt = f32x4_splat(dt);
    let simd_count = count.saturating_sub(1); // last particle handled below

    for i in 0..simd_count {
        let base = i * floats_per_particle;

        // Gravity (scalar -- it touches a single lane)
        data[base + 4] += gravity * dt;

        unsafe {
            let pos = v128_load(data.as_ptr().add(base) as *const v128);     // [x, y, z, vx]
            let vel = v128_load(data.as_ptr().add(base + 3) as *const v128); // [vx, vy, vz, next.x]
            let mut updated = f32x4_add(pos, f32x4_mul(vel, vdt));
            // Lane 3 now holds vx + next.x * dt -- garbage. Put vx back.
            updated = f32x4_replace_lane::<3>(updated, data[base + 3]);
            v128_store(data.as_mut_ptr().add(base) as *mut v128, updated);
        }
    }

    // Last particle: scalar, since a v128 load at base + 3 would read past the slice
    if count > 0 {
        let base = (count - 1) * floats_per_particle;
        data[base + 4] += gravity * dt;
        data[base + 0] += data[base + 3] * dt;
        data[base + 1] += data[base + 4] * dt;
        data[base + 2] += data[base + 5] * dt;
    }

    // Scalar pass: boundary reflection (branching doesn't vectorize)
    for i in 0..count {
        let base = i * floats_per_particle;
        for axis in 0..3 {
            let pos_idx = base + axis;
            let vel_idx = base + 3 + axis;
            if data[pos_idx] > bounds {
                data[pos_idx] = bounds;
                data[vel_idx] = -data[vel_idx] * damping;
            } else if data[pos_idx] < -bounds {
                data[pos_idx] = -bounds;
                data[vel_idx] = -data[vel_idx] * damping;
            }
        }
    }
}

SIMD Matrix Multiply

Where SIMD really shines is linear algebra. A 4x4 matrix multiply in scalar code is 64 multiplications and 48 additions. With SIMD, each output column is computed with 4 multiply+add operations on full 128-bit vectors:

src/lib.rs
/// SIMD 4x4 matrix multiply (column-major, matches WebGL/Three.js convention).
#[wasm_bindgen]
pub fn simd_mat4_multiply(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert!(a.len() >= 16 && b.len() >= 16);
    let mut out = vec![0.0f32; 16];

    unsafe {
        let a_col0 = v128_load(a.as_ptr().add(0) as *const v128);
        let a_col1 = v128_load(a.as_ptr().add(4) as *const v128);
        let a_col2 = v128_load(a.as_ptr().add(8) as *const v128);
        let a_col3 = v128_load(a.as_ptr().add(12) as *const v128);

        for col in 0..4 {
            let b_base = col * 4;
            let b0 = f32x4_splat(b[b_base]);
            let b1 = f32x4_splat(b[b_base + 1]);
            let b2 = f32x4_splat(b[b_base + 2]);
            let b3 = f32x4_splat(b[b_base + 3]);

            // out_col = a_col0*b0 + a_col1*b1 + a_col2*b2 + a_col3*b3
            let r = f32x4_add(
                f32x4_add(f32x4_mul(a_col0, b0), f32x4_mul(a_col1, b1)),
                f32x4_add(f32x4_mul(a_col2, b2), f32x4_mul(a_col3, b3)),
            );

            let tmp: [f32; 4] = std::mem::transmute(r);
            out[col * 4..col * 4 + 4].copy_from_slice(&tmp);
        }
    }

    out
}

This matters when you're transforming thousands of vertices per frame. Batch-multiplying 10K matrices with SIMD runs 3-5x faster than scalar WASM.
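As a sanity check on the convention, here's a plain-JS scalar reference (a hypothetical helper, not part of the demo source) computing the same column-major product:

```javascript
// Column-major 4x4 multiply: out = a * b, same layout as WebGL/Three.js.
// Element (row, col) lives at index col * 4 + row.
function mat4Multiply(a, b) {
  const out = new Float32Array(16)
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0
      for (let k = 0; k < 4; k++) {
        sum += a[k * 4 + row] * b[col * 4 + k] // a's column k scaled by b[col][k]
      }
      out[col * 4 + row] = sum
    }
  }
  return out
}

const I = new Float32Array([1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1])
const m = new Float32Array([...Array(16).keys()]) // 0..15
const r = mat4Multiply(I, m) // identity * m -- same contents as m
```

Handy during development for asserting that the SIMD output matches the scalar result.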

Building with SIMD

bash
# Enable SIMD in the compiled WASM output
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web --release

Memory Alignment Matters

SIMD loads (v128_load) work best on 16-byte-aligned memory. Misaligned loads still work on modern hardware but may be slower on some platforms. For hot data structures, enforce alignment:

rust
// Note: #[repr(align(16))] aligns the struct itself, NOT a Vec's heap buffer.
// To guarantee 16-byte-aligned element storage, allocate in v128 units:
struct AlignedParticles {
    blocks: Vec<v128>, // each v128 element is 16-byte aligned
}

// For this particle system it's moot: Float32Array gives 4-byte alignment,
// and WASM v128_load handles unaligned access correctly. Explicit alignment
// matters more for custom SIMD data structures.

When SIMD Doesn't Help

SIMD is a poor fit when: data sets are small (< 1000 elements, overhead dominates), operations involve heavy branching (if/else per element), or memory access is unpredictable (random indices, pointer chasing). Our boundary checks are a good example -- they branch per particle, so they stay scalar.

Browser Support

Browser | WASM SIMD Support | Minimum Version
Chrome | Yes | 91+ (May 2021)
Firefox | Yes | 89+ (June 2021)
Safari | Yes | 16.4+ (March 2023)
Edge | Yes | 91+ (May 2021)

Feature Detection

feature-detect.js
import { simd } from 'wasm-feature-detect'

const simdSupported = await simd()

let stepFn
if (simdSupported) {
  // Load SIMD-enabled WASM build
  const mod = await import('./pkg/simd/webgl_particle_demo.js')
  stepFn = mod.step_particles_simd
} else {
  // Fallback to scalar WASM
  const mod = await import('./pkg/scalar/webgl_particle_demo.js')
  stepFn = mod.step_particles
}

Level 3: Web Workers + SharedArrayBuffer

WASM+SIMD makes the physics faster per-core, but we're still single-threaded. Modern devices have 4-8+ cores sitting idle. Web Workers let us split the particle array across cores — each worker updates its range, and the main thread just reads the shared memory for rendering.

The Architecture

The key insight: each particle's physics is independent. Worker 1 updates particles 0-12,499, Worker 2 updates 12,500-24,999, and so on. No locks needed because the memory ranges don't overlap.

js/main.js
// Allocate shared memory (accessible by all workers simultaneously)
const sharedBuffer = new SharedArrayBuffer(
  PARTICLE_COUNT * FLOATS_PER_PARTICLE * Float32Array.BYTES_PER_ELEMENT
)
const particleData = new Float32Array(sharedBuffer)

// Spawn workers, each owns a non-overlapping range
const WORKER_COUNT = Math.min(navigator.hardwareConcurrency || 2, 4)
const particlesPerWorker = Math.floor(PARTICLE_COUNT / WORKER_COUNT)

for (let i = 0; i < WORKER_COUNT; i++) {
  const worker = new Worker(new URL('./particle-worker.js', import.meta.url))
  const start = i * particlesPerWorker
  const end = i === WORKER_COUNT - 1 ? PARTICLE_COUNT : (i + 1) * particlesPerWorker

  // Pass the shared buffer — no copying, workers see the same memory
  worker.postMessage({ type: 'init', buffer: sharedBuffer, start, end })
}
js/particle-worker.js
const FLOATS_PER_PARTICLE = 6
const GRAVITY = -9.81
const BOUNDS = 50.0
const DAMPING = 0.7
const DT = 1 / 60

let particles = null
let startIdx = 0
let endIdx = 0

self.onmessage = (e) => {
  if (e.data.type === 'init') {
    // Wrap the shared buffer — same memory, no copy
    particles = new Float32Array(e.data.buffer)
    startIdx = e.data.start
    endIdx = e.data.end
    tick()
  }
}

function tick() {
  for (let i = startIdx; i < endIdx; i++) {
    const base = i * FLOATS_PER_PARTICLE

    particles[base + 4] += GRAVITY * DT
    particles[base + 0] += particles[base + 3] * DT
    particles[base + 1] += particles[base + 4] * DT
    particles[base + 2] += particles[base + 5] * DT

    for (let axis = 0; axis < 3; axis++) {
      const posIdx = base + axis
      const velIdx = base + 3 + axis
      if (particles[posIdx] > BOUNDS) {
        particles[posIdx] = BOUNDS
        particles[velIdx] = -particles[velIdx] * DAMPING
      } else if (particles[posIdx] < -BOUNDS) {
        particles[posIdx] = -BOUNDS
        particles[velIdx] = -particles[velIdx] * DAMPING
      }
    }
  }

  setTimeout(tick, 16) // ~60fps
}

The render loop on the main thread just reads from the shared buffer — the workers are updating it continuously in the background:

javascript
function animate() {
  // Workers are writing positions in the background.
  // We just read from the same SharedArrayBuffer — zero copy.
  syncPositions(particleData, positionBuffer)
  geometry.attributes.position.needsUpdate = true
  renderer.render(scene, camera)
  requestAnimationFrame(animate)
}

SharedArrayBuffer Requires Cross-Origin Isolation

Your server must set these headers; otherwise SharedArrayBuffer will be undefined:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

This can break third-party iframes and scripts that don't support CORP. Test thoroughly.
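If the demo is served by a Vite dev server (an assumption -- any server or CDN that can set response headers works), the headers can be added in the config:

```javascript
// vite.config.js -- emit the cross-origin isolation headers on every dev response
export default {
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
}
```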

When Workers Need the Same Data: Atomics

Our particle system doesn't need synchronization because workers own non-overlapping ranges. But what about shared state — a collision counter, a global force accumulator, a spatial hash grid? That's where Atomics comes in.

javascript
// Shared state array (separate from particle data)
const stateBuffer = new SharedArrayBuffer(16 * Int32Array.BYTES_PER_ELEMENT)
const state = new Int32Array(stateBuffer)

// Worker: atomically increment collision counter
Atomics.add(state, 0, 1) // index 0 = collision count

// Worker: CAS (Compare-And-Swap) for more complex updates
let old, updated
do {
  old = Atomics.load(state, 1)
  updated = old + computeForce()
} while (Atomics.compareExchange(state, 1, old, updated) !== old)

// Waiter: block until index 2 != 0. Note: Atomics.wait throws on the
// main thread -- call it from a worker, or use Atomics.waitAsync on the
// main thread where supported.
Atomics.wait(state, 2, 0)

// Worker: signal the waiter
Atomics.store(state, 2, 1)
Atomics.notify(state, 2) // wake up waiting threads

Start Here

JavaScript's built-in Atomics API covers 90% of multithreading use cases. It's simple, safe, and doesn't require WASM. Only reach for advanced lock-free techniques when profiling shows contention is the bottleneck.
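A quick feel for those two primitives -- runnable anywhere SharedArrayBuffer exists, no workers required:

```javascript
const sab = new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT)
const state = new Int32Array(sab)

// Atomics.add stores the sum atomically and returns the PREVIOUS value
Atomics.add(state, 0, 5)
Atomics.add(state, 0, 3)
console.log(Atomics.load(state, 0)) // 8

// compareExchange writes only when the current value equals `expected`
Atomics.compareExchange(state, 1, 0, 42) // 0 matches -> state[1] becomes 42
Atomics.compareExchange(state, 1, 0, 99) // 42 !== 0 -> no write
console.log(Atomics.load(state, 1)) // 42
```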

Browser Support for SharedArrayBuffer

Browser | Support | Requirements
Chrome | 68+ | Cross-Origin Isolation headers
Firefox | 79+ | Cross-Origin Isolation headers
Safari | 15.2+ | Cross-Origin Isolation headers
Edge | 79+ | Cross-Origin Isolation headers
feature-detect.js
import { simd, threads, bulkMemory } from 'wasm-feature-detect'

async function detectCapabilities() {
  const features = {
    simd: await simd(),
    threads: await threads(),
    bulkMemory: await bulkMemory(),
    sharedArrayBuffer: typeof SharedArrayBuffer !== 'undefined',
    crossOriginIsolated: crossOriginIsolated ?? false,
  }

  console.table(features)

  if (!features.sharedArrayBuffer || !features.crossOriginIsolated) {
    console.warn('Multi-threading unavailable. Falling back to single-thread WASM.')
  }

  return features
}
[Diagram: multi-threaded particle architecture -- Workers 1-4 each own a non-overlapping range (0-12K, 12K-25K, 25K-37K, 37K-50K) of one SharedArrayBuffer (300K floats, 1.2MB); the main thread reads positions from the same memory for the Three.js render. No copies, no locks.]

Level 4: Lock-Free Concurrency (When You Actually Need It)

You Probably Don't Need This

Lock Striping is an advanced technique for high-contention scenarios -- many workers frequently competing for the same resources. For particle systems, non-overlapping ranges (Level 3) are almost always sufficient. Only proceed if profiling shows lock contention is your bottleneck.

But sometimes workers do need shared mutable state — a spatial hash grid for collision detection, a shared resource pool, a concurrent command buffer. When Atomics.add isn't enough and a single global lock creates a bottleneck, Lock Striping divides the lock into N independent stripes.

How Lock Striping Works

Instead of one lock protecting all shared resources, you create a table of locks. Each resource maps to a stripe via resource_index % stripe_count. Workers only block each other when they happen to access the same stripe — which, with enough stripes, is rare.

src/lib.rs
use std::sync::atomic::{AtomicI32, Ordering};

#[wasm_bindgen]
pub struct StripedLockTable {
    locks: Vec<AtomicI32>,
    stripe_count: usize,
}

#[wasm_bindgen]
impl StripedLockTable {
    #[wasm_bindgen(constructor)]
    pub fn new(stripe_count: usize) -> Self {
        let mut locks = Vec::with_capacity(stripe_count);
        for _ in 0..stripe_count {
            locks.push(AtomicI32::new(0)); // 0 = unlocked
        }
        Self { locks, stripe_count }
    }

    /// Map a resource index to a stripe.
    #[wasm_bindgen(js_name = stripeFor)]
    pub fn stripe_for(&self, resource_index: usize) -> usize {
        resource_index % self.stripe_count
    }

    /// Try to acquire using CAS. Returns true if lock acquired.
    #[wasm_bindgen(js_name = tryAcquire)]
    pub fn try_acquire(&self, stripe: usize) -> bool {
        if stripe >= self.stripe_count { return false }
        // Atomic CAS: swap 0 (unlocked) -> 1 (locked)
        self.locks[stripe]
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Release the stripe lock.
    #[wasm_bindgen]
    pub fn release(&self, stripe: usize) {
        if stripe < self.stripe_count {
            self.locks[stripe].store(0, Ordering::Release);
        }
    }
}
lock-striping-usage.js
import init, { StripedLockTable } from './pkg/webgl_particle_demo.js'
await init()

const locks = new StripedLockTable(16) // 16 stripes

async function accessSharedResource(resourceIndex) {
  const stripe = locks.stripeFor(resourceIndex)
  const MAX_RETRIES = 100

  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    if (locks.tryAcquire(stripe)) {
      try {
        // Safe to read/write the shared resource
        updateSpatialHashCell(resourceIndex)
      } finally {
        locks.release(stripe)
      }
      return
    }
    // Exponential backoff: 1ms, 2ms, 4ms... max 50ms
    await new Promise(r => setTimeout(r, Math.min(2 ** attempt, 50)))
  }

  console.error(`Failed to acquire stripe ${stripe} after ${MAX_RETRIES} retries`)
}
[Diagram: lock striping CAS flow -- hash the resource to a stripe, attempt a CAS on that stripe's lock; on success do the work and release, on failure back off and retry. The stripe table holds 16 independent locks.]

Benchmarks

Benchmark Methodology

All benchmarks run in-browser using performance.now(). Each test does 100 warmup iterations (to stabilize JIT), then 1000 timed iterations (500 for particle step), repeated 5 times. We report the median of those 5 runs. Hardware: Apple M1 MacBook Pro, 16GB RAM, Chrome. Benchmarks use in-place operations (pre-allocated output buffers) to isolate computation from allocation overhead. The particle step uses Structure-of-Arrays (SoA) layout for full SIMD lane utilization. Your numbers will differ -- the ratios are what matter, not the absolute times. Run window.runBenchmarks() in the demo to see your own results.
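That procedure -- warmup, timed runs, median -- can be sketched as a small harness (a sketch of the approach, not the demo's exact harness):

```javascript
// Median-of-N benchmark: warm up so the JIT stabilizes, then time `iters`
// calls per run and report the median across runs.
function bench(fn, { warmup = 100, iters = 1000, runs = 5 } = {}) {
  for (let i = 0; i < warmup; i++) fn()

  const times = []
  for (let r = 0; r < runs; r++) {
    const start = performance.now()
    for (let i = 0; i < iters; i++) fn()
    times.push(performance.now() - start)
  }

  times.sort((a, b) => a - b)
  return times[Math.floor(runs / 2)] // median run time in ms
}
```

For example, `bench(() => stepParticlesJS(particleData, 1/60, -9.81, BOUNDS, 0.7), { iters: 500 })` mirrors the particle-step test.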

Test | JavaScript | WASM Scalar | WASM+SIMD | WASM vs JS | SIMD vs JS
Vector Add (100K floats, 1000 ops) | 413ms | 75ms | 44ms | 5.5x | 9.5x
4x4 Matrix Multiply (10K ops) | 0.66ms | 1.88ms | 1.89ms | 0.35x | 0.35x
Particle Step (50K particles, 500 frames) | 116ms | 161ms | 32ms | 0.72x | 3.6x

The results reveal something important: WASM isn't universally faster than JavaScript. For the 4x4 matrix multiply (only 16 floats), V8's TurboFan JIT optimizes the tight loop so aggressively that the WASM function-call overhead actually makes it slower. The scalar particle step shows a similar pattern -- V8 is competitive on the AoS loop. But SIMD changes the equation dramatically: the SoA particle step with full SIMD is 3.6x faster than JS, and vector addition hits 9.5x. The takeaway: WASM's advantage scales with data size and SIMD utilization, not with code complexity.

Web Workers add another dimension. With 4 workers on an M1 (4 performance cores), physics runs concurrently on all cores while the main thread focuses on rendering. For embarrassingly parallel workloads like our particle system (non-overlapping ranges), you get near-linear scaling up to the core count.

Typical Gains by Technique

Technique | Measured Speedup | Best For | Complexity
WASM (scalar) | 0.3-5.5x vs JS (depends on data size) | Large-array numeric compute (> 10K elements) | Medium -- Rust toolchain, wasm-pack
WASM + SIMD | 3.6-9.5x vs JS | Vectorizable math on large datasets (SoA layout) | High -- SIMD intrinsics, data layout matters
Web Workers (4 cores) | ~3x vs single thread | Embarrassingly parallel workloads | Low -- SharedArrayBuffer, message passing
Lock Striping | 20-40% vs global lock | High-contention shared state | Very high -- CAS, backoff, stripe tuning

Key insight from benchmarking: data layout matters more than the language. Scalar WASM with AoS (Array of Structs) layout was actually slower than JS for our particle step. But switching to SoA (Structure of Arrays) -- where all x values are contiguous, all y values are contiguous, etc. -- let SIMD process 4 particles per instruction with 100% lane utilization. That single change turned a 0.72x regression into a 3.6x speedup.
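To make the layout change concrete, here's a hedged sketch of an AoS-to-SoA conversion (helper name hypothetical):

```javascript
// AoS: [x0,y0,z0,vx0,vy0,vz0, x1,y1,...] -- one particle's fields together.
// SoA: xs[i], ys[i], ... -- one field's values together, so a single 128-bit
// SIMD load grabs the same component of 4 consecutive particles.
function aosToSoA(aos, count) {
  const soa = {
    x: new Float32Array(count),
    y: new Float32Array(count),
    z: new Float32Array(count),
    vx: new Float32Array(count),
    vy: new Float32Array(count),
    vz: new Float32Array(count),
  }
  for (let i = 0; i < count; i++) {
    const base = i * 6
    soa.x[i] = aos[base]
    soa.y[i] = aos[base + 1]
    soa.z[i] = aos[base + 2]
    soa.vx[i] = aos[base + 3]
    soa.vy[i] = aos[base + 4]
    soa.vz[i] = aos[base + 5]
  }
  return soa
}
```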


When NOT to Use These Optimizations

Every technique adds complexity. Here's an honest assessment of when each one isn't worth it.

Skip WASM if

  • Your bottleneck is I/O (network, disk), not CPU
  • Operations fit within budget in JS (< 8ms per frame leaves room)
  • Code runs rarely (init, event handlers, not per-frame)
  • Data is small — marshalling overhead between JS and WASM can negate gains for arrays under ~1000 elements

Skip SIMD if

  • Data sets are small (< 1000 elements) — setup overhead dominates
  • Operations involve heavy branching (if/else per element doesn't vectorize)
  • Memory access is random (pointer chasing, hash lookups) — SIMD needs contiguous data
  • Safari support matters and you can't ship two WASM builds (SIMD + fallback)

Skip Workers if

  • You're already at 60fps without them
  • The workload isn't parallelizable (sequential dependencies between steps)
  • Communication overhead exceeds gains — if workers need to sync every frame, message passing can cost more than the parallelism saves
  • Cross-Origin Isolation headers break your third-party integrations

Skip Lock Striping if

  • Your workers can operate on non-overlapping data (like our particle ranges)
  • You have fewer than 4 workers contending on shared state
  • Simple Atomics.add / Atomics.compareExchange handles your coordination needs
  • The debugging complexity isn't justified — lock-free bugs are notoriously hard to reproduce

The Golden Rule

Profile first, optimize later. Don't add complexity unless measurements prove it's necessary. Use Chrome DevTools Performance tab to identify where your frame time actually goes.
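One lightweight way to do that from code is the User Timing API, whose measures show up as named spans in the Performance tab (a sketch; the wrapped calls below are placeholders):

```javascript
// Wrap a frame phase in mark/measure and return its duration in ms.
function measurePhase(name, fn) {
  performance.mark(`${name}-start`)
  fn()
  performance.mark(`${name}-end`)
  performance.measure(name, `${name}-start`, `${name}-end`)
  const entries = performance.getEntriesByName(name)
  return entries[entries.length - 1].duration
}

// In the frame loop, e.g.:
// const physicsMs = measurePhase('physics', () => stepParticlesJS(/* ... */))
// const renderMs  = measurePhase('render',  () => renderer.render(scene, camera))
```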


Decision Framework

Implementation Path

  1. Start with JavaScript. Use flat TypedArrays, avoid object allocation in hot loops. Often this is enough.
  2. Profile. If physics/compute takes > 8ms per frame, continue.
  3. Add Web Workers if the work is parallelizable. SharedArrayBuffer + non-overlapping ranges is simple and effective.
  4. Move hot paths to WASM if single-thread performance is still the bottleneck. Rust + wasm-pack makes this straightforward.
  5. Add SIMD to WASM code for vectorizable operations (matrix math, batch transforms). Ship with a scalar fallback.
  6. Lock Striping only if profiling shows workers are contending on shared mutable state.

Technology Decision Matrix

Technique | Use When | Complexity | Typical Gain
Web Workers + SharedArrayBuffer | Parallelizable compute, > 2 cores available | Low | ~3x (4 cores)
Rust + WASM (scalar) | Large arrays (100K+ elements), CPU-bound | Medium | 0.3-5.5x (depends on data size)
WASM SIMD + SoA layout | Vectorizable math, contiguous same-type data | High | 3.6-9.5x (vs JS)
Lock Striping | 4+ workers, high contention on shared state | Very High | 20-40% (vs global lock)

Who Actually Needs Each Level

Workers + Atomics (90% of cases): Particle systems, async physics, procedural generation, image processing pipelines.

Rust + WASM (10% of cases): Complex algorithms (pathfinding, fluid simulation, spatial indexing), large dataset processing, cryptography.

SIMD (5% of cases): Batch matrix/vector operations at scale, real-time audio DSP, image convolution filters.

Lock Striping (< 1% of cases): 8+ workers sharing a concurrent spatial hash, real-time multiplayer game engines, high-frequency data visualization.


Key Takeaways

  1. Data layout matters more than language. Switching from AoS to SoA turned a 0.72x WASM regression into a 3.6x SIMD speedup. Think about your memory layout before reaching for WASM.
  2. WASM isn't automatically faster. V8's TurboFan is extremely competitive on small, tight loops. WASM's advantage shows at scale (100K+ elements) and with SIMD.
  3. SIMD is the real multiplier. Vector addition: 9.5x faster than JS. Particle physics (SoA): 3.6x. But it requires contiguous same-type data — SoA, not AoS.
  4. Web Workers scale with cores for embarrassingly parallel work. Non-overlapping data ranges mean zero synchronization overhead.
  5. Lock Striping is a last resort. Atomics.add and Atomics.compareExchange handle most concurrency needs without custom lock tables.
  6. Profile before optimizing. Our benchmarks proved that scalar WASM can be slower than JS. Measure, don't assume.

The full source code for the particle demo — including the Rust WASM module, Three.js renderer, Web Workers, and benchmark harness — is available in the live demo.