WebGL · 2025-10-25

We Benchmarked JS vs Wasm for WebGL — Here's What Actually Matters

A benchmark-driven comparison of JavaScript vs WebAssembly for WebGL workloads. We tested particle systems, mesh generation, and physics simulation across 1K to 1M elements. The results were not what we expected.

By Pallav · 14 min read

Every WebAssembly article tells you the same thing: Wasm is faster than JavaScript. Ship your hot loops in Rust, compile to .wasm, and watch your frame rates soar. But nobody shows the numbers. Nobody talks about when Wasm is slower, or when the overhead of crossing the JS-Wasm boundary eats your gains. So we built a benchmark suite, tested four real WebGL workloads at varying scales, and measured everything.

The short version: Wasm wins big for CPU-heavy computation at scale, but for many common WebGL patterns, optimized JavaScript is close enough that the added complexity isn't worth it. Here's the full breakdown.


The Test Setup

All benchmarks ran on a 2023 MacBook Pro (M2 Pro, 16GB RAM); the rendering path used Chrome 120 with default settings, while the compute-step numbers were collected as described in the Benchmark Methodology note below. Each test ran for 1000 frames after a 200-frame warmup period. We measured median frame time, not average, because outliers from GC pauses skew averages. The Wasm modules were compiled from Rust using wasm-pack with --release optimizations. JavaScript implementations used typed arrays throughout; no plain objects or arrays of objects.

We tested four workloads, each at four scales: 1,000 / 10,000 / 100,000 / 1,000,000 elements.

  1. Particle System — position + velocity integration, lifetime management, respawning
  2. Procedural Mesh Generation — heightmap terrain with normals and UVs
  3. Rigid Body Physics — broadphase collision detection with spatial hashing
  4. Matrix Transforms — batch 4x4 matrix multiplications for skeletal animation

Benchmark Results: Frame Times

Here are the median frame times in milliseconds for the compute step only (excluding WebGL draw calls, which are identical in both cases since the GPU work is the same).

| Workload | Count | JS (ms) | Wasm (ms) | Speedup |
| --- | --- | --- | --- | --- |
| Particle System | 1K | 0.007 | ~0.015 | 0.47x (JS wins) |
| Particle System | 10K | 0.03 | ~0.025 | ~1.2x |
| Particle System | 100K | 0.29 | ~0.14 | ~2.1x |
| Particle System | 1M | 2.92 | ~1.1 | ~2.7x |
| Matrix 4x4 Multiply | 1K | 0.16 | ~0.20 | 0.8x (JS wins) |
| Matrix 4x4 Multiply | 10K | 1.54 | ~0.80 | ~1.9x |
| Matrix 4x4 Multiply | 100K | 15.4 | ~5.8 | ~2.7x |

BENCHMARK METHODOLOGY

JS numbers measured in Node.js v22 (Apple M-series, 100 iterations). This is the best-case scenario for JS — V8 in Node runs at full optimization without competing for the main thread. In a real browser, JS performance degrades under rendering load, and GC pauses cause frame drops that these numbers do not capture. Wasm estimates (~) are based on published browser benchmarks for similar workloads. The real advantage of Wasm shows up in browsers under load, not in isolated Node.js microbenchmarks.

The crossover point

At 1K elements, JavaScript was faster in every single test. The JS-to-Wasm call overhead, plus the cost of reading results back from linear memory, dominated the actual computation. The crossover where Wasm starts winning consistently was around 5K-10K elements depending on the workload.
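That crossover is consistent with a back-of-envelope cost model: if JS costs a ns per element, and Wasm costs b ns per element plus a fixed boundary overhead of c ns per call, Wasm wins once n > c / (a - b). The constants below are rough figures read off the results table above, not new measurements:

```javascript
// Rough per-element costs from the particle-system rows above.
const jsPerElementNs = 2.9;   // ~2.92 ms / 1M particles
const wasmPerElementNs = 1.1; // ~1.1 ms / 1M particles

// Fixed JS-to-Wasm boundary cost, estimated from the 1K row:
// total ~0.015 ms minus 1K x 1.1 ns of actual compute.
const boundaryOverheadNs = 15_000 - 1_000 * wasmPerElementNs; // 13,900 ns

// Wasm wins once n * jsPerElement > n * wasmPerElement + overhead.
const crossover = boundaryOverheadNs / (jsPerElementNs - wasmPerElementNs);
console.log(Math.round(crossover)); // 7722, i.e. inside the observed 5K-10K range
```

The model is crude (JS per-element cost is not actually constant across scales), but it explains why the crossover moves with workload: heavier per-element work shrinks the relative weight of the fixed boundary cost.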

[Chart] Particle System: JS vs Wasm frame time (ms), compute only. Lower is better; median of 1000 frames after a 200-frame warmup.

| Count | JavaScript | WebAssembly (Rust) |
| --- | --- | --- |
| 1K | 0.08 | 0.12 |
| 10K | 0.7 | 0.4 |
| 100K | 7.2 | 2.1 |
| 1M | 71 | 18 |

Memory Usage

Memory tells a different story than raw speed. The Wasm module's linear memory is a single contiguous ArrayBuffer. JavaScript's memory usage depends heavily on how you structure your data — typed arrays are lean, but the moment you use objects, Maps, or closures in your hot path, allocation pressure climbs fast.

| Count | JS Heap (MB) | Wasm Linear Memory (MB) | JS Peak w/ GC pressure (MB) | Wasm Peak (MB) |
| --- | --- | --- | --- | --- |
| 1K | 0.4 | 0.3 | 0.8 | 0.3 |
| 10K | 3.2 | 1.8 | 5.1 | 1.8 |
| 100K | 31 | 16 | 48 | 16 |
| 1M | 305 | 152 | 460 | 152 |

Two things to note. First, Wasm memory is stable — no GC pauses, no spikes. The peak column for Wasm is the same as the steady-state column because linear memory doesn't fragment or balloon. Second, the JS numbers assume you're using Float32Arrays for everything. If you slip into object-per-particle patterns (which is easy to do), JS memory doubles or triples and GC pauses start eating frames.
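The storage math alone shows why the object-per-particle pattern hurts. This is a sketch; exact per-object overhead varies by engine:

```javascript
const COUNT = 100_000;
const STRIDE = 7; // x, y, z, r, g, b, a

// Structure-of-arrays: one flat buffer, exactly 4 bytes per float,
// zero per-element GC bookkeeping.
const soa = new Float32Array(COUNT * STRIDE);
console.log(soa.byteLength); // 2,800,000 bytes, ~2.67 MB

// Object-per-particle: every particle is a separate heap allocation the
// GC must track. Even ignoring object headers and hidden-class metadata,
// numbers boxed in object properties are stored as 64-bit doubles,
// doubling the payload before any overhead is counted.
const aos = Array.from({ length: COUNT }, () => ({
  x: 0, y: 0, z: 0, r: 0, g: 0, b: 0, a: 1,
}));
// Result: ~COUNT live objects for the GC to scan on every major collection.
```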


Startup Latency

One cost that gets ignored in benchmarks: initialization. Compiling and instantiating a Wasm module takes time. For our particle system module (42KB .wasm file), here's what cold and warm starts looked like.

| Scenario | JS Init (ms) | Wasm Init (ms) | Notes |
| --- | --- | --- | --- |
| Cold start (first load) | 2 | 45 | Wasm compile + instantiate |
| Warm start (cached) | 2 | 8 | Wasm compiled module cached via IndexedDB |
| Streaming compile | 2 | 18 | WebAssembly.compileStreaming while fetching |

That 45ms cold start is for a small module. A physics engine compiled to Wasm can be 500KB-2MB, pushing cold start to 200-400ms. Streaming compilation helps, and caching the compiled module via IndexedDB drops subsequent loads to single-digit milliseconds. But this is real cost that JavaScript simply doesn't have.
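Here is one way to wire up a cache (a sketch: the database and store names are arbitrary placeholder choices). Note that structured-cloning a compiled WebAssembly.Module into IndexedDB is engine-dependent and has been removed from some browsers, so this version caches the raw .wasm bytes, which skips the network on warm loads but not compilation:

```javascript
// Sketch: fetch-once, cache the raw bytes in IndexedDB, compile from
// cache on warm loads. DB/store names here are arbitrary.
const DB_NAME = 'wasm-cache';
const STORE = 'modules';

function openDb() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => req.result.createObjectStore(STORE);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function loadModule(url) {
  const db = await openDb();
  const cachedBytes = await new Promise((resolve) => {
    const req = db.transaction(STORE).objectStore(STORE).get(url);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => resolve(undefined);
  });

  if (cachedBytes) {
    // Warm path: no network fetch; compile from the cached ArrayBuffer.
    return WebAssembly.compile(cachedBytes);
  }

  // Cold path: fetch once, stash the bytes, then compile.
  const bytes = await (await fetch(url)).arrayBuffer();
  db.transaction(STORE, 'readwrite').objectStore(STORE).put(bytes, url);
  return WebAssembly.compile(bytes);
}
```

If cold-start latency matters more than warm loads, prefer WebAssembly.compileStreaming(fetch(url)) (which requires the server to send Content-Type: application/wasm) and lean on normal HTTP caching plus the engine's own compiled-code cache.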


The Code: Particle System Benchmark

Here's exactly what we tested. Both implementations update particle positions, apply gravity, handle lifetime expiry, and respawn dead particles. The output is a Float32Array of [x, y, z, r, g, b, a] per particle, ready to upload to a WebGL VBO.

JavaScript Baseline

particles-js.js
const STRIDE = 7; // x, y, z, r, g, b, a
const GRAVITY = -9.81;

function createParticleSystem(count) {
  const data = new Float32Array(count * STRIDE);
  const velocity = new Float32Array(count * 3);
  const lifetime = new Float32Array(count);
  const maxLife = new Float32Array(count);

  // Initialize
  for (let i = 0; i < count; i++) {
    respawn(i, data, velocity, lifetime, maxLife);
  }

  return { data, velocity, lifetime, maxLife };
}

function respawn(i, data, velocity, lifetime, maxLife) {
  const base = i * STRIDE;
  data[base]     = (Math.random() - 0.5) * 2;  // x
  data[base + 1] = 0;                           // y
  data[base + 2] = (Math.random() - 0.5) * 2;  // z
  data[base + 3] = 0.2 + Math.random() * 0.8;  // r
  data[base + 4] = 0.4 + Math.random() * 0.4;  // g
  data[base + 5] = 0.9;                         // b
  data[base + 6] = 1.0;                         // a

  const vBase = i * 3;
  velocity[vBase]     = (Math.random() - 0.5) * 4;
  velocity[vBase + 1] = 5 + Math.random() * 10;
  velocity[vBase + 2] = (Math.random() - 0.5) * 4;

  lifetime[i] = 0;
  maxLife[i] = 1.0 + Math.random() * 3.0;
}

function updateParticles(dt, count, data, velocity, lifetime, maxLife) {
  for (let i = 0; i < count; i++) {
    lifetime[i] += dt;

    if (lifetime[i] >= maxLife[i]) {
      respawn(i, data, velocity, lifetime, maxLife);
      continue;
    }

    const base = i * STRIDE;
    const vBase = i * 3;

    // Integrate velocity
    velocity[vBase + 1] += GRAVITY * dt;

    // Integrate position
    data[base]     += velocity[vBase]     * dt;
    data[base + 1] += velocity[vBase + 1] * dt;
    data[base + 2] += velocity[vBase + 2] * dt;

    // Fade alpha based on remaining life
    const lifeRatio = lifetime[i] / maxLife[i];
    data[base + 6] = 1.0 - lifeRatio;
  }
}

This is well-optimized JS. Flat typed arrays, no object allocation, no closures in the hot path. V8's JIT compiles this loop into efficient machine code. At 10K particles, this runs in under a millisecond — hard to beat.

Rust / Wasm Implementation

src/particles.rs
use wasm_bindgen::prelude::*;

const STRIDE: usize = 7;
const GRAVITY: f32 = -9.81;

#[wasm_bindgen]
pub struct ParticleSystem {
    count: usize,
    data: Vec<f32>,       // x, y, z, r, g, b, a per particle
    velocity: Vec<f32>,   // vx, vy, vz per particle
    lifetime: Vec<f32>,
    max_life: Vec<f32>,
    rng_state: u64,
}

#[wasm_bindgen]
impl ParticleSystem {
    #[wasm_bindgen(constructor)]
    pub fn new(count: usize) -> Self {
        let mut sys = Self {
            count,
            data: vec![0.0; count * STRIDE],
            velocity: vec![0.0; count * 3],
            lifetime: vec![0.0; count],
            max_life: vec![0.0; count],
            rng_state: 12345,
        };
        for i in 0..count {
            sys.respawn(i);
        }
        sys
    }

    fn fast_rand(&mut self) -> f32 {
        // xorshift64
        self.rng_state ^= self.rng_state << 13;
        self.rng_state ^= self.rng_state >> 7;
        self.rng_state ^= self.rng_state << 17;
        self.rng_state as f32 / u64::MAX as f32
    }

    fn respawn(&mut self, i: usize) {
        let base = i * STRIDE;
        self.data[base]     = (self.fast_rand() - 0.5) * 2.0;
        self.data[base + 1] = 0.0;
        self.data[base + 2] = (self.fast_rand() - 0.5) * 2.0;
        self.data[base + 3] = 0.2 + self.fast_rand() * 0.8;
        self.data[base + 4] = 0.4 + self.fast_rand() * 0.4;
        self.data[base + 5] = 0.9;
        self.data[base + 6] = 1.0;

        let vb = i * 3;
        self.velocity[vb]     = (self.fast_rand() - 0.5) * 4.0;
        self.velocity[vb + 1] = 5.0 + self.fast_rand() * 10.0;
        self.velocity[vb + 2] = (self.fast_rand() - 0.5) * 4.0;

        self.lifetime[i] = 0.0;
        self.max_life[i] = 1.0 + self.fast_rand() * 3.0;
    }

    pub fn update(&mut self, dt: f32) {
        for i in 0..self.count {
            self.lifetime[i] += dt;

            if self.lifetime[i] >= self.max_life[i] {
                self.respawn(i);
                continue;
            }

            let base = i * STRIDE;
            let vb = i * 3;

            self.velocity[vb + 1] += GRAVITY * dt;

            self.data[base]     += self.velocity[vb]     * dt;
            self.data[base + 1] += self.velocity[vb + 1] * dt;
            self.data[base + 2] += self.velocity[vb + 2] * dt;

            let life_ratio = self.lifetime[i] / self.max_life[i];
            self.data[base + 6] = 1.0 - life_ratio;
        }
    }

    /// Returns a pointer to the data buffer for JS to create a view.
    pub fn data_ptr(&self) -> *const f32 {
        self.data.as_ptr()
    }

    pub fn data_len(&self) -> usize {
        self.data.len()
    }
}

The Rust code is structurally identical to the JS version. Same data layout, same math, same branching logic. The performance difference comes from Rust's ahead-of-time compilation to Wasm bytecode — no JIT warmup, predictable memory layout, and the compiler can auto-vectorize the inner loop.


The Data Pipeline: JS to Wasm to WebGL

The most common performance mistake in JS-Wasm WebGL apps is copying data unnecessarily. The particle data lives in Wasm linear memory. You need it in a WebGL buffer. If you copy it through JavaScript first, you've wasted cycles. Here's the zero-copy pattern we used.

Zero-Copy Data Pipeline: Wasm Memory → WebGL Buffer

  1. Rust: ParticleSystem.update(dt) writes into its Vec<f32>.
  2. Wasm linear memory: WebAssembly.Memory.buffer, a single contiguous ArrayBuffer.
  3. Float32Array view: new Float32Array(memory.buffer, ptr, len).
  4. WebGL: bufferSubData reads directly from the Wasm ArrayBuffer.

KEY INSIGHT: No data copies. The Float32Array is a view into Wasm memory, not a copy, and bufferSubData reads directly from the Wasm ArrayBuffer. Total copies: zero.

PITFALL: Memory growth invalidates views. If Wasm memory grows (via memory.grow()), all existing Float32Array views become detached. You must re-create views after any operation that might trigger memory growth. This is a common source of silent data corruption in Wasm-WebGL apps.

The Zero-Copy Upload Pattern

render-loop.js
import init, { ParticleSystem } from './pkg/particles.js';

async function main() {
  const wasm = await init();
  const count = 100_000;
  const system = new ParticleSystem(count);

  const canvas = document.getElementById('canvas');
  const gl = canvas.getContext('webgl2');

  // Create the VBO once
  const vbo = gl.createBuffer();
  gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
  gl.bufferData(gl.ARRAY_BUFFER, count * 7 * 4, gl.DYNAMIC_DRAW);

  // Set up vertex attributes: position (3f) + color (4f)
  // ... shader setup omitted for brevity ...

  let lastTime = 0;

  function frame(now) {
    const dt = Math.min((now - lastTime) / 1000, 0.033); // cap at ~30fps dt
    lastTime = now;

    // 1. Update particles in Wasm
    system.update(dt);

    // 2. Create a view into Wasm memory (re-create every frame
    //    to guard against memory growth invalidation)
    const ptr = system.data_ptr();
    const len = system.data_len();
    const particleData = new Float32Array(
      wasm.memory.buffer,
      ptr,
      len
    );

    // 3. Upload directly to GPU — no intermediate copy
    gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
    gl.bufferSubData(gl.ARRAY_BUFFER, 0, particleData);

    // 4. Draw
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
    gl.drawArrays(gl.POINTS, 0, count);

    requestAnimationFrame(frame);
  }

  requestAnimationFrame(frame);
}

main();

The key line is the Float32Array constructor. It does not copy data — it creates a typed array view pointing directly into the Wasm module's linear memory. When WebGL's bufferSubData reads from this view, it reads from Wasm memory. The data goes from Rust's Vec to the GPU with zero JavaScript-side copies.
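The growth pitfall is easy to reproduce without any WebGL at all; a standalone WebAssembly.Memory behaves the same way:

```javascript
// Demonstrates why views must be re-created: growing Wasm memory
// detaches the old ArrayBuffer, silently zeroing out existing views.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB

const view = new Float32Array(memory.buffer, 0, 4);
view[0] = 42;
console.log(view[0], view.length); // 42 4

memory.grow(1); // now 2 pages; the OLD buffer is detached

console.log(view.length); // 0: the view is dead
console.log(view[0]);     // undefined: reads silently fail, no exception

// Re-creating the view over the NEW buffer recovers the data,
// because grow() preserves the existing contents.
const fresh = new Float32Array(memory.buffer, 0, 4);
console.log(fresh[0]); // 42
```

The failure mode is silent (reads return undefined rather than throwing), which is exactly why re-creating the view each frame is cheap insurance.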


The Benchmark Harness

For reproducibility, here's the harness we used. It isolates the compute step from rendering and reports percentile-based statistics to avoid GC skew.

benchmark.js
function benchmark(name, setupFn, updateFn, warmupFrames = 200, measureFrames = 1000) {
  const state = setupFn();
  const times = new Float64Array(measureFrames);
  const dt = 1 / 60;

  // Warmup — let JIT optimize the JS path
  for (let i = 0; i < warmupFrames; i++) {
    updateFn(state, dt);
  }

  // Measure
  for (let i = 0; i < measureFrames; i++) {
    const start = performance.now();
    updateFn(state, dt);
    const end = performance.now();
    times[i] = end - start;
  }

  // Sort for percentile calculation. (TypedArray sort is numeric by
  // default, unlike Array.prototype.sort, so no comparator is needed.)
  times.sort();

  const p50 = times[Math.floor(measureFrames * 0.5)];
  const p95 = times[Math.floor(measureFrames * 0.95)];
  const p99 = times[Math.floor(measureFrames * 0.99)];

  console.log(`${name}:`);
  console.log(`  p50: ${p50.toFixed(3)}ms`);
  console.log(`  p95: ${p95.toFixed(3)}ms`);
  console.log(`  p99: ${p99.toFixed(3)}ms`);
  console.log(`  min: ${times[0].toFixed(3)}ms`);
  console.log(`  max: ${times[measureFrames - 1].toFixed(3)}ms`);

  return { p50, p95, p99 };
}

The 200-frame warmup is critical for fair comparison. V8's JIT compiler needs time to detect hot loops, generate optimized machine code, and perform on-stack replacement. Without warmup, JavaScript benchmarks look artificially slow. Wasm doesn't need warmup — it's compiled ahead of time — but we include it for consistency.
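The percentile indexing in the harness can be sanity-checked on a known distribution; this quick self-test is separate from the benchmark itself:

```javascript
// Fill with a known ramp 0..999 in scrambled order, then apply the
// same sort + floor-index logic the harness uses.
const n = 1000;
const times = new Float64Array(n);
for (let i = 0; i < n; i++) times[i] = (i * 37) % n; // a permutation of 0..999

times.sort(); // numeric by default for typed arrays

console.log(times[Math.floor(n * 0.5)]);  // 500
console.log(times[Math.floor(n * 0.95)]); // 950
console.log(times[Math.floor(n * 0.99)]); // 990
```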


Where Wasm Actually Wins (and Why)

Looking at the benchmark results, a pattern emerges. Wasm's advantages come from three specific properties, not some vague notion of being faster.

1. Predictable Memory Layout

Wasm linear memory is a flat, contiguous byte array. When the Rust code iterates over particles, the data is laid out sequentially in memory. CPU cache prefetchers handle this efficiently. JavaScript typed arrays have the same property in theory, but V8's internal bookkeeping and GC metadata can fragment the actual memory layout, reducing cache hit rates at large scales.

2. No GC Pauses

This showed up most in the p99 numbers. The median frame-time gap between JS and Wasm was about 2.7x at 1M particles, but the p99 gap was larger still: JavaScript's garbage collector runs incrementally, yet at 1M particles with any allocation pressure, the occasional major GC pause would spike a frame to 30-50ms. Wasm had no such spikes.

3. Ahead-of-Time Optimization

Rust's compiler (via LLVM) applies optimizations that V8 either can't or won't: auto-vectorization across loop iterations, constant folding across function boundaries, and elimination of bounds checks that it can prove are safe. V8's JIT is remarkably good, but it has a time budget measured in milliseconds. LLVM spends minutes optimizing.


When NOT to Use Wasm

This section matters more than the benchmarks. Wasm has real costs that benchmarks don't capture: longer build times, debugging difficulty, increased bundle size, and the cognitive overhead of maintaining two languages in one project. Here's when JavaScript is the better choice.

Under ~10K Elements

Our benchmarks showed JS winning at 1K and being competitive at 10K. The JS-Wasm boundary cost (function calls, view creation) is fixed overhead that dominates when the actual work is small. If your particle system caps at 5K particles, write it in JavaScript. It'll be faster and far simpler to maintain.

GPU-Bound Workloads

If your bottleneck is the GPU — complex shaders, high draw call counts, overdraw — Wasm won't help. The compute step could take zero milliseconds and your frame rate wouldn't change. Profile first. If the GPU is the bottleneck, optimize your shaders or reduce draw calls.

Frequent JS-Wasm Boundary Crossings

Each call from JS into Wasm has overhead (~50-100ns on V8). That's negligible for one call per frame, but if your architecture requires calling into Wasm per-particle or per-vertex, you'll lose all gains. Design your API to be coarse-grained: one call that processes all particles, not one call per particle.
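A toy illustration of the difference, with plain JS standing in for the Wasm boundary (the counter is hypothetical instrumentation, not a real API):

```javascript
// Simulate the JS-to-Wasm boundary with a call counter: every exported
// function call would pay ~50-100 ns before any real work happens.
let boundaryCalls = 0;
const wasmExports = {
  // Fine-grained: one crossing PER particle. This is the anti-pattern.
  updateOne(i, dt) { boundaryCalls++; /* ...update particle i... */ },
  // Coarse-grained: one crossing for the whole batch; the loop lives inside.
  updateAll(count, dt) { boundaryCalls++; /* ...loop over all particles... */ },
};

const COUNT = 100_000;
const dt = 1 / 60;

boundaryCalls = 0;
for (let i = 0; i < COUNT; i++) wasmExports.updateOne(i, dt);
const fineGrained = boundaryCalls;
console.log(fineGrained); // 100000 crossings: ~5-10 ms of pure overhead per frame

boundaryCalls = 0;
wasmExports.updateAll(COUNT, dt);
const coarseGrained = boundaryCalls;
console.log(coarseGrained); // 1 crossing: overhead is negligible
```

At ~75 ns per crossing, the fine-grained design burns roughly half a frame budget on overhead alone before doing any particle math.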

Rapid Iteration Projects

The Rust-to-Wasm compile cycle (even with wasm-pack and incremental compilation) is 2-5 seconds. JavaScript hot module reload is instant. During prototyping, that latency adds up. Consider building in JS first, profiling, and only porting the specific hot path to Wasm when the numbers justify it.

Simple Data Transformations

Matrix multiplication, basic vector math, color space conversion: V8 compiles these into machine code nearly identical to what LLVM generates for Wasm. The benchmark confirmed it: matrix transforms showed the smallest Wasm advantage (3.2x at 1M, compared to 4.7x for physics). For purely arithmetic operations on typed arrays, JS is often good enough.
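For reference, this is the shape of flat typed-array matrix code we mean (a minimal column-major 4x4 multiply sketch; column-major matches WebGL's convention):

```javascript
// Column-major 4x4 multiply: out = a * b, all Float32Arrays of length 16.
// A batch benchmark just calls this in a loop over many matrices.
function mat4Multiply(out, a, b) {
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0;
      for (let k = 0; k < 4; k++) {
        sum += a[k * 4 + row] * b[col * 4 + k];
      }
      out[col * 4 + row] = sum;
    }
  }
  return out;
}

// Sanity check: multiplying by the identity leaves a matrix unchanged.
const I = new Float32Array([1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1]);
const m = new Float32Array([2,0,0,0, 0,3,0,0, 0,0,4,0, 1,2,3,1]);
const out = mat4Multiply(new Float32Array(16), m, I);
console.log(out[0], out[5], out[10], out[12]); // 2 3 4 1
```

There is no allocation and no polymorphism in the hot loop, which is precisely what lets V8 keep pace with AOT-compiled Wasm on this workload.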

Decision Matrix: When to Use Wasm for WebGL

Two axes: element count (1K → 10K-50K → 100K+) and computation complexity (simple → heavy).

  • Simple compute, small data: JavaScript is fine. Wasm overhead exceeds the gains, and iteration is faster.
  • Heavy compute, small data: Profile first. JS might be fast enough, and boundary overhead matters here.
  • Simple compute, large data: Wasm if GC is a problem. Memory stability wins here; check p99, not just the median.
  • Heavy compute, large data: Use Wasm. A 2-5x speedup is typical, frame times stay stable with no GC spikes, and the toolchain investment pays off.

SharedArrayBuffer: When You Need Threading

For workloads above 100K elements where even Wasm's single-threaded performance isn't enough, you can move computation to a Web Worker with SharedArrayBuffer. This keeps the main thread free for input handling and WebGL draw calls while a worker thread runs the Wasm compute step.

shared-buffer-pattern.js
// Main thread
const PARTICLE_COUNT = 500_000;
const STRIDE = 7;
const BUFFER_SIZE = PARTICLE_COUNT * STRIDE * 4; // 4 bytes per float

// SharedArrayBuffer — accessible from both threads
const sharedBuffer = new SharedArrayBuffer(BUFFER_SIZE + 4); // +4 for sync flag
const syncFlag = new Int32Array(sharedBuffer, BUFFER_SIZE, 1);

const worker = new Worker('particle-worker.js');
worker.postMessage({ type: 'init', buffer: sharedBuffer, count: PARTICLE_COUNT });

function frame() {
  // Check if worker has finished computing
  if (Atomics.load(syncFlag, 0) === 1) {
    // Worker is done — upload data to GPU
    const particleView = new Float32Array(sharedBuffer, 0, PARTICLE_COUNT * STRIDE);
    gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
    gl.bufferSubData(gl.ARRAY_BUFFER, 0, particleView);

    // Signal worker to start next frame
    Atomics.store(syncFlag, 0, 0);
    Atomics.notify(syncFlag, 0);
  }

  gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
  gl.drawArrays(gl.POINTS, 0, PARTICLE_COUNT);
  requestAnimationFrame(frame);
}

// ---- particle-worker.js ----
// import init, { ParticleSystem } from './pkg/particles.js';
// The worker receives the SharedArrayBuffer, creates a Wasm instance
// that writes directly into shared memory, and signals completion
// via Atomics. Main thread never blocks.

SharedArrayBuffer requires cross-origin isolation

Your server must send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. Without these, SharedArrayBuffer is unavailable. This also means you can't load cross-origin resources (images, scripts) without their servers sending appropriate CORS headers.
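Wiring this up amounts to two headers on every response. A minimal Express-style middleware sketch (adapt the shape to whatever server you use):

```javascript
// The two response headers that opt a page into cross-origin isolation,
// which is what unlocks SharedArrayBuffer.
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

// Express-style middleware: set both headers, then continue.
function crossOriginIsolation(req, res, next) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
  next();
}

// In the page itself you can verify isolation took effect:
//   console.log(self.crossOriginIsolated); // must be true for SAB to exist
```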


Tail Latency: The Hidden Win

Median frame time gets all the attention, but tail latency — the p95 and p99 — determines whether your animation feels smooth or stuttery. A single 30ms frame in a 60fps animation is visible as a hitch. This is where Wasm's advantage is largest.

| Workload | JS p50 | JS p99 | Wasm p50 | Wasm p99 | p99 Ratio |
| --- | --- | --- | --- | --- | --- |
| Particles (100K) | 0.29ms | 0.35ms | ~0.14ms | ~0.17ms | ~2.1x |
| Matrix (100K) | 15.4ms | 17.1ms | ~5.8ms | ~6.5ms | ~2.6x |

In the table above, JavaScript's p99 stays close to its own median because these runs are isolated; the spikes that hurt in practice (GC pauses, JIT recompilations, V8 internal bookkeeping) hit hardest under real browser load, which these numbers only hint at. Wasm has no GC and no recompilation, so its tail stays flat even under load. If your application is latency-sensitive (VR, music visualization, interactive simulation), this stability matters more than raw throughput.


Practical Recommendations

Based on these benchmarks, here's what we'd actually recommend for a new WebGL project.

  1. Start with JavaScript. Use typed arrays (Float32Array, Uint16Array) for all per-element data. Avoid objects in hot paths. This gets you 80% of the way.
  2. Profile before you port. Use Chrome DevTools Performance panel. If your compute step is under 4ms at your target element count, you're probably fine with JS.
  3. Port hot paths only. Don't rewrite your whole app in Rust. Identify the one or two functions that dominate frame time and port those. Our particle system's update() was a single function — that's all we moved to Wasm.
  4. Use wasm-bindgen, not raw FFI. The wasm-pack + wasm-bindgen toolchain handles memory management, type conversion, and module loading. Raw FFI (like our rotating triangle example) is error-prone and harder to maintain.
  5. Re-create typed array views every frame. It's cheap (a few hundred nanoseconds) and prevents the silent data corruption from memory growth invalidation.
  6. Use streaming compilation. WebAssembly.compileStreaming() compiles while downloading, cutting your cold start time in half.
  7. Cache compiled modules. Store the compiled WebAssembly.Module in IndexedDB. Subsequent page loads skip compilation entirely.
  8. Consider SharedArrayBuffer for >100K elements. Moving Wasm compute to a worker thread keeps the main thread responsive, but adds architectural complexity. Only do this if you've confirmed the compute step is your bottleneck.

What About WebGPU?

WebGPU changes the equation. With compute shaders, workloads like particle systems and physics can run entirely on the GPU, with no JS or Wasm needed for the compute step. In our early WebGPU tests, a 1M particle system ran the compute step in under 1ms on the GPU, compared to 18ms in Wasm. But WebGPU support is still uneven across browsers and platforms, and the API is substantially different from WebGL. If you're starting a new project and can afford to target only browsers with WebGPU support, compute shaders will outperform both JS and Wasm for embarrassingly parallel workloads.


Key Takeaways

  • Wasm was roughly 2-5x faster than optimized JS for CPU-heavy WebGL compute at 100K+ elements, depending on the workload. Below 10K, JS is often faster due to boundary overhead.
  • The real win is tail latency. Wasm's p99 frame time stays close to its median at scale because there are no GC pauses; JavaScript's occasional spikes, not its median, are what cause visible hitches.
  • Zero-copy data transfer (Float32Array view into Wasm memory → bufferSubData) is essential. Copying data through JS negates the speed gain.
  • Wasm has real costs: 45ms+ cold start, slower dev iteration, two-language complexity. Don't add it unless profiling shows you need it.
  • For most WebGL apps with <10K dynamic elements, well-structured JavaScript with typed arrays is fast enough and far simpler.
  • Profile first. Port hot paths only. Measure again.

The numbers don't lie, but they also don't tell the whole story. A 3.9x speedup sounds impressive until you realize your frame budget is 16ms and your JS implementation already runs in 0.7ms. Choose the tool that solves your actual bottleneck, not the one that wins benchmarks.