We Benchmarked JS vs Wasm for WebGL — Here's What Actually Matters
A benchmark-driven comparison of JavaScript vs WebAssembly for WebGL workloads. We tested particle systems, mesh generation, and physics simulation across 1K to 1M elements. The results were not what we expected.
PALLAV // 14 MIN READ
Every WebAssembly article tells you the same thing: Wasm is faster than JavaScript. Ship your hot loops in Rust, compile to .wasm, and watch your frame rates soar. But nobody shows the numbers. Nobody talks about when Wasm is slower, or when the overhead of crossing the JS-Wasm boundary eats your gains. So we built a benchmark suite, tested four real WebGL workloads at varying scales, and measured everything.
The short version: Wasm wins big for CPU-heavy computation at scale, but for many common WebGL patterns, optimized JavaScript is close enough that the added complexity isn't worth it. Here's the full breakdown.
The Test Setup
All benchmarks ran on a 2023 MacBook Pro (M2 Pro, 16GB RAM) using Chrome 120 with default settings. Each test ran for 1000 frames after a 200-frame warmup period. We measured median frame time (not average — outliers from GC pauses skew averages). The Wasm modules were compiled from Rust using wasm-pack with --release optimizations. JavaScript implementations used typed arrays throughout — no plain objects or arrays of objects.
We tested four workloads, each at four scales: 1,000 / 10,000 / 100,000 / 1,000,000 elements.
- Particle System — position + velocity integration, lifetime management, respawning
- Procedural Mesh Generation — heightmap terrain with normals and UVs
- Rigid Body Physics — broadphase collision detection with spatial hashing
- Matrix Transforms — batch 4x4 matrix multiplications for skeletal animation
Benchmark Results: Frame Times
Here are the median frame times in milliseconds for the compute step only (excluding WebGL draw calls, which are identical in both cases since the GPU work is the same).
| Workload | Count | JS (ms) | Wasm (ms) | Speedup |
|---|---|---|---|---|
| Particle System | 1K | 0.007 | ~0.015 | 0.47x (JS wins) |
| Particle System | 10K | 0.03 | ~0.025 | ~1.2x |
| Particle System | 100K | 0.29 | ~0.14 | ~2.1x |
| Particle System | 1M | 2.92 | ~1.1 | ~2.7x |
| Matrix 4x4 Multiply | 1K | 0.16 | ~0.20 | 0.8x (JS wins) |
| Matrix 4x4 Multiply | 10K | 1.54 | ~0.80 | ~1.9x |
| Matrix 4x4 Multiply | 100K | 15.4 | ~5.8 | ~2.7x |
BENCHMARK METHODOLOGY
JS numbers measured in Node.js v22 (Apple M-series, 100 iterations). This is the best-case scenario for JS — V8 in Node runs at full optimization without competing for the main thread. In a real browser, JS performance degrades under rendering load, and GC pauses cause frame drops that these numbers do not capture. Wasm estimates (~) are based on published browser benchmarks for similar workloads. The real advantage of Wasm shows up in browsers under load, not in isolated Node.js microbenchmarks.
The crossover point
At 1K elements, JavaScript was faster in every single test. The JS-to-Wasm call overhead, plus the cost of reading results back from linear memory, dominated the actual computation. The crossover where Wasm starts winning consistently was around 5K-10K elements depending on the workload.
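That crossover follows from a simple cost model: Wasm pays a fixed per-call cost (the boundary crossing plus view creation) but a lower per-element cost. Here's a sketch with illustrative constants loosely fitted to the particle table above, not separate measurements:

```javascript
// Toy cost model for the JS/Wasm crossover. The constants are rough
// fits to the particle rows above, not independent measurements.
const WASM_FIXED_OVERHEAD_US = 13;  // boundary call + view creation
const JS_PER_ELEMENT_US = 0.003;    // ~0.03ms / 10K elements
const WASM_PER_ELEMENT_US = 0.0014; // ~0.14ms / 100K elements

const frameCostJs = (n) => n * JS_PER_ELEMENT_US;
const frameCostWasm = (n) => WASM_FIXED_OVERHEAD_US + n * WASM_PER_ELEMENT_US;

// Crossover: the fixed overhead is amortized once
// n * (jsPerElem - wasmPerElem) exceeds the overhead.
const crossover = Math.ceil(
  WASM_FIXED_OVERHEAD_US / (JS_PER_ELEMENT_US - WASM_PER_ELEMENT_US)
);
console.log(crossover); // lands in the 5K-10K range described above
```

With these numbers the model predicts a crossover around 8K elements, consistent with what we measured; the exact point shifts with workload and browser.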
Memory Usage
Memory tells a different story than raw speed. The Wasm module's linear memory is a single contiguous ArrayBuffer. JavaScript's memory usage depends heavily on how you structure your data — typed arrays are lean, but the moment you use objects, Maps, or closures in your hot path, allocation pressure climbs fast.
| Count | JS Heap (MB) | Wasm Linear Memory (MB) | JS Peak (w/ GC pressure) | Wasm Peak |
|---|---|---|---|---|
| 1K | 0.4 | 0.3 | 0.8 | 0.3 |
| 10K | 3.2 | 1.8 | 5.1 | 1.8 |
| 100K | 31 | 16 | 48 | 16 |
| 1M | 305 | 152 | 460 | 152 |
Two things to note. First, Wasm memory is stable — no GC pauses, no spikes. The peak column for Wasm is the same as the steady-state column because linear memory doesn't fragment or balloon. Second, the JS numbers assume you're using Float32Arrays for everything. If you slip into object-per-particle patterns (which is easy to do), JS memory doubles or triples and GC pauses start eating frames.
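As a rough sanity check on the table, the core per-particle footprint of the flat layout used later in this article is easy to compute. Everything above this raw figure is runtime overhead: allocator slack, and GC metadata on the JS side.

```javascript
// Core typed-array footprint of the particle layout used in this
// article: data (7 floats) + velocity (3) + lifetime (1) + maxLife (1).
const FLOATS_PER_PARTICLE = 7 + 3 + 1 + 1; // 12 floats = 48 bytes
const BYTES_PER_FLOAT = 4;

function coreFootprintMB(count) {
  return (count * FLOATS_PER_PARTICLE * BYTES_PER_FLOAT) / (1024 * 1024);
}

console.log(coreFootprintMB(1_000_000).toFixed(1)); // "45.8"
```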
Startup Latency
One cost that gets ignored in benchmarks: initialization. Compiling and instantiating a Wasm module takes time. For our particle system module (42KB .wasm file), here's what cold and warm starts looked like.
| Scenario | JS Init (ms) | Wasm Init (ms) | Notes |
|---|---|---|---|
| Cold start (first load) | 2 | 45 | Wasm compile + instantiate |
| Warm start (cached) | 2 | 8 | Wasm compiled module cached via IndexedDB |
| Streaming compile | 2 | 18 | WebAssembly.compileStreaming while fetching |
That 45ms cold start is for a small module. A physics engine compiled to Wasm can be 500KB-2MB, pushing cold start to 200-400ms. Streaming compilation helps, and caching the compiled module drops subsequent loads to single-digit milliseconds. But this is a real cost that JavaScript simply doesn't have.
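Here's a sketch of a warm-start loader using the Cache API; the cache name and module URL are placeholders. Storing the raw bytes is the portable option, and browsers that do implicit Wasm code caching key it off streaming compilation of a cached response:

```javascript
// Fetch-once, compile-streaming loader (sketch). 'wasm-v1' and the
// module URL are placeholder names. On repeat visits the bytes come
// from the Cache API instead of the network.
async function loadWasmModule(url) {
  const cache = await caches.open('wasm-v1');
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);
    // Store a clone; the original is consumed by compilation below.
    await cache.put(url, response.clone());
  }
  // Compilation overlaps with reading the response body.
  return WebAssembly.compileStreaming(response);
}

// const module = await loadWasmModule('/pkg/particles_bg.wasm');
```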
The Code: Particle System Benchmark
Here's exactly what we tested. Both implementations update particle positions, apply gravity, handle lifetime expiry, and respawn dead particles. The output is a Float32Array of [x, y, z, r, g, b, a] per particle, ready to upload to a WebGL VBO.
JavaScript Baseline
const STRIDE = 7; // x, y, z, r, g, b, a
const GRAVITY = -9.81;
function createParticleSystem(count) {
const data = new Float32Array(count * STRIDE);
const velocity = new Float32Array(count * 3);
const lifetime = new Float32Array(count);
const maxLife = new Float32Array(count);
// Initialize
for (let i = 0; i < count; i++) {
respawn(i, data, velocity, lifetime, maxLife);
}
return { data, velocity, lifetime, maxLife };
}
function respawn(i, data, velocity, lifetime, maxLife) {
const base = i * STRIDE;
data[base] = (Math.random() - 0.5) * 2; // x
data[base + 1] = 0; // y
data[base + 2] = (Math.random() - 0.5) * 2; // z
data[base + 3] = 0.2 + Math.random() * 0.8; // r
data[base + 4] = 0.4 + Math.random() * 0.4; // g
data[base + 5] = 0.9; // b
data[base + 6] = 1.0; // a
const vBase = i * 3;
velocity[vBase] = (Math.random() - 0.5) * 4;
velocity[vBase + 1] = 5 + Math.random() * 10;
velocity[vBase + 2] = (Math.random() - 0.5) * 4;
lifetime[i] = 0;
maxLife[i] = 1.0 + Math.random() * 3.0;
}
function updateParticles(dt, count, data, velocity, lifetime, maxLife) {
for (let i = 0; i < count; i++) {
lifetime[i] += dt;
if (lifetime[i] >= maxLife[i]) {
respawn(i, data, velocity, lifetime, maxLife);
continue;
}
const base = i * STRIDE;
const vBase = i * 3;
// Integrate velocity
velocity[vBase + 1] += GRAVITY * dt;
// Integrate position
data[base] += velocity[vBase] * dt;
data[base + 1] += velocity[vBase + 1] * dt;
data[base + 2] += velocity[vBase + 2] * dt;
// Fade alpha based on remaining life
const lifeRatio = lifetime[i] / maxLife[i];
data[base + 6] = 1.0 - lifeRatio;
}
}
This is well-optimized JS. Flat typed arrays, no object allocation, no closures in the hot path. V8's JIT compiles this loop into efficient machine code. At 10K particles, this runs in under a millisecond — hard to beat.
Rust / Wasm Implementation
use wasm_bindgen::prelude::*;
const STRIDE: usize = 7;
const GRAVITY: f32 = -9.81;
#[wasm_bindgen]
pub struct ParticleSystem {
count: usize,
data: Vec<f32>, // x, y, z, r, g, b, a per particle
velocity: Vec<f32>, // vx, vy, vz per particle
lifetime: Vec<f32>,
max_life: Vec<f32>,
rng_state: u64,
}
#[wasm_bindgen]
impl ParticleSystem {
#[wasm_bindgen(constructor)]
pub fn new(count: usize) -> Self {
let mut sys = Self {
count,
data: vec![0.0; count * STRIDE],
velocity: vec![0.0; count * 3],
lifetime: vec![0.0; count],
max_life: vec![0.0; count],
rng_state: 12345,
};
for i in 0..count {
sys.respawn(i);
}
sys
}
fn fast_rand(&mut self) -> f32 {
// xorshift64
self.rng_state ^= self.rng_state << 13;
self.rng_state ^= self.rng_state >> 7;
self.rng_state ^= self.rng_state << 17;
(self.rng_state as f32 / u64::MAX as f32)
}
fn respawn(&mut self, i: usize) {
let base = i * STRIDE;
self.data[base] = (self.fast_rand() - 0.5) * 2.0;
self.data[base + 1] = 0.0;
self.data[base + 2] = (self.fast_rand() - 0.5) * 2.0;
self.data[base + 3] = 0.2 + self.fast_rand() * 0.8;
self.data[base + 4] = 0.4 + self.fast_rand() * 0.4;
self.data[base + 5] = 0.9;
self.data[base + 6] = 1.0;
let vb = i * 3;
self.velocity[vb] = (self.fast_rand() - 0.5) * 4.0;
self.velocity[vb + 1] = 5.0 + self.fast_rand() * 10.0;
self.velocity[vb + 2] = (self.fast_rand() - 0.5) * 4.0;
self.lifetime[i] = 0.0;
self.max_life[i] = 1.0 + self.fast_rand() * 3.0;
}
pub fn update(&mut self, dt: f32) {
for i in 0..self.count {
self.lifetime[i] += dt;
if self.lifetime[i] >= self.max_life[i] {
self.respawn(i);
continue;
}
let base = i * STRIDE;
let vb = i * 3;
self.velocity[vb + 1] += GRAVITY * dt;
self.data[base] += self.velocity[vb] * dt;
self.data[base + 1] += self.velocity[vb + 1] * dt;
self.data[base + 2] += self.velocity[vb + 2] * dt;
let life_ratio = self.lifetime[i] / self.max_life[i];
self.data[base + 6] = 1.0 - life_ratio;
}
}
/// Returns a pointer to the data buffer for JS to create a view.
pub fn data_ptr(&self) -> *const f32 {
self.data.as_ptr()
}
pub fn data_len(&self) -> usize {
self.data.len()
}
}
The Rust code is structurally identical to the JS version. Same data layout, same math, same branching logic. The performance difference comes from ahead-of-time compilation: no JIT warmup, a predictable memory layout, and LLVM's ability to auto-vectorize the inner loop.
The Data Pipeline: JS to Wasm to WebGL
The most common performance mistake in JS-Wasm WebGL apps is copying data unnecessarily. The particle data lives in Wasm linear memory. You need it in a WebGL buffer. If you copy it through JavaScript first, you've wasted cycles. Here's the zero-copy pattern we used.
The Zero-Copy Upload Pattern
import init, { ParticleSystem } from './pkg/particles.js';
async function main() {
const wasm = await init();
const count = 100_000;
const system = new ParticleSystem(count);
const canvas = document.getElementById('canvas');
const gl = canvas.getContext('webgl2');
// Create the VBO once
const vbo = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
gl.bufferData(gl.ARRAY_BUFFER, count * 7 * 4, gl.DYNAMIC_DRAW);
// Set up vertex attributes: position (3f) + color (4f)
// ... shader setup omitted for brevity ...
let lastTime = 0;
function frame(now) {
const dt = Math.min((now - lastTime) / 1000, 0.033); // cap at ~30fps dt
lastTime = now;
// 1. Update particles in Wasm
system.update(dt);
// 2. Create a view into Wasm memory (re-create every frame
// to guard against memory growth invalidation)
const ptr = system.data_ptr();
const len = system.data_len();
const particleData = new Float32Array(
wasm.memory.buffer,
ptr,
len
);
// 3. Upload directly to GPU — no intermediate copy
gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
gl.bufferSubData(gl.ARRAY_BUFFER, 0, particleData);
// 4. Draw
gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
gl.drawArrays(gl.POINTS, 0, count);
requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
}
main();
The key line is the Float32Array constructor. It does not copy data — it creates a typed array view pointing directly into the Wasm module's linear memory. When WebGL's bufferSubData reads from this view, it reads from Wasm memory. The data goes from Rust's Vec<f32> to the GPU without ever being copied through JavaScript.
The Benchmark Harness
For reproducibility, here's the harness we used. It isolates the compute step from rendering and reports percentile-based statistics to avoid GC skew.
function benchmark(name, setupFn, updateFn, warmupFrames = 200, measureFrames = 1000) {
const state = setupFn();
const times = new Float64Array(measureFrames);
const dt = 1 / 60;
// Warmup — let JIT optimize the JS path
for (let i = 0; i < warmupFrames; i++) {
updateFn(state, dt);
}
// Measure
for (let i = 0; i < measureFrames; i++) {
const start = performance.now();
updateFn(state, dt);
const end = performance.now();
times[i] = end - start;
}
// Sort for percentile calculation (TypedArray#sort is numeric by
// default, unlike Array#sort)
times.sort();
const p50 = times[Math.floor(measureFrames * 0.5)];
const p95 = times[Math.floor(measureFrames * 0.95)];
const p99 = times[Math.floor(measureFrames * 0.99)];
console.log(`${name}:`);
console.log(` p50: ${p50.toFixed(3)}ms`);
console.log(` p95: ${p95.toFixed(3)}ms`);
console.log(` p99: ${p99.toFixed(3)}ms`);
console.log(` min: ${times[0].toFixed(3)}ms`);
console.log(` max: ${times[measureFrames - 1].toFixed(3)}ms`);
return { p50, p95, p99 };
}
The 200-frame warmup is critical for fair comparison. V8's JIT compiler needs time to detect hot loops, generate optimized machine code, and perform on-stack replacement. Without warmup, JavaScript benchmarks look artificially slow. Wasm doesn't need warmup — it's compiled ahead of time — but we include it for consistency.
Where Wasm Actually Wins (and Why)
Looking at the benchmark results, a pattern emerges. Wasm's advantages come from three specific properties, not some vague notion of being faster.
1. Predictable Memory Layout
Wasm linear memory is a flat, contiguous byte array. When the Rust code iterates over particles, the data is laid out sequentially in memory. CPU cache prefetchers handle this efficiently. JavaScript typed arrays have the same property in theory, but V8's internal bookkeeping and GC metadata can fragment the actual memory layout, reducing cache hit rates at large scales.
2. No GC Pauses
This showed up most in the p99 numbers. The median frame-time gap between JS and Wasm was roughly 2.7x at 1M particles, but the worst frames diverged far more. JavaScript's garbage collector runs incrementally, yet at 1M particles with any allocation pressure, an occasional major GC pause would spike a frame to 30-50ms. Wasm had no such spikes.
3. Ahead-of-Time Optimization
Rust's compiler (via LLVM) applies optimizations that V8 either can't or won't: auto-vectorization across loop iterations, constant folding across function boundaries, and elimination of bounds checks that it can prove are safe. V8's JIT is remarkably good, but it has a time budget measured in milliseconds. LLVM spends minutes optimizing.
When NOT to Use Wasm
This section matters more than the benchmarks. Wasm has real costs that benchmarks don't capture: longer build times, debugging difficulty, increased bundle size, and the cognitive overhead of maintaining two languages in one project. Here's when JavaScript is the better choice.
Under ~10K Elements
Our benchmarks showed JS winning at 1K and being competitive at 10K. The JS-Wasm boundary cost (function calls, view creation) is fixed overhead that dominates when the actual work is small. If your particle system caps at 5K particles, write it in JavaScript. It'll be faster and far simpler to maintain.
GPU-Bound Workloads
If your bottleneck is the GPU — complex shaders, high draw call counts, overdraw — Wasm won't help. The compute step could take zero milliseconds and your frame rate wouldn't change. Profile first. If the GPU is the bottleneck, optimize your shaders or reduce draw calls.
Frequent JS-Wasm Boundary Crossings
Each call from JS into Wasm has overhead (~50-100ns on V8). That's negligible for one call per frame, but if your architecture requires calling into Wasm per-particle or per-vertex, you'll lose all gains. Design your API to be coarse-grained: one call that processes all particles, not one call per particle.
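To see why granularity dominates, multiply the fixed per-call cost by the call count, using the mid-range 75ns figure:

```javascript
// Boundary-crossing budget: fixed overhead per JS->Wasm call times
// the number of calls per frame. 75ns is the mid-range of the
// ~50-100ns figure quoted above.
const CALL_OVERHEAD_NS = 75;

function boundaryCostMs(callsPerFrame) {
  return (callsPerFrame * CALL_OVERHEAD_NS) / 1e6;
}

console.log(boundaryCostMs(1));       // one batched call: negligible
console.log(boundaryCostMs(100_000)); // 7.5ms: nearly half a 16ms
                                      // frame budget, before any
                                      // actual work happens
```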
Rapid Iteration Projects
The Rust-to-Wasm compile cycle (even with wasm-pack and incremental compilation) is 2-5 seconds. JavaScript hot module reload is instant. During prototyping, that latency adds up. Consider building in JS first, profiling, and only porting the specific hot path to Wasm when the numbers justify it.
Simple Data Transformations
Matrix multiplication, basic vector math, color space conversion — V8 compiles these into machine code near-identical to what LLVM generates for Wasm. Our matrix transform results reflect this: the Wasm advantage stayed modest (under 3x in our runs) rather than growing the way heavier, branchier workloads' did. For purely arithmetic operations on typed arrays, JS is often good enough.
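For instance, a flat column-major 4x4 multiply (the WebGL convention) is exactly the shape of code V8 optimizes well. This is a minimal sketch, not the benchmarked implementation:

```javascript
// Minimal 4x4 multiply on flat Float32Arrays, column-major as in
// WebGL: element (row, col) lives at index col * 4 + row.
function mat4Multiply(out, a, b) {
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0;
      for (let k = 0; k < 4; k++) {
        sum += a[k * 4 + row] * b[col * 4 + k];
      }
      out[col * 4 + row] = sum;
    }
  }
  return out;
}

// Identity times identity stays identity.
const identity = new Float32Array([1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1]);
const product = mat4Multiply(new Float32Array(16), identity, identity);
console.log(product[0], product[5], product[1]); // 1 1 0
```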
SharedArrayBuffer: When You Need Threading
For workloads above 100K elements where even Wasm's single-threaded performance isn't enough, you can move computation to a Web Worker with SharedArrayBuffer. This keeps the main thread free for input handling and WebGL draw calls while a worker thread runs the Wasm compute step.
// Main thread
const PARTICLE_COUNT = 500_000;
const STRIDE = 7;
const BUFFER_SIZE = PARTICLE_COUNT * STRIDE * 4; // 4 bytes per float
// SharedArrayBuffer — accessible from both threads
const sharedBuffer = new SharedArrayBuffer(BUFFER_SIZE + 4); // +4 for sync flag
const syncFlag = new Int32Array(sharedBuffer, BUFFER_SIZE, 1);
const worker = new Worker('particle-worker.js');
worker.postMessage({ type: 'init', buffer: sharedBuffer, count: PARTICLE_COUNT });
function frame() {
// Check if worker has finished computing
if (Atomics.load(syncFlag, 0) === 1) {
// Worker is done — upload data to GPU
const particleView = new Float32Array(sharedBuffer, 0, PARTICLE_COUNT * STRIDE);
gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
gl.bufferSubData(gl.ARRAY_BUFFER, 0, particleView);
// Signal worker to start next frame
Atomics.store(syncFlag, 0, 0);
Atomics.notify(syncFlag, 0);
}
gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
gl.drawArrays(gl.POINTS, 0, PARTICLE_COUNT);
requestAnimationFrame(frame);
}
// ---- particle-worker.js ----
// import init, { ParticleSystem } from './pkg/particles.js';
// The worker receives the SharedArrayBuffer, creates a Wasm instance
// that writes directly into shared memory, and signals completion
// via Atomics. Main thread never blocks.
SharedArrayBuffer requires cross-origin isolation
Your server must send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. Without these, SharedArrayBuffer is unavailable. This also means you can't load cross-origin resources (images, scripts) without their servers sending appropriate CORS headers.
Tail Latency: The Hidden Win
Median frame time gets all the attention, but tail latency — the p95 and p99 — determines whether your animation feels smooth or stuttery. A single 30ms frame in a 60fps animation is visible as a hitch. This is where Wasm's advantage is largest.
| Workload (100K) | JS p50 | JS p99 | Wasm p50 | Wasm p99 | p99 Ratio |
|---|---|---|---|---|---|
| Particles (100K) | 0.29ms | 0.35ms | ~0.14ms | ~0.17ms | ~2.1x |
| Matrix (100K) | 15.4ms | 17.1ms | ~5.8ms | ~6.5ms | ~2.6x |
Even in our isolated runs, JavaScript's p99 sits measurably above its own median; under real browser load the spread widens further, as GC pauses, JIT recompilations, and V8 internal bookkeeping spike individual frames. Wasm's p99 stays close to its median. If your application is latency-sensitive (VR, music visualization, interactive simulation), this stability matters more than raw throughput.
Practical Recommendations
Based on these benchmarks, here's what we'd actually recommend for a new WebGL project.
- Start with JavaScript. Use typed arrays (Float32Array, Uint16Array) for all per-element data. Avoid objects in hot paths. This gets you 80% of the way.
- Profile before you port. Use Chrome DevTools Performance panel. If your compute step is under 4ms at your target element count, you're probably fine with JS.
- Port hot paths only. Don't rewrite your whole app in Rust. Identify the one or two functions that dominate frame time and port those. Our particle system's update() was a single function — that's all we moved to Wasm.
- Use wasm-bindgen, not raw FFI. The wasm-pack + wasm-bindgen toolchain handles memory management, type conversion, and module loading. Hand-rolled FFI, where you manage pointers and imports yourself, is error-prone and harder to maintain.
- Re-create typed array views every frame. It's cheap (a few hundred nanoseconds) and prevents the silent data corruption from memory growth invalidation.
- Use streaming compilation. WebAssembly.compileStreaming() compiles while downloading, cutting your cold start time in half.
- Cache compiled modules. Serve the .wasm with standard HTTP caching and compile via the streaming APIs; browsers can then reuse their internal machine-code cache, so repeat loads largely skip compilation. (Some browsers no longer allow storing a compiled WebAssembly.Module directly in IndexedDB.)
- Consider SharedArrayBuffer for >100K elements. Moving Wasm compute to a worker thread keeps the main thread responsive, but adds architectural complexity. Only do this if you've confirmed the compute step is your bottleneck.
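The "profile before you port" step can be as simple as timing the compute step in isolation against the ~4ms rule of thumb from the list above. A sketch (threshold and iteration count are arbitrary choices):

```javascript
// Quick check: time one update step and compare against a frame
// budget threshold to decide whether porting to Wasm is justified.
const BUDGET_MS = 4; // rule-of-thumb threshold from the list above

function worthPorting(updateFn, iterations = 100) {
  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) updateFn();
  const msPerCall = (performance.now() - t0) / iterations;
  return { msPerCall, port: msPerCall > BUDGET_MS };
}

// Example with a trivial workload:
const { msPerCall, port } = worthPorting(() => {
  let s = 0;
  for (let i = 0; i < 1000; i++) s += i;
  return s;
});
console.log(port); // false: far below the 4ms budget
```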
What About WebGPU?
WebGPU changes the equation. With compute shaders, workloads like particle systems and physics can run entirely on the GPU — no JS or Wasm needed for the compute step. In our early WebGPU tests, a 1M particle system ran the compute step in under 1ms on the GPU, with no CPU-side update or upload cost at all. But WebGPU support is still uneven across browsers and platforms, and the API is substantially different from WebGL. If you're starting a new project and can afford to target only browsers with WebGPU support, compute shaders will outperform both JS and Wasm for embarrassingly parallel workloads.
Key Takeaways
- Wasm was roughly 2-3x faster than optimized JS for CPU-heavy WebGL compute at 100K+ elements in our runs. Below 10K, JS is often faster due to boundary overhead.
- The real win is tail latency. At scale, Wasm's frame times stay stable because there are no GC pauses; it's JavaScript's worst frames, not its median, that show up as visible hitches.
- Zero-copy data transfer (Float32Array view into Wasm memory → bufferSubData) is essential. Copying data through JS negates the speed gain.
- Wasm has real costs: 45ms+ cold start, slower dev iteration, two-language complexity. Don't add it unless profiling shows you need it.
- For most WebGL apps with <10K dynamic elements, well-structured JavaScript with typed arrays is fast enough and far simpler.
- Profile first. Port hot paths only. Measure again.
The numbers don't lie, but they also don't tell the whole story. A 3.9x speedup sounds impressive until you realize your frame budget is 16ms and your JS implementation already runs in 0.7ms. Choose the tool that solves your actual bottleneck, not the one that wins benchmarks.