Flash-MoE: 397 Billion Parameters. Running on a Laptop. Yes, Just a Laptop.

397 Billion Parameters. Running on a Laptop. Yes, Just a Laptop.

Stop for a second.

Because what you’re about to read might change how you think about the future of artificial intelligence.

While big tech companies spend millions every year running massive AI models inside warehouse-sized data centers, one person just ran one of the world’s largest models on a regular MacBook Pro you can buy off the shelf at an Apple Store.

No cloud.
No GPU cluster.
No PyTorch.
Not even Python.

Just C and Metal.

The project that broke all the rules: flash-moe

This is an open-source inference engine capable of running Qwen3.5-397B directly on a MacBook Pro with 48GB of RAM.

The results?

4.4 tokens per second
Full tool calling support
Real-world, production-ready performance
Completely offline operation

Why does this sound impossible?

The numbers alone should make any engineer do a double-take:

Model size: 209GB
Device RAM: 48GB
Actual memory used during inference: 5.5GB

How is that even possible?

The secret lies in MoE architecture—Mixture of Experts.

Instead of loading all 512 experts at once, the system:

Loads only 4 experts per token
Streams the model directly from SSD in real-time
Runs hand-optimized kernels written in Metal shaders

The result? A model that would normally require a data center… running on a personal laptop.

Here’s the wildest part.

This project wasn’t built by Google.
Not by OpenAI.
Not by NVIDIA.

It was built by one person.

A side project, in fact, by the VP of AI at CVS Health.

He used Claude Code as a coding partner and shipped:

Roughly 7,000 lines of C
Around 1,200 lines of Metal shaders
58 manual optimization passes
All in about 24 hours

Why this actually matters

Running a 397B parameter model on cloud GPUs costs:

Hundreds of dollars per hour
Millions per year for enterprise use

Now?

A $3,499 laptop
100% private, local execution
No API keys
No subscriptions
Unlimited use, forever

The bigger picture

We’re not just looking at another cool GitHub repo.

We’re watching the beginning of a real shift:

AI power moving from corporations to individuals
The end of infrastructure monopolies
The rise of personal AI supercomputers

The question isn’t “Do I need the cloud to run serious AI anymore?”

It’s “How long until every laptop is its own AI data center?”

Where to find it

The project is already trending on GitHub.
It’s topping Hacker News.
And it’s fully open source.

If you care about privacy, local-first tools, or just love seeing what’s possible when smart engineering meets consumer hardware—this one’s worth your time.

No hype. No gatekeeping. Just code that works.

Written by Hazem Abbas. I write about AI, privacy, Linux, and tools that put control back in your hands. If this resonated, share it with someone who still thinks big AI requires big infrastructure.

Easy Python

Flash-MoE: 397 Billion Parameters. Running on a Laptop. Yes, Just a Laptop.

The project that broke all the rules: flash-moe

Why does this sound impossible?

Why this actually matters

The bigger picture

Where to find it

New Article

The project that broke all the rules: flash-moe

Why does this sound impossible?

Why this actually matters

The bigger picture

Where to find it

Related articles