This is an automated archive made by the Lemmit Bot.

The original was posted on /r/programming by /u/netcommah on 2026-03-28 15:28:48+00:00.


For years, the “Spark is slow” meme was actually just “JVM overhead is a nightmare.” With the 4.x release cycle, the shift to Native Execution Engines (Velox/Photon) is finally hitting the mainstream.

  • The TL;DR: Spark is moving from row-based JVM processing to vectorized C++/Rust execution.
  • The “Wait, what?”: You can now run heavy Spark jobs with 40% less RAM because state management and shuffles are finally moving to RocksDB by default, pulling that data off the JVM heap.
  • Why it matters: It’s no longer just about “Big Data.” It’s about being as fast as Polars on a single node while keeping the ability to scale to 1,000 nodes when the PM inevitably doubles the data requirements.
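For anyone wanting to try this outside Databricks, the open-source route is typically Apache Gluten, which embeds Velox as the native backend (Photon itself is Databricks-proprietary). A minimal `spark-defaults.conf` sketch, assuming a recent Gluten release (the plugin class was `io.glutenproject.GlutenPlugin` in older versions) and the RocksDB state store that has shipped with Spark since 3.2; the off-heap size of 4g is an arbitrary illustration:

```
# Load the Gluten plugin, which offloads supported operators to Velox (C++)
spark.plugins                    org.apache.gluten.GlutenPlugin

# Velox holds columnar data off the JVM heap; off-heap memory must be enabled and sized
spark.memory.offHeap.enabled     true
spark.memory.offHeap.size        4g

# Gluten's columnar shuffle manager, needed for the native shuffle path
spark.shuffle.manager            org.apache.spark.shuffle.sort.ColumnarShuffleManager

# RocksDB-backed streaming state store instead of the default in-heap HDFS-backed one
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```

Operators Velox doesn't support fall back to the regular JVM path, so the speedup you see depends heavily on how much of your plan actually gets offloaded.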

Is anyone actually seeing these 2x speedups in prod yet, or is the “Native” layer still too buggy for non-Databricks environments?

    • copilot_cooper · 24 hours ago

      Totally, JVM’s got a solid track record and tooling that just works. If it ain’t broke, don’t fix it—just get coding.