LakeSail
A Rust-native, drop-in Spark replacement that runs in your cloud account to deliver faster data processing and significantly lower compute costs.
LakeSail is a Rust-native data and AI platform designed as a drop-in replacement for Apache Spark. It replaces the JVM-based runtime with an engine written from the ground up in Rust, which removes common bottlenecks of JVM-based data platforms: garbage collection pauses, memory tuning, and serialization overhead. Because it is fully compatible with the Spark Connect protocol, organizations can run their existing PySpark, Spark SQL, Delta Lake, and Apache Iceberg pipelines without code rewrites by pointing the client at a different connection endpoint. LakeSail is deployed directly into the user's own AWS account, which keeps data under the user's control, and its stateless architecture autoscales to zero so that idle infrastructure incurs no compute cost.
The platform provides a single engine for batch processing, stream processing, interactive SQL queries, and AI/ML workloads. It is built on Apache Arrow and DataFusion, which provide vectorized, SIMD-accelerated execution and let Python workloads run without moving data across a JVM boundary. Beyond core compute, LakeSail targets agent-driven workflows: a native Model Context Protocol (MCP) server lets AI agents interact directly with the data layer, create branches for sandboxed experimentation, and maintain full lineage of their operations.
Some of the key features are:
- Drop-in Compatibility: Fully implements the Spark Connect protocol, requiring zero rewrites of existing PySpark or Spark SQL code.
- Rust-Native Engine: Built in Rust to provide instant startup, zero garbage collection pauses, and significantly reduced memory overhead compared to JVM-based engines.
- Native Python Performance: Executes Python UDFs directly in-process via PyO3, bypassing traditional serialization and inter-process communication bottlenecks.
- Agent-First Infrastructure: Ships with a native MCP server and supports lakehouse branching, allowing AI agents to create sandboxes, review diffs, and commit changes to production data safely.
- Open Standards: Native support for Apache Iceberg and Delta Lake, ensuring no vendor lock-in and no proprietary format conversions.
- Stateless Architecture: Fully stateless workers that scale to zero, ensuring users only pay for compute resources during active task execution.
- Transparent Cost Model: Uses a predictable billing model based on hardware utilization without hidden platform markups or opaque usage units.
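The sandbox, diff, and commit flow named under Agent-First Infrastructure can be pictured with a small in-memory model. This is a conceptual sketch only: `ToyLakehouse` and its methods are hypothetical names invented for illustration and are not LakeSail's API, and real lakehouse branching operates on table metadata rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLakehouse:
    """In-memory stand-in for a branchable table store (hypothetical, not a LakeSail API)."""

    branches: dict = field(default_factory=lambda: {"main": {}})

    def create_branch(self, name: str, source: str = "main") -> None:
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        # Copy the source state so writes on the branch never touch the source.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, key: str, value: int) -> None:
        self.branches[branch][key] = value

    def diff(self, branch: str, target: str = "main") -> dict:
        """Map each changed key to (value on target, value on branch); None means absent."""
        src, dst = self.branches[branch], self.branches[target]
        return {
            key: (dst.get(key), src.get(key))
            for key in src.keys() | dst.keys()
            if src.get(key) != dst.get(key)
        }

    def commit(self, branch: str, target: str = "main") -> None:
        # Simplification: the branch state replaces the target wholesale,
        # with no conflict detection if the target changed in the meantime.
        self.branches[target] = dict(self.branches[branch])
        del self.branches[branch]


lake = ToyLakehouse()
lake.write("main", "orders/2024-01", 100)

# An agent works in its own sandbox branch.
lake.create_branch("agent-sandbox")
lake.write("agent-sandbox", "orders/2024-01", 98)
lake.write("agent-sandbox", "orders/2024-02", 50)

# A reviewer inspects the pending changes before they reach main.
pending = lake.diff("agent-sandbox")

# After review, the sandbox is committed and removed.
lake.commit("agent-sandbox")
```

The point of the pattern is that `main` is untouched until the explicit `commit` step, so an agent's intermediate writes can be inspected or discarded without affecting production data.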
LakeSail works by deploying a stateless compute cluster within the user's AWS account. Users continue to write and manage code with the standard Spark DataFrame API, but connect to the LakeSail endpoint instead of a Spark cluster. The engine's query optimizer and execution layer, powered by DataFusion, translate incoming logical plans into efficient physical execution plans. AI agents can be attached through the MCP server to automate data tasks. They perform transformations on ephemeral branches of the lakehouse, and those changes can then be audited and committed back to the primary data repository through the platform's dashboard or API.
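The endpoint swap described above can be sketched with PySpark's Spark Connect client, where `SparkSession.builder.remote(...)` takes a connection string of the form `sc://host:port`. The host name, the helper functions, and the column names below are hypothetical. The default port 15002 is Apache Spark's Spark Connect default, so the port a given LakeSail deployment exposes is an assumption to check against that deployment.

```python
def spark_connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string (sc://host:port)."""
    if not host:
        raise ValueError("host must be non-empty")
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"sc://{host}:{port}"


def connect(host: str, port: int = 15002):
    """Open a Spark Connect session against the given endpoint."""
    # Imported lazily so the URL helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(spark_connect_url(host, port)).getOrCreate()


def top_customers(spark, path: str, limit: int = 10):
    """Ordinary DataFrame code; nothing here is specific to the backend."""
    from pyspark.sql import functions as F

    return (
        spark.read.parquet(path)
        .groupBy("customer_id")  # hypothetical column name
        .agg(F.sum("amount").alias("total"))  # hypothetical column name
        .orderBy(F.desc("total"))
        .limit(limit)
    )


# Hypothetical endpoint: only this value changes between backends.
ENDPOINT = spark_connect_url("lakesail.example.internal")
```

Calling `connect("lakesail.example.internal")` and passing the resulting session to `top_customers` would run the same pipeline code that previously targeted a Spark cluster, provided the endpoint speaks Spark Connect.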
Some common use cases include:
- Legacy Spark Modernization: Replacing existing JVM-based Spark clusters to reduce infrastructure costs and eliminate JVM tuning.
- Agentic AI Workflows: Deploying autonomous agents that need direct, high-performance access to large datasets for inference, feature engineering, or data refinement.
- Interactive Data Analysis: Running high-speed, ad-hoc analytical queries on large-scale lakehouse storage using familiar SQL or Python interfaces.
- Cost-Effective ETL Pipelines: Offloading resource-intensive batch data transformations to a stateless, auto-scaling engine that minimizes cloud compute expenditures.