Apache Software Foundation Announces New Top-Level Project Apache® DataFusion™ 

Fast, extensible query engine for building high-quality data-centric systems in Rust is now a Top-Level Project

Wilmington, DE, June 11, 2024 – The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 320 active open source projects and initiatives, today announced Apache® DataFusion™ is now a Top-Level Project (TLP). DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. To download the latest release of DataFusion, visit https://datafusion.apache.org/download.html

DataFusion aims to be the query engine of choice for new, fast, data-centric systems such as databases, dataframe libraries, machine learning, and streaming applications by leveraging the unique features of Apache Arrow and Rust. By using DataFusion, projects can focus on developing specific features and avoid reimplementing standard features such as an expression representation, standard optimizations, parallelized streaming execution plans, file format support, etc.

DataFusion can be used without modification as an embedded SQL engine, or it can be customized and used as a foundation for building new systems. It is used in systems focused on analytic (high-throughput), streaming, and transactional (low-latency) workloads, such as:

  • Specialized analytical database systems such as Apache HoraeDB
  • New query language engines such as prql-query and accelerators such as VegaFusion
  • Research platforms for new database systems, such as opt-d
  • Streaming data platforms such as Synnada
  • SQL support for another library, such as dask-sql 
  • Tools for reading / sorting / transcoding files such as qv
  • Apache Spark runtime replacements such as Comet and Blaze 

“Apache DataFusion has grown tremendously since its inception. What started as a modest project to provide a simple and efficient query engine has evolved into a robust, high-performance system that powers data-centric applications worldwide. This growth is a testament to the Apache Way,” said Andy Grove, Apache DataFusion PMC Member and original creator of DataFusion. “Becoming a Top-Level Project is a significant milestone, and I am excited to see how the project will continue to innovate and shape the future of data processing.” 

DataFusion Feature Highlights 

  • Fast, vectorized, multi-threaded, streaming execution engine
  • Support for Parquet, CSV, JSON, and Avro file formats via built-in plugins
  • Support for custom file formats and non-file data sources via extension traits
  • Many extension points: user-defined scalar/aggregate/window functions, data sources, SQL, other query languages, custom plan and execution nodes, optimizer passes, and more
  • A state-of-the-art query optimizer with expression coercion and simplification, projection and filter pushdown, sort- and distribution-aware optimizations, automatic join reordering, and more
  • Streaming, asynchronous input/output directly from popular object stores, including AWS S3, Azure Blob Storage, and Google Cloud Storage (other systems are supported via extensions)
  • Support for Substrait to easily pass plans across language and system boundaries
  • Implementation in Rust

“DataFusion’s capabilities have been integral to the development of InfluxDB 3.0. By building with and contributing to this project, we’ve been able to deliver a powerful, vectorized SQL engine to our users, all while benefiting from continuous improvements from a dedicated global community,” said Paul Dix, CTO and co-founder of InfluxData.

Additional Resources

DataFusion has been developed at the Apache Software Foundation since 2019 as part of the Apache Arrow project. The community has since grown to include so many users and contributors that DataFusion graduated to a Top-Level Project, which provides more focused governance capacity for continued growth.

About The Apache Software Foundation (ASF)
Founded in 1999, the Apache Software Foundation exists to provide software for the public good with support from more than 75 sponsors. ASF’s open source software is used ubiquitously around the world with more than 8,400 committers contributing to 320+ active projects including Apache Superset, Apache Camel, Apache Flink, Apache HTTP Server, Apache Kafka, and Apache Airflow. The Foundation’s open source projects and community practices are considered industry standards, including the widely adopted Apache License 2.0, the podling incubation process, and a consensus-driven decision model that enables projects to build strong communities and thrive. https://apache.org

ASF’s annual Community Over Code event is where open source technologists convene to share best practices and use cases, forge critical relationships, and learn about advancements in their field. https://communityovercode.org/ 

© The Apache Software Foundation. “Apache” is a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

Media Contact
press@apache.org