ASF Project Spotlight: Apache Iceberg

Published on April 29, 2026
By The ASF

Dipankar Mazumdar is the Director of Developer Relations at Cloudera, leading global developer initiatives across lakehouse architecture and AI. He previously held advocacy and engineering roles at Dremio, Onehouse, and Qlik, contributing to open source projects including Apache Iceberg, Apache Hudi, and Apache XTable (incubating) and building communities. His work focuses on the intersection of data engineering and AI. He is the author of Engineering Lakehouses with Open Table Formats and a contributor to Apache Iceberg: The Definitive Guide.

Apache Iceberg has quickly become a foundational technology in modern data architectures—but its impact goes far beyond performance and scale. This conversation with Dipankar explores how Iceberg redefined the data lake, and how community, education, and open collaboration fueled its adoption.

What Is Apache Iceberg and Why It Exists

Q: Can you tell us a bit about Apache Iceberg?
A: Apache Iceberg is a high-performance open table format for huge analytic datasets. It was designed to bring reliability and simplicity to data lakes, allowing multiple engines to safely read and write to the same datasets with strong guarantees. By introducing a table abstraction on top of raw data files, Iceberg helps organizations manage large-scale data with the consistency typically associated with data warehouses while retaining the flexibility of data lakes.

Q: What did the data ecosystem look like before Apache Iceberg, and what gaps did you see early on?
A: Before Iceberg, most data lakes relied on tables built on technologies like Apache Hive, with Apache Parquet as the underlying storage format. These systems worked well in the Hadoop era, but as workloads diversified and organizations moved to cloud object stores (like Amazon S3), a number of structural limitations began to surface. Updates were unreliable, partitioning strategies were brittle, and schema evolution was difficult to manage. Metadata handling also became increasingly expensive, especially with large numbers of files, and query performance would degrade over time.

At the same time, data warehouses abstracted all of this away behind proprietary systems, so many engineers never had to think about these problems directly. However, warehouse users faced significant costs and vendor lock-in.

These limitations in both data lakes and warehouses made it clear that a new approach was needed, one that treated tables as first-class objects rather than just collections of files.

From Netflix to Apache: How Iceberg Took Shape

Q: When was Iceberg started and why?
A: Iceberg was originally developed at Netflix to address these large-scale data challenges. The team needed a solution that could handle massive datasets reliably while supporting evolving data requirements. Recognizing that these challenges were industry-wide, the project was later open sourced and contributed to The Apache Software Foundation (ASF) in 2018 to foster broader collaboration and adoption.

Q: What technology problem is Apache Iceberg solving?
A: Iceberg addresses several fundamental issues in traditional data lakes:

Lack of consistency when multiple engines work on the same data
Complex and brittle partitioning strategies
Interoperability at the storage layer
Challenges with schema and partition evolution

By rethinking how tables are defined and managed, Iceberg enables scalable, reliable data operations without the overhead and fragility of legacy approaches.
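
To make the partitioning point concrete, here is a minimal Python sketch of the idea behind hidden partitioning and partition evolution. It is a toy model under stated assumptions, not Iceberg's implementation: `PartitionSpec`, `DataFile`, `write_file`, and `plan_files` are hypothetical names invented for illustration. The point it tries to show is that writers and readers work with the column value, while the partition value is derived from it, so the layout can change without rewriting old files or queries.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List

# Toy model only: these classes are illustrative and are not Iceberg's API.

@dataclass(frozen=True)
class PartitionSpec:
    spec_id: int
    name: str
    transform: Callable[[date], str]  # derives a partition value from a column value

@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int    # which spec was in force when the file was written
    partition: str  # partition value derived by that spec's transform

BY_MONTH = PartitionSpec(0, "month(event_date)", lambda d: d.strftime("%Y-%m"))
BY_DAY = PartitionSpec(1, "day(event_date)", lambda d: d.isoformat())
SPECS = {s.spec_id: s for s in (BY_MONTH, BY_DAY)}

def write_file(path: str, event_date: date, spec: PartitionSpec) -> DataFile:
    """Writers supply only the column value; the partition value is derived."""
    return DataFile(path, spec.spec_id, spec.transform(event_date))

def plan_files(files: List[DataFile], wanted: date) -> List[str]:
    """Readers filter on the column; each file is checked against its own spec."""
    return [f.path for f in files
            if SPECS[f.spec_id].transform(wanted) == f.partition]

files = [
    write_file("a.parquet", date(2024, 1, 15), BY_MONTH),  # old monthly layout
    write_file("b.parquet", date(2024, 2, 3), BY_MONTH),
    write_file("c.parquet", date(2024, 3, 9), BY_DAY),     # written after the spec changed
    write_file("d.parquet", date(2024, 3, 10), BY_DAY),
]

print(plan_files(files, date(2024, 1, 20)))  # ['a.parquet']
print(plan_files(files, date(2024, 3, 9)))   # ['c.parquet']
```

In this sketch the same query shape works against both layouts because each file remembers the spec it was written under. A real planner would also handle range predicates and many transforms; this one only handles equality on a single date.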

Q: What were some of the most important design decisions that shaped Iceberg early on?
A: A few key principles guided Iceberg’s design:

Treating metadata as a first-class concern for performance and scalability
Decoupling logical table structure from physical storage layout
Supporting schema evolution as a core feature, not an afterthought
Enabling engine-agnostic access to the data

These decisions allowed Iceberg to avoid many of the constraints that limited earlier systems.
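
One way to picture schema evolution as a core feature is column tracking by ID rather than by name. The Python sketch below is a simplified illustration, not Iceberg's code: the `schema` dictionary, the row layout, and the `read` helper are assumptions made for the example. It shows why a rename or an added column can be a metadata-only change when data files refer to fields by ID.

```python
from typing import Dict, List, Optional

# Toy model only: illustrates ID-based column tracking, not Iceberg's actual code.

# Table schema: field ID -> current column name (this lives in table metadata).
schema: Dict[int, str] = {1: "user_id", 2: "ts"}

# Data files store values keyed by field ID, so they never mention column names.
data_file_rows: List[Dict[int, object]] = [
    {1: 101, 2: "2024-01-01"},
    {1: 102, 2: "2024-01-02"},
]

def read(rows: List[Dict[int, object]],
         current_schema: Dict[int, str]) -> List[Dict[str, Optional[object]]]:
    """Project stored rows through the current schema; missing fields read as None."""
    return [{name: row.get(fid) for fid, name in current_schema.items()}
            for row in rows]

# Rename a column: a metadata-only change, no data file is rewritten.
schema[2] = "event_time"

# Add a column: it gets a fresh ID, and older files simply have no value for it.
schema[3] = "country"

result = read(data_file_rows, schema)
print(result[0])  # {'user_id': 101, 'event_time': '2024-01-01', 'country': None}
```

Because the stored rows never change, the rename cannot accidentally resurrect or shadow old data, which is the kind of breakage name-based schemes are prone to.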

Real-World Use and Impact

Q: Iceberg is known for working across multiple compute engines. Why was that so important from the start?
A: Interoperability is essential because modern data ecosystems rely on multiple processing engines. Iceberg was designed to act as a shared table layer, enabling different tools to safely access the same data without tight coupling. This approach gives organizations flexibility and helps prevent vendor lock-in.

Q: Are there any use cases you would like to tell us about?
A: Iceberg is used across industries for:

Large-scale analytics and reporting
AI pipelines
Streaming and batch data processing

It enables teams to unify different workloads on a single, reliable data foundation.

The Challenge: Explaining a New Layer of the Data Stack

Q: What made evangelizing Iceberg particularly challenging in those early days?
A: The challenge wasn’t just that Iceberg was new – it was that the problem it addressed wasn’t clearly recognized.

From the outside, most systems appeared to be working. Data warehouses handled structured workloads, data lakes handled large-scale storage, and teams had already built processes around them. So when we started talking about open table formats and formal specifications, it didn’t feel like solving an urgent problem.

Even when issues did exist, they weren’t attributed to the table abstraction itself. Slow jobs, partitioning limitations, schema breakages, or data corruption from concurrent writes were seen as isolated operational problems. Teams would patch them with scripts, conventions, or workarounds rather than questioning the underlying design.

That made the conversation harder. We weren’t just introducing a new approach – we were pointing out that a foundational layer people relied on had limitations they hadn’t fully understood yet.

Building Understanding: Advocacy, Education, and Content

Q: How did advocacy around Iceberg evolve beyond the core PMC & Committers?
A: In the beginning, most of the evangelism came from the engineers building the project. There wasn’t a structured effort around storytelling or community building.

That started to change when dedicated developer advocacy roles emerged with a focus on Iceberg. I was lucky enough to have been in that boat. Back in 2021-2022, there was no established playbook for how to evangelize an open table format. The approach was largely experimental, i.e. learning in public, translating technical concepts into something practical, and consistently engaging with the community.

Over time, that effort became more deliberate. If the abstraction wasn’t obvious to most people, we had to make it understandable. And in places where the ecosystem didn’t even have the right words, we had to build that vocabulary ourselves.

Q: At what point did you realize that community and education would play a central role?
A: It became clear fairly early that adoption was not going to happen through feature awareness alone. We were asking people to think about a layer they had historically never needed to reason about. Most engineers were focused on pipelines, performance, and reliability. The storage layer was abstracted away, and the issues they encountered, such as slow jobs, partitioning problems, and schema breakages, were treated as isolated operational concerns rather than symptoms of a deeper limitation.

So before we could talk about Iceberg, we had to establish why the table abstraction itself mattered. That meant explaining what a table actually represents in a data lake, how metadata governs behavior, and why things like snapshot isolation or partition evolution are not features, but necessary primitives for running multiple workloads safely.
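
As a rough illustration of why snapshot isolation is treated as a primitive, the following Python sketch models a table as a set of immutable snapshots plus a single "current" pointer that is swapped on commit. It is a deliberately small toy: the `Table` class and its `scan` and `commit` methods are hypothetical, and a real catalog performs the pointer swap atomically and can often retry a commit without redoing the write.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional

# Toy model only: hypothetical names, not Iceberg's API or catalog protocol.

@dataclass
class Table:
    # snapshot ID -> immutable set of data files visible in that snapshot
    snapshots: Dict[int, FrozenSet[str]] = field(
        default_factory=lambda: {0: frozenset()})
    current: int = 0  # pointer to the current snapshot

    def scan(self, snapshot_id: Optional[int] = None) -> FrozenSet[str]:
        """Readers see one immutable snapshot, never a half-finished write."""
        sid = self.current if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def commit(self, base: int, added: FrozenSet[str]) -> bool:
        """Optimistic commit: succeed only if nobody committed since `base`."""
        if base != self.current:
            return False  # conflict; the writer must retry on the new snapshot
        new_id = self.current + 1
        self.snapshots[new_id] = self.snapshots[base] | added
        self.current = new_id  # stands in for the single atomic pointer swap
        return True

table = Table()
reader_view = table.current      # a reader pins snapshot 0
base_a = base_b = table.current  # two writers start from the same snapshot

ok_a = table.commit(base_a, frozenset({"a.parquet"}))
ok_b = table.commit(base_b, frozenset({"b.parquet"}))  # stale base, rejected
ok_b_retry = table.commit(table.current, frozenset({"b.parquet"}))

print(ok_a, ok_b, ok_b_retry)           # True False True
print(sorted(table.scan()))             # ['a.parquet', 'b.parquet']
print(sorted(table.scan(reader_view)))  # []
```

The reader that pinned snapshot 0 keeps seeing a consistent (empty) table while both writers commit, and the second writer's stale commit is rejected instead of silently overwriting the first.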

Again, those are not surface-level concepts. They require building mental models from the ground up. That’s where community and education became central!

Q: You mentioned technical content playing a huge role in Iceberg’s early adoption. Can you share more?
A: Absolutely! Long-form technical content became the foundation for our advocacy goals. We definitely were more interested in explaining how the system actually works – what happens under the hood, how reads and writes are executed, and how design decisions translate into real-world behavior.

At the same time, it became clear that reading alone wasn’t enough for engineers. They needed to run things themselves and understand what Iceberg brought to the table. That led to hands-on exercises where people could experiment with things like table creation, schema evolution, partitioning, and table optimization to see the behavior directly.

Deep explanations paired with practical exploration helped bridge the gap between theory and adoption.

Community in Action: From Conversations to Momentum

Q: How did live engagement (talks, office hours, etc.) shape the evolution of the Iceberg community?
A: Webinars, conferences, and live sessions created a space for real-time interaction. Early on, a lot of those sessions were spent clarifying fundamentals – what Iceberg is and what it isn’t.

But over time, the nature of the conversation changed. Engineers began bringing their own systems, workloads, and constraints into the discussion. And the questions shifted from conceptual to more operational – which was great to see. That’s what led to introducing dedicated office hours. Instead of one-to-many sessions, these became open forums where people could discuss real production scenarios. Those conversations became one of the most important feedback loops, shaping how we explained Iceberg and highlighting gaps in understanding.

Conferences became places where the ecosystem came together, and the conversations moved from understanding the technology to discussing how it was being used in production. People started comparing approaches, sharing operational strategies, and aligning on best practices. That’s when you could see the shift: Iceberg was no longer just a technology that came out of Netflix – it was becoming a shared foundation that multiple organizations were building on.

Q: Was there a moment when you realized Iceberg was gaining real momentum?
A: The inflection point came when the community itself started shaping the narrative. Iceberg was never positioned as a vendor-owned format, and its specification evolved in the open.

As more contributors, adopters, and organizations got involved, the conversation expanded beyond any single perspective. Integrations, use cases, and improvements came from different directions, each grounded in real production needs. At that point, growth was no longer something being driven – it was something emerging from the open source community itself. And that was a huge success for us!

Q: How did you contribute to building and growing the Apache Iceberg community?
A: A lot of the early Iceberg activity was fragmented. There were contributors building the system, users trying it out, and discussions happening across Slack, mailing lists, and conferences – but these weren’t always connected. We were quite deliberate about bringing that together.

That meant consistently creating places where those interactions could happen. We were actively helping users on Slack, running webinars, publishing blogs and books, setting up office hours, and making sure the people actually building and using Iceberg were part of those conversations. Instead of keeping discussions isolated, we tried to pull them into shared spaces.

Over time, that started to compound. We turned questions from users into topics for talks. There were production use cases that would show up in webinars or conferences. Conversations that happened in one place would get carried into others. It gave the community a kind of continuity that wasn’t there before.

It also changed the level of discussion. Early on, most conversations were about understanding what Iceberg is. As more people got involved, it shifted toward how it behaves under real workloads – scale, concurrency, and multi-engine access. This shift came from repeatedly bringing practitioners into the same loop and letting those discussions build on each other.

That’s really where we had the most impact – creating that loop and keeping it active.

The Apache Way: Governance and Community Over Code

Q: How has moving to The ASF influenced the project’s growth and direction?
A: Becoming part of The ASF reinforced Iceberg’s commitment to open governance and vendor neutrality. It created a foundation for long-term sustainability and encouraged broader participation from across the industry.

Q: The ASF’s mission is to provide software for the public good. In what ways does your project embody that mission and the “community over code” ethos?
A: Iceberg is developed in the open, with decisions shaped by a diverse community rather than a single organization. This collaborative approach ensures the technology serves a wide range of users and use cases. The emphasis on shared ownership and transparency reflects the ASF’s “community over code” philosophy.

Getting Started and Getting Involved

Q: What advice would you give to teams considering adopting Iceberg today?
A: Start by understanding your current data challenges and how Iceberg (and the open lakehouse ecosystem) aligns with your needs. Take advantage of its interoperability to experiment within your existing ecosystem and tools, and engage with the community to learn from others’ experiences.

Q: What’s the best way to learn about the project and try it out?
A: The best way to get started is through the official documentation and by experimenting with Iceberg using your preferred data processing engine. Community channels (Slack), the dev mailing list, and conference talks also provide valuable guidance for new users (see links below). Other education-focused resources include the Cloudera Community, the Developer Playlist, and Iceberg 101 (Dremio).

Q: How can others contribute to this project (code contributions being only one of the ways)?
A: Contributions come in many forms, including improving documentation, sharing use cases, participating in discussions, and helping others adopt the technology. These efforts are just as important as code in building a strong, sustainable project.

What’s Next for Apache Iceberg

Q: What does the future hold for the project?
A: A big part of where Iceberg is heading is around supporting newer workloads, especially AI-driven ones. We are starting to see more demand for things like wider schemas, semi-structured data, and access patterns that go beyond traditional analytics. To support that, there’s growing interest in areas like vector-based indexing and more flexible data representations.

At the same time, there’s continued focus on making the system more efficient. That includes improvements to metadata handling and commit performance, which become more important as tables scale. There’s also work around indexing. Instead of relying only on file-level stats, newer proposals are exploring primary, secondary, and even vector-based indexes, especially as access patterns shift.
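
For readers unfamiliar with file-level stats, the Python sketch below shows the basic idea of pruning files by per-file minimum and maximum column values. It is an assumption-laden toy rather than Iceberg's planner: `FileStats`, `prune`, and the `price` column are made up for the example, and real manifests carry richer statistics.

```python
from dataclasses import dataclass
from typing import List

# Toy model only: hypothetical names; real manifests carry richer statistics.

@dataclass(frozen=True)
class FileStats:
    path: str
    min_price: float  # smallest value of the column in this file
    max_price: float  # largest value of the column in this file

def prune(files: List[FileStats], lo: float, hi: float) -> List[str]:
    """Keep files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_price >= lo and f.min_price <= hi]

manifest = [
    FileStats("f1.parquet", 0.0, 9.99),
    FileStats("f2.parquet", 10.0, 49.99),
    FileStats("f3.parquet", 5.0, 500.0),
]

# Query: WHERE price BETWEEN 20 AND 30
print(prune(manifest, 20.0, 30.0))  # ['f2.parquet', 'f3.parquet']
```

Note that `f3.parquet` survives pruning only because its value range is wide, even though it may contain no matching rows. Cases like this are one reason min/max statistics alone can fall short for selective lookups, which is the gap the indexing proposals mentioned above aim to address.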

Overall, the future direction is about evolving the core pieces, so Iceberg can support a broader set of workloads while keeping the same design principles.

Iceberg Resources

The ASF is home to nearly 9,000 committers contributing to more than 320 active projects including Apache Airflow, Apache Camel, Apache Flink, Apache HTTP Server, Apache Kafka, and Apache Superset. With the support of volunteers, developers, stewards, and more than 75 sponsors, ASF projects create open source software that is used ubiquitously around the world. This work helps us realize our mission of providing software for the public good.

In the midst of hosting community events, engaging in collaboration, producing code and so much more, we often forget to take a moment to recognize and adequately showcase the important work being done across the ASF ecosystem. This blog series aims to do just that: shine a spotlight on the projects that help make the ASF community vibrant, diverse and long lasting. We want to share stories, use cases and resources among the ASF community and beyond so that the hard work of ASF communities and their contributors is not overlooked.

If you are part of an ASF project and would like to be showcased, please reach out to markpub@apache.org.

