The Apache Software Foundation (ASF) exists to provide software for the public good with support from more than 70 sponsors. ASF’s open source software is used ubiquitously around the world with more than 8,400 committers contributing to 320+ active projects.
As one of the largest open source foundations in the world, ASF has built the “Apache Way,” resulting in a set of standards that promote the sustainable development of open source communities and guide the practice of open source projects.
A Platinum sponsor of ASF, Google also funds many contributions to ASF projects, including two of Apache’s most popular big data projects: Apache Airflow and Apache Beam.
We sat down with some of the Googlers who work on Apache Airflow and Apache Beam to learn more about the projects and their personal experiences navigating the Apache Way. The following are Q&A discussions with them.
Rafal Biegacz is a Senior Engineering Manager within Google Cloud who has been contributing to Apache Airflow for more than three years. In the following Q&A, Rafal shares his experience contributing to an Apache project and his insights surrounding the rising demand for the operationalization and management of data.
ASF: Can you tell us a bit about your journey in open source development?
Rafal: Today, it’s hard to imagine building systems that would not be dependent on open source technologies – they are ubiquitous. But this was not the case when I first started working in open source 20 years ago. Then, in 2019, I joined the Cloud Composer team. This was the first time that I had the opportunity to make contributions to larger open source projects like Apache Airflow.
ASF: What is Apache Airflow?
Rafal: Apache Airflow is an open source workflow management platform for data engineering pipelines. Airflow was started as an open source project at Airbnb in October 2014 and was developed as a solution to manage the company’s increasingly complex workflows. The project’s extensible Python framework enables users and developers to build workflows connecting with virtually any technology.
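To make the idea of a workflow concrete, here is a minimal conceptual sketch in plain Python. It is not Airflow's API: the task names (`extract`, `transform`, `load`) and the `run_workflow` helper are hypothetical, and it assumes only the standard library (`graphlib`, Python 3.9+). It shows the core idea behind a workflow manager, which is running tasks in an order that respects their declared dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each entry maps a task name to
# (callable, list of upstream task names whose results it consumes).
tasks = {
    "extract": (lambda: [3, 1, 2], []),
    "transform": (lambda rows: sorted(rows), ["extract"]),
    "load": (lambda rows: f"loaded {len(rows)} rows", ["transform"]),
}


def run_workflow(tasks):
    """Run every task after its upstream tasks; return (order, results)."""
    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, upstream = tasks[name]
        # Pass each upstream result as a positional argument.
        results[name] = func(*(results[u] for u in upstream))
    return order, results


order, results = run_workflow(tasks)
print(order)
print(results["load"])
```

A real workflow platform adds scheduling, retries, and monitoring on top of this ordering step; the sketch only illustrates the dependency graph at the center of it.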
ASF: What makes contributing to Airflow so exciting?
Rafal: The Apache Airflow community is very welcoming and encourages people with varying levels of OSS experience to get involved. Airflow is a project that makes it easy for newcomers to understand how the community operates and start contributing right away. While there are not a lot of formal processes, it is one of the most popular projects in terms of contributors – a top 10 GitHub project.
As part of the Google team, we wanted to build an offering on top of Airflow to make it more accessible for people to use. Google Cloud Composer was the first commercial offering of Apache Airflow on the cloud, and launched about four years after Airflow came to the Apache Software Foundation.
There are many reasons why it is exciting to work with a community like Airflow:
- Apache Airflow is one of the most important workflow definition and execution frameworks in the world;
- Apache Airflow is one of the most active projects within the family of Apache projects with more than 2,000 active contributors;
- Apache Airflow is a very flexible technology that is able to fulfill the needs of both simple and very complex use cases; and
- Community members can contribute in many areas including core Apache Airflow features, Airflow operators, Airflow tests, Airflow documentation and finally Airflow Summit.
ASF: What’s a good starting point when seeking to contribute to an Apache project? How do you go about identifying the needs or gaps?
Rafal: Learning how a community operates is part of working within open source regardless of project. First, it is important to define the problem you believe needs to be solved. Second, the problem should be publicly discussed within the appropriate community channels (Slack, dev lists, etc.). Finally, it is helpful to propose a solution, if possible, and volunteer to work on it.
For the Airflow community, it is important to be included from the very beginning and to be able to provide feedback along the way.
There are several ways to get involved in Apache Airflow:
- Attend the “Contributing to Apache Airflow” workshop (https://airflowsummit.org/workshops/);
- Look at Airflow Improvement Proposals and see if you can contribute to any of the improvements. The Airflow community is very welcoming to any contributions, whether you are a newbie or more experienced with Airflow;
- Look at the list of open issues on GitHub: https://github.com/apache/airflow/issues
If an issue has no one assigned to it, ask Airflow contributors to assign it to you;
- Attend Airflow meetups or watch Airflow Summit presentation recordings;
- Join Apache Airflow Dev group – observe the topics discussed;
- Join Apache Airflow Slack channels and ask the community which tasks have no owner and could be picked up by you; and
- Start discussions on Apache Airflow GitHub.
ASF: What community events does the community host?
Rafal: These are some of the top events and community gathering opportunities for those interested in Apache Airflow:
- Airflow Summits: Global conference held annually since 2020;
- Apache Airflow meetups: hosted globally, from Warsaw, Poland to New York City;
- Google Open Source Day – hosted by Google annually and devoted to Apache Airflow;
- Project-focused Workshops such as Astronomer, Google, Airflow – PMC members deliver workshops to introduce and share knowledge; and
- Outreach Internships, for example: Journey with Airflow as an Outreachy Intern
ASF: In your opinion, what have been the standout features or releases from Airflow?
Rafal: Several features stand out:
- The recent addition of Deferrable Operators was an important step forward;
- Recent improvements around Airflow System Tests are going to ensure higher quality of Airflow operators and prevent backward-incompatible changes (AIP-47);
- Airflow system tests: a robust testing infrastructure ensures that changes committed to the repository do not break core functionality;
- Data Lineage – the Airflow community continues to work on data lineage as this functionality is of great interest for many Airflow users; and
- Features related to data sets are very important to meet growing demands of users such as data scientists.
ASF: What do you see as future opportunities for the project?
Rafal: Any improvements related to security are of high importance for the team at Google contributing to Apache Airflow. Another core feature that Google is contributing to is multi-tenancy for Apache Airflow. Google also cooperates with the Airflow community on AIP-44 and AIP-43, which provide better separation between users and code.
To learn more about Apache Airflow:
Kenneth (Kenn) Knowles is a Staff Software Engineer at Google and a founding member of Apache Beam, where he works to advance the state of the art in streaming big data computations. Kenn is also a member of ASF and a founding member of the ASF’s Diversity & Inclusion Project.
Kerry Donny-Clark is a manager on the Apache Beam team at Google. Before joining Apache Beam, Kerry was a professional yo-yo player, an English teacher in Japan, a cancer researcher, an elven fighter/mage, a janitor, a circus performer, and various kinds of software engineer. Kerry likes to build things, from furniture to applications, and his hobby is collecting other hobbies.
In the following Q&A, Kenn and Kerry tell us a bit about their experiences working in open source and contributing to open source projects like Apache Beam.
ASF: What is Apache Beam?
Google: Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs allows users to build a program that defines the pipeline. The pipeline can then be executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.
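The separation between defining a pipeline and choosing where it runs can be illustrated with a small conceptual sketch in plain Python. This is not the Beam SDK: `build_pipeline`, `run_batch`, and `run_streaming` are hypothetical names, and the two "runners" are stand-ins for real back-ends. The point is only that one pipeline definition can be executed over a bounded collection or over elements arriving one at a time.

```python
from typing import Callable, Iterable, Iterator, List

# In this sketch a "pipeline" is an ordered list of per-element transforms.
Pipeline = List[Callable[[int], int]]


def build_pipeline() -> Pipeline:
    """Define the pipeline once, independent of how it will be executed."""
    return [lambda x: x * 2, lambda x: x + 1]


def apply_all(pipeline: Pipeline, element: int) -> int:
    """Apply every transform in order to a single element."""
    for step in pipeline:
        element = step(element)
    return element


def run_batch(pipeline: Pipeline, data: List[int]) -> List[int]:
    """Hypothetical batch runner: processes a bounded collection at once."""
    return [apply_all(pipeline, x) for x in data]


def run_streaming(pipeline: Pipeline, source: Iterable[int]) -> Iterator[int]:
    """Hypothetical streaming runner: emits each result as its input arrives."""
    for x in source:
        yield apply_all(pipeline, x)


pipeline = build_pipeline()
batch_result = run_batch(pipeline, [1, 2, 3])
stream_result = list(run_streaming(pipeline, iter([1, 2, 3])))
print(batch_result, stream_result)
```

Both runners produce the same results from the same definition. A real unified model also has to handle concerns this sketch ignores, such as distributing work across machines and grouping unbounded data.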
ASF: Can you tell us a bit about the road leading up to open sourcing Apache Beam?
Kenn: When Dataflow launched, the client library SDK was open source but focused only on Dataflow. But the SDK included the ability to plug in other runners – the SDK included a local testing runner and the Dataflow runner. Folks from Data Artisans, the original creators of Flink, experimented with running a Dataflow pipeline with Flink, creating the “Flink Runner”. Independently, folks at Cloudera and PayPal ran a Dataflow pipeline with Spark, creating the “Spark Runner”. Representatives of these projects got together and agreed that it made sense to combine their efforts, to have one advanced data processing model that you could run on Dataflow, Flink and Spark. They merged the repositories for the Dataflow SDK, the Flink Runner, and the Spark Runner, and began incubating at the ASF as Apache Beam. It’s a bit of a unique origin story, because it’s the union of three separate codebases to create a new project at the intersection of communities, rather than the adoption of an existing open source project into the ASF.
ASF: When does something become a viable open source project?
Kenn: There are many types of OSS projects, so it depends. ASF has a particular, powerful approach: a project should be able to survive as funding sources and contributors come and go. This is the gold standard, but we also know that things are more fluid and precarious in a lot of projects.
ASF: Can you tell us a bit about your involvement with Apache Beam?
Kerry: I’ve always been someone who is interested in open source, going as far back as the 90s when open source software was the “cool new thing.”
Apache Beam intrigued me for several reasons. Google had a large team devoted to an Apache open source project, which I saw as a huge advantage. It also struck me how little of my internal Google knowledge was applicable to other industry standard tech. So, working on Apache Beam allowed me to work in a collaborative OSS manner while also gaining expertise in standard practices, standard software capabilities, and skill sets.
ASF: What has been your experience or takeaways from first joining the Apache Beam community?
Kerry: It was daunting to come into a project with so much support from Google. It was difficult to find out where to start, because Beam is a rich, mature project. And the project itself is quite complex: any time you are dealing with large, parallel data processing, it requires you to think in a different way about how that model is implemented in software.
I started my journey by asking a lot of questions. It’s become much easier to ask questions, watch videos, and consume documentation, but I needed guidance to follow the right path through that documentation. It’s very important to guide new users through the dense forest of new user materials. Discovery – where to start and how to prioritize – is still one of our biggest challenges for community member onboarding.
ASF: What was it like onboarding a whole team to an open source project?
Kerry: It was actually much easier to onboard engineers than myself. I have a lot of experience onboarding new engineers to a new project and follow the same steps every time: partner them with senior people, find projects for them to dive into, identify first steps, and then expand the scope. It is important to give people areas of focus so that they can begin to learn the Beam model through the lens of the project they are working on. From there, they are able to apply their knowledge as they learn other related models within Apache Beam.
ASF: Is there any advice you’d give projects who are thinking about incubating their projects?
Kenn: I’ve been a mentor for a couple of incubating projects. While projects might have different scopes, they all face similar challenges in early days, including learning the governance of ASF and embracing the Apache Way. It’s important to share your intentions and forget any mindset that the open source effort is in conflict with the commercial entity you might work for. The interests are aligned with one another. There is always benefit to sharing and keeping things in the open – it encourages more collaboration, innovation and efficiency.
As an example of the kind of transparency we embrace, the Apache Beam PMC has published our expectations – we made a list of what we are looking for in a committer so that people understand exactly what they need to do. When we discuss whether someone should be a committer, we adhere to that list. We evaluate people against that list and give them feedback according to those expectations.
ASF: Why was the ASF chosen to host Apache Beam?
Kenn: Apache is where big data lives. In a multi-cloud world, to make customers more comfortable, it makes sense to choose some foundation, some neutral body where the code lives. And of course two of the three communities that spawned Beam were already part of ASF.
ASF: How can projects encourage more contributors? What are the benefits of newcomers?
Kerry: One of the big benefits newcomers can bring is that they have fresh eyes and can see the gaps, especially in feature sets or documentation. For example, newcomers can provide valuable input on critical processes like standardizing flags: they might see the disparate ways we do this and bring consistency. Newcomers can also help review documentation, which makes things easier for future newcomers. They often find the bugs that resonate with them, which can lead to features and updates that benefit many others. We need newcomers to join projects to provide a balance of experience levels; it keeps the project alive.
ASF: What’s the focus for the project going forward? Any features or goals to tout?
Kenn: A big thing to call out is the ability to freely share data processing code across languages, for example writing a high-performance connector for Kafka and using it in Python or Go. With this technology, we were able to create a TypeScript SDK in a week.
I would also highlight that the core of Beam is very stable, and we are now developing easier-to-use tools on top of Beam, for example all the tools that go around your machine learning (ML) workflow, starting with bulk predictions and extending to model validation and more. The bulk of a successful ML/AI product is the processing that surrounds the training, and Beam aims to be a great way to build that out. That is just one example of our focus on making specific use cases easier on top of our very mature core model and runners.
To learn more about Apache Beam: