Can you predict whether a new open source project taking its first baby steps is going to succeed, and what factors influence the outcome? A National Science Foundation-funded research team from the University of Massachusetts Amherst, MA, and the University of California, Davis, has been seeking an answer to this and other questions by using Apache Incubator data.
The project ‘Jumpstarting Successful Open-Source Software Projects with Evidence-based Rules and Structures’ was funded by the NSF under its Growing Convergence Research program, part of the NSF’s 10 Big Ideas initiative.
Answering Big Questions
The Growing Convergence Research program provides grants to help solve big challenges in the world, where it is recognized that solutions will require the convergence of disciplines. The program itself describes this as “the merging of ideas, approaches, and technologies from widely diverse fields of knowledge to stimulate innovation and discovery.”
This convergence is reflected in the project team that has been assembled from the two universities (the full team is listed at the end of this post), according to the lead Principal Investigators (PIs) on the project, Prof. Vladimir Filkov (UC Davis) and Prof. Charlie Schweik (UMass Amherst): “We have put together a dream-team of top-notch researchers, from software engineering, computational social science, public policy, cognitive science, and anthropology. Our interdisciplinary convergence is what makes this project possible.”
The project seeks to understand the social, technical, and institutional factors that help open source software (OSS) projects develop.
Apache Incubator as Source
As part of this work, the team chose the Apache Software Foundation (ASF) to understand how nascent and maturing projects in the Apache Incubator build sustainable communities that can create and maintain high-quality software. The team has shared some early results with the ASF from its initial research.
“The Apache Incubator helps incoming projects adopt the Apache Way and operations,” explained Justin Mclean, VP of the Apache Incubator. “The Incubator guides projects, called ‘podlings,’ to operate as self-governing projects to grow their communities to become top-level ASF projects (TLP).”
Ultimately, the results indicate that the Apache Incubator is a boon to early community formation, particularly in its policies and how they operate: “We are excited to see this independent research on the Incubator as it offers valuable insights into the program and showed that the core principles of the Apache Way resonate with the community, such as the importance of open communication,” says Justin Mclean.
It was the Apache Way’s open communication requirement that enabled the research team to collect publicly available data from incubating projects at the ASF. The team used four types of data:
- Digital traces of email communication (such as mailing lists), which in research terms are evidence of user activity. Natural language processing (NLP) was used to analyze the content.
- Commit history to establish the digital traces of effort in the project.
- Data from the Apache Incubator Projects page, individual project pages, and project-specific policy documents.
- Interviews with 16 mentors.
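As a rough illustration of how digital traces such as commit and email timestamps can be turned into activity measures, the sketch below counts events per calendar month. It is not the research team's pipeline: the function name `monthly_activity` and the sample timestamps are hypothetical, and the choice of a monthly window is an assumption made only for this example.

```python
from collections import Counter
from datetime import datetime


def monthly_activity(timestamps):
    """Count events per calendar month from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        moment = datetime.fromisoformat(ts)
        counts[moment.strftime("%Y-%m")] += 1
    # Sort by month so the result reads chronologically.
    return dict(sorted(counts.items()))


# Hypothetical sample data, not taken from any real podling.
commit_times = [
    "2020-01-03T10:15:00",
    "2020-01-20T08:00:00",
    "2020-02-11T17:45:00",
]
email_times = [
    "2020-01-05T09:30:00",
    "2020-02-01T12:00:00",
    "2020-02-14T16:20:00",
    "2020-02-28T11:05:00",
]

commits_per_month = monthly_activity(commit_times)
emails_per_month = monthly_activity(email_times)
print(commits_per_month)
print(emails_per_month)
```

Monthly counts like these are one simple way to place commit effort and mailing list activity on a common timeline before any further analysis.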
With this data, the team was able to uncover some fascinating findings, some of which are discussed below. This research could potentially help the Foundation continue to develop and improve how we support and work with podlings, the projects in the Foundation’s incubation program.
Probable Success Tracker
The most significant outcome of the study was that, using AI methods and the data mentioned above, the team was able to predict Incubator graduation with 90% accuracy within eight months. The findings were published at the prestigious ACM Foundations of Software Engineering conference in 2021, in the paper Sustainability Forecasting for Apache Incubator Projects by Likang Yin et al.
This was achieved by identifying variables associated with project success. The team was able to give an instantaneous graduation probability to each project, and this ‘probability tracker,’ called ASFI Explorer Project or APEX, has the potential to guide future podlings ‘just in time’ by flagging downturns in variables that might affect their potential success.
The results for many graduated and retired projects are available on the APEX GitHub.
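To make the idea of flagging downturns concrete, here is a minimal sketch that scans a series of monthly graduation probabilities and reports the months where the value fell sharply. It is not APEX's actual method: the function `flag_downturns`, the 0.10 drop threshold, and the sample probabilities are all hypothetical and chosen only for illustration.

```python
def flag_downturns(probabilities, drop_threshold=0.10):
    """Return the indices of months whose probability fell by at least
    drop_threshold compared with the previous month."""
    flagged = []
    for month in range(1, len(probabilities)):
        drop = probabilities[month - 1] - probabilities[month]
        if drop >= drop_threshold:
            flagged.append(month)
    return flagged


# Hypothetical monthly graduation probabilities for one podling.
monthly_probability = [0.55, 0.60, 0.62, 0.45, 0.48, 0.30]

flagged_months = flag_downturns(monthly_probability)
print(flagged_months)
```

A month-over-month comparison is the simplest possible rule. A real tracker would more plausibly look at the underlying variables and at trends over several months rather than a single drop.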
Open communication is crucial to the success of any Apache project, and the findings back this up. Podlings that took part in frequent discussions about governance around releases, documentation, and testing were more likely to graduate from the Apache Incubator. The implication is that these are all crucial areas that require good collaboration to achieve graduation.
However, not all governance discussions were as beneficial. For example, projects that struggled to complete and review their board reports were generally less likely to graduate. Because board reports are a vital requirement for graduation, and completing them involves documents unfamiliar to people outside C-level management roles, it is logical that this difficulty can lead to the retirement of podlings.
Code Quality Matters
The research also found that code quality and the processes around code impacted the graduation probability. A paper on this research, entitled Code, Quality, and Process Metrics in Graduated and Retired ASFI Projects, was published at the ACM Foundations of Software Engineering conference in 2022 by Stefan Stanciulescu et al.
In particular, it highlighted that podlings whose code was of medium complexity and composed of many large functions were more likely to graduate. In contrast, podlings were less likely to graduate if they had large file sizes and a high percentage of code duplication. With a nod to the need for diverse contributor types in a healthy community, podlings with both major and minor contributors were also more likely to succeed. This diversity resulted in more commits that fixed bugs and issues, which in turn tended to lead to successful project graduation.
It was also found that the Apache Incubator’s policies didn’t overburden podlings in how they distribute rules and requirements. In comparison, mentors and Incubator personnel were found to have fewer rules and requirements to follow than podlings and their committers and contributors. As mentors are experienced committers to Apache projects, they would be aware of the ASF’s policies, but it does open an opportunity to discuss what information is publicly available about the role of mentors.
Mentors were recognized as a key part of the benefit offered to podlings in the Apache Incubator. There were a number of notable trends in the mentoring results. For an in-depth explanation of the mentoring component of the study, read team member Curtis Atkisson’s peer-reviewed, open-access research article ‘Mentors Matter: Association of Mentors with Project Success in the Apache Software Foundation Incubator’ published by PLOS ONE.
Broadly speaking, mentors were found to have a positive impact: the graduation probability increased where mentors who had mentored more podlings were involved. There was a limit to this positive effect, however. The ‘sweet spot’ appears to be around three podlings, and the effect decreased as mentors went from mentoring three to seven projects.
There was also an effect from the number of mentors involved; larger mentorship teams increased the chance of graduating. In his research article, Atkisson suggests that this may indicate that “mentors with experience, but only a small amount of experience, are being given more responsibility for the success of the project but may not yet be experts in moving projects toward graduation.”
He suggests that this could be an opportunity for the ASF Incubator to “measure the performance of mentors and seek to ensure that the mentorship team for a project includes people at different levels of experience with mentorship: an experienced mentor to guide the project, an emerging mentor to gain more experience, and a novice mentor to introduce them to the program.”
While Atkisson acknowledges this may be difficult for a volunteer-run foundation like the ASF, he indicates that “it could increase the graduation probability of projects in the Incubator. This type of setup may also benefit other mentorship programs that use mentorship teams composed of mentors with varying levels of experience.”
The study also investigated what motivates participants and found that how they are managed differs dramatically between those doing work at the podling level and those at the Incubator level. A key observation from this section was that while podling participants take in volunteers and train and supervise them, the Incubator has a stronger focus on screening and evaluating mentors. The studies’ suggestions could lead the Apache Incubator to consider playing a more active role in training and supervising mentor volunteers.
Commenting on the findings, Justin Mclean clarified one assumption the research makes: “It is important to stress that the ASF does not expect, or set, a 100% graduation rate for podlings as the program is intended to assess whether or not a project is a suitable candidate for graduation and TLP status. When a podling is assessed as unsuitable for several reasons, it is not a reflection on the project in question or the individuals involved. It should be seen as an indication that the process is working as intended and that the project may not be a good fit for the Apache Way.”
Looking at Governance
The team has recently begun looking at projects through the dual lens of project activities and project governance. Their first dual-lens study, Open Source Software Sustainability: Combining Institutional Analysis and Socio-Technical Networks, co-authored by Likang Yin and Mahasweta Chakraborti and published at the prestigious ACM Conference on Computer-Supported Cooperative Work and Social Computing in 2022, showed that activities and governance have a strong influence on each other.
The team’s next project will focus in more detail on the governance aspect of the Apache Incubator and the processes that are predictive of graduation. It is hoped the results of this project will help the team continue to develop its APEX tool. As a part of this, the team will be launching a survey directed at participants in projects that have gone through the Incubator, allowing experienced members of Apache communities to contribute their insights. If you have been a part of a project that has gone through the Incubator, you can contribute to that survey here.
We would like to thank the team members from both the University of Massachusetts Amherst, MA, and the University of California, Davis, for this research. The team is as follows:
- Curtis Atkisson, anthropologist and computational social scientist at UMass. (Postdoctoral fellow, UMass)
- Brenda Bushouse, associate professor of Public Policy, UMass School of Public Policy and a social scientist with expertise in nonprofit organizations. (Co-PI, UMass)
- Mahasweta Chakraborti, student in Communication. (UC Davis)
- Vladimir Filkov, distinguished computer and data scientist at UC Davis. (Lead PI, UC Davis)
- Seth Frey, cognitive and computational social scientist and faculty member in UC Davis’ Department of Communication. (Co-PI, UC Davis)
- Anamika Sen, student in Economics. (UMass)
- Charles Schweik, associate director of Public Interest Technology (PIT@UMass) and professor of Environmental Conservation and Public Policy, and a social scientist who studies collective action in “commons” settings. (Lead PI, UMass)
- Stefan Stanciulescu, software engineer. (Postdoctoral researcher, UC Davis)
- Likang Yin, student in Computer Science. (UC Davis)
We are grateful for the team’s initial insights and look forward to what the project uncovers in future research as the ASF continues to pursue ways to optimize programs and the support it provides to fledgling open source software projects.