Social Aspects of Collaboration in Online Software Communities
Bogdan Vasilescu
Promotor: prof.dr. M.G.J. van den Brand (TU/e)
Co-promotor: dr. A. Serebrenik (TU/e)
Technische Universiteit Eindhoven
Date: 20 October, 2014, 16:00
Thesis: PDF
Summary
Software engineering is inherently a collaborative venture, involving many stakeholders that coordinate their efforts to produce large software systems. In distributed (online) settings, endemic to open-source software (OSS) development, such collaborations span geographies and cultures. Contributors with different skill sets, personalities, cultural backgrounds, or geographic locations self-organise in online software communities and voluntarily contribute to a collaborative effort, such as developing and evolving software systems, offering user support, or sharing knowledge. Myriads of online software communities exist. Among the most visible are those around the Linux operating system and its various distributions, the Gnome desktop environment, the Apache or Mozilla software foundations or, more recently, the GitHub repository hosting platform, or the Stack Exchange network of question and answer sites (e.g., Stack Overflow for programming-related questions).
Different communities operate by different rules and offer different incentives to contribute. More traditional communities such as Gnome or Apache rely mostly on the intrinsic motivations of developers. In contrast, younger communities offer social media and gamification features (e.g., contributors to Stack Overflow are rewarded for their activity with reputation points and badges) as well as increased visibility to peers (e.g., the activity and achievements of GitHub and Stack Overflow contributors are aggregated and displayed on public profile pages). Furthermore, online software communities are often interdependent. For example, a Gnome contributor may engage simultaneously in several communities that are part of the Gnome ecosystem, where she may take on different roles or instead choose to specialise in similar tasks. Similarly, a GitHub contributor may participate simultaneously in Stack Overflow, where she may seek help from peers or instead share her knowledge and expertise to help educate others.
Starting from the recent realisation of the empirical software engineering research community that social aspects are at least as important for the success of distributed collaborations as technical ones, this dissertation is an attempt to understand collaboration in representative online software communities from a social perspective. To this end, we mine and analyse a wealth of publicly available trace data using state-of-the-art statistical techniques, from various standpoints outlined below. The work in this dissertation sits at the intersection of two research communities, computer-supported cooperative work (CSCW) and software engineering (SE). Consequently, it offers contributions to both SE and CSCW, ranging from methods and tools to mine and analyse large amounts of (social) trace data, to practical implications for software maintainability, software team management, knowledge management or community design.
First, we propose individual-level measures of workload, involvement, and specialisation of labour in online software development communities, and report on a case study of the Gnome ecosystem community. Additionally, we propose a community-level measure of skill diversity with respect to a certain technical skill, such as knowledge of a given programming language. Social interactions between contributors to software development communities and their degree of participation have been reported repeatedly to influence software quality and complexity. Similarly, skill heterogeneity in a software community is important for the community’s survival and performance.
Second, we analyse the effects of simultaneously contributing to multiple communities on individual working rhythms. We focus on developers contributing source code to GitHub repositories while seeking and sharing knowledge on Stack Overflow. Despite the popularity of Stack Overflow, its role in the work cycle of OSS developers is yet to be understood. On the one hand, participation in it has the potential to increase the knowledge of individual developers, thus improving and speeding up the development process. On the other hand, participation in Stack Overflow may interrupt the regular working rhythm of a developer, and hence possibly slow down the development process. We show that active GitHub contributors typically engage in Stack Overflow as experts, asking few questions and providing many answers for others. Moreover, despite the interruptions incurred, the Stack Overflow activity rate correlates with the code changing rate on GitHub.
Third, we chart the changes in behaviour of contributors to online knowledge sharing communities as they migrate into gamified environments. To this end, we assemble a joint data set for R (a widely-used tool for data analysis) by integrating data from mailing lists and Stack Exchange sites, having activities of individual contributors linked across the two resources and also over time. We find that user support activities show a strong shift away from mailing lists (historically the hub for development and user support activities in online software communities) and towards Stack Exchange. Moreover, knowledge providers contributing to both communities provide faster answers on Stack Exchange than on the mailing list, and their total output increases after the transition.
Fourth, we revisit the gamified Stack Overflow environment from the perspectives of gender representation and participation patterns, in comparison to traditional information sharing venues such as mailing lists. In addition to encouraging competitiveness through its gamification features, anecdotal evidence around Stack Overflow suggests that it is an unfriendly community that promotes one-upmanship, or fosters flame-wars and the down-voting of individuals. We find that while women are under-represented in all studied communities, they show different participation patterns on Stack Overflow, where they disengage sooner than men. However, relative to the duration of their engagement in the community, women on Stack Overflow are at least as active as men.
Fifth, we address the recurring mining software repositories challenge of inconsistent identity data. To this end, we analyse the robustness of two representative existing identity merging algorithms with respect to different types of noise typical of software repositories, and we propose a new identity merging algorithm inspired by Latent Semantic Analysis, a popular information retrieval technique expected to perform well in the presence of noise. Using data extracted from Gnome Git repositories, we evaluate the performance of our algorithm empirically by means of cross-validation, and show an improvement over existing approaches in terms of precision and recall on worst-case input data.
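To make the notions of precision and recall concrete for identity merging, the following is a minimal sketch of one common way to score a merging result: compare the pairs of aliases that the algorithm places in the same identity against the pairs that a reference ("ground truth") grouping places together. The pairwise scoring scheme, the function names, and the alias strings are assumptions made for illustration; the summary does not state how the dissertation computes these measures.

```python
from itertools import combinations


def same_identity_pairs(clusters):
    """Return all unordered alias pairs that a clustering puts in one identity."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Score a predicted merging against a reference merging (hypothetical helper)."""
    pred = same_identity_pairs(predicted)
    gold = same_identity_pairs(truth)
    correct = len(pred & gold)
    # By convention, an empty set of pairs counts as perfect here.
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall


# Illustrative aliases (invented): three belong to one person, one to another.
a1 = "jane doe <jane@example.org>"
a2 = "jdoe <jdoe@example.com>"
a3 = "Jane D. <jane@example.org>"
b1 = "john smith <john@example.org>"

truth = [{a1, a2, a3}, {b1}]
predicted = [{a1, a2}, {a3, b1}]  # one correct merge, one wrong merge

precision, recall = pairwise_precision_recall(predicted, truth)
```

In this toy case the algorithm proposes two merged pairs, of which one is correct, and recovers one of the three true pairs. A merging that is too eager lowers precision, while one that is too cautious lowers recall.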
Last, we provide an example of the applicability of methods developed or refined as part of this work beyond online software communities. Specifically, we propose a metrics suite to study the health of software engineering conferences and conference communities, measuring such attributes as stability, openness to new authors, or introversion. Using this metrics suite, we assess the health of 11 established software engineering conferences over a period of more than 10 years. In general, we find that software engineering conferences are healthy, with some notable differences depending on the chosen health metric, or a conference’s wide or narrow scope. Beyond demonstrating the generalisability of our techniques, this latter study has implications for conference steering committees or program committee chairs wishing to assess their selection process, or prospective authors trying to decide in which conferences to publish.
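As an illustration of what a conference health metric such as openness to new authors might look like, the sketch below computes the share of a given year's authors who did not publish at the same conference in a preceding window of years. The definition, the four-year window, the function name, and the data are all assumptions made for this example, not the metrics suite defined in the dissertation.

```python
def newcomer_share(authors_by_year, year, window=4):
    """Fraction of authors in `year` absent from the previous `window` editions.

    Hypothetical definition for illustration only.
    """
    current = authors_by_year[year]
    if not current:
        return 0.0
    previous = set()
    for y in range(year - window, year):
        # Years without data contribute no known authors.
        previous |= authors_by_year.get(y, set())
    return len(current - previous) / len(current)


# Invented author sets for three editions of one conference.
authors_by_year = {
    2010: {"A", "B", "C"},
    2011: {"B", "C", "D"},
    2012: {"C", "D", "E", "F"},
}

share_2012 = newcomer_share(authors_by_year, 2012)
```

Here two of the four authors in 2012 (E and F) have not appeared in the earlier editions, so half of that year's authors count as newcomers. A consistently low value over many years would suggest a more closed community under this particular definition.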