Calculate your Bus Factor - A practical introduction to GNU R

Robin Gower, Infonomics

The strength of open source comes from the community, but what would happen if some of it were to go missing?

  1. What is the Bus Factor and why should you care?
  2. Getting set up with R
  3. Measuring knowledge - a naive approach!
  4. Extracting information from version control
  5. Calculating the bus factor
  6. Visualising the results
  7. Comparing some open source projects

What is the Bus Factor?

The bus factor is the number of developers that would need to be hit by a bus before a project is in serious trouble.

A low bus factor means that knowledge is concentrated in a few vital people.

A high bus factor means that a broad range of individuals know enough to carry on even if others leave.

Why should you listen to this talk?!

Find out if the kernel could cope without Torvalds

Compare your work to successful open source projects

Read between the lines - learn about your code base with data analysis

Look with your fingers

Getting set up with R

R logo
RStudio screenshot

RStudio (AGPLv3) is a popular IDE for R, offering a code editor, console, plot viewer, documentation browser and other tooling for reproducible research.

Measuring knowledge

Naive approach: counting lines of code

Author             Lines of Code
Katherine Johnson           9000
Dorothy Vaughan             4242
Mary Jackson                1337
...                          ...

Authorship data from version control

Clone a repo

$ git clone git://git.gnupg.org/gnupg.git

Find the name of each line's author


$ git blame --line-porcelain Makefile.am | grep '^author ' | sed -e 's/^author //'

Werner Koch
Werner Koch
Werner Koch
David Shaw
Werner Koch
David Shaw
David Shaw
Werner Koch
...

Get a vector of author names for a file


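get_author_for_lines_in_file <- function(repo_path, filename) {
  # Blame the file inside the repo and keep only the author name of each line.
  # shQuote protects filenames that contain spaces or shell metacharacters.
  cmd <- paste("cd", repo_path, "&& git blame --line-porcelain", shQuote(filename),
               "| grep '^author ' | sed -e 's/^author //'")
  system(cmd, intern=TRUE)
}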

> get_author_for_lines_in_file("gnupg", "Makefile.am")
[1] "Werner Koch" "Werner Koch" "Werner Koch" "David Shaw" "Werner Koch"     
[6] "David Shaw"  "David Shaw"  "Werner Koch" "David Shaw" "Werner Koch"     
[11] "David Shaw" "David Shaw"  "David Shaw"  "David Shaw" "Werner Koch"
...

Counting with Frequency Tables


> shoes <- c(35,37,37,38,38,38,39,39,39,40,40,40,40,41,41,42,47)
> table(shoes)

shoes
35 37 38 39 40 41 42 47 
 1  2  3  3  4  2  1  1
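
Two companions to `table()` are handy for the steps that follow: `sort()` ranks the counts and `prop.table()` turns them into shares. A minimal sketch on the same shoe-size vector; the variable names `counts`, `ranked` and `shares` are illustrative and not from the talk.

```r
shoes <- c(35,37,37,38,38,38,39,39,39,40,40,40,40,41,41,42,47)

counts <- table(shoes)                      # frequency of each distinct value
ranked <- sort(counts, decreasing = TRUE)   # most common sizes first
shares <- prop.table(counts)                # each count divided by the total

names(ranked)[1]   # the most common size, as a character string
sum(shares)        # the shares add up to 1
```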

Counting LOC by Author


> authors <- get_author_for_lines_in_file("gnupg", "Makefile.am")
> table(authors)
authors
      David Shaw    Justus Winter Marcus Brinkmann     NIIBE Yutaka      Werner Koch 
              10                1                2                1              129

> as.data.frame(table(authors), responseName="line_count")
           authors line_count
1       David Shaw         10
2    Justus Winter          1
3 Marcus Brinkmann          2
4     NIIBE Yutaka          1
5      Werner Koch        129

Counting LOC by Author


count_of_lines_by_author_in_file <- function(repo_path, filename) {
  # One author name per line of the file
  line_authors <- get_author_for_lines_in_file(repo_path, filename)
  # count_line_authors tallies the lines for each author, as table() did above
  count_line_authors(line_authors)
}
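
The helper `count_line_authors` is not defined on the slides. One plausible definition, assuming it simply wraps the `table()` and `as.data.frame()` step shown earlier and names its columns `author` and `line_count` (the names the later `ddply` call relies on), might look like this. The toy input is made up for the example.

```r
# Hypothetical helper: the talk does not show its definition.
# Turns a character vector of per-line author names into a data frame
# with one row per author and a line_count column.
count_line_authors <- function(line_authors) {
  as.data.frame(table(author = line_authors),
                responseName = "line_count",
                stringsAsFactors = FALSE)
}

# Toy input standing in for the output of get_author_for_lines_in_file()
toy_lines  <- c("Werner Koch", "David Shaw", "Werner Koch")
toy_counts <- count_line_authors(toy_lines)
```

Naming the `table()` argument `author` is what gives the resulting column that name.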

Need to call this function for all the files in the repo and combine the results.

Get a list of files in the repository


$ git ls-tree HEAD -r | awk -F '\t' '{print $2}'

.gitignore
ABOUT-NLS
AUTHORS
COPYING
COPYING.CC0
...

Get a vector of files in a repo


enumerate_files_in_repo <- function(repo_path) {
  # List every file tracked at HEAD; the path is the tab-separated second field
  cmd <- paste("cd", repo_path, "&& git ls-tree HEAD -r | awk -F '\t' '{print $2}'")
  system(cmd, intern=TRUE)
}

> enumerate_files_in_repo("gnupg")
[1] ".gitignore"                                                           
[2] "ABOUT-NLS"                                                            
[3] "AUTHORS"                                                              
[4] "COPYING"                                                              
[5] "COPYING.CC0"
...

Getting functional in R

Split Apply Combine

The Split, Apply, Combine Strategy
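
As a rough illustration of split-apply-combine, here is a base-R sketch on made-up data: split per-file line counts by author, apply `sum` to each piece, and combine the results into one data frame. The data frame `per_file` and its values are invented for the example; the talk itself uses `plyr` for this step.

```r
# Made-up per-file counts (not real GnuPG figures)
per_file <- data.frame(
  file       = c("a.c", "a.c", "b.c", "b.c"),
  author     = c("alice", "bob", "alice", "carol"),
  line_count = c(10, 5, 20, 1),
  stringsAsFactors = FALSE
)

pieces <- split(per_file$line_count, per_file$author)  # split: one vector per author
totals <- vapply(pieces, sum, numeric(1))              # apply: sum each piece
by_author <- data.frame(author = names(totals),        # combine: back into a data frame
                        line_count = unname(totals),
                        stringsAsFactors = FALSE)
```

`aggregate(line_count ~ author, per_file, sum)` does the same in a single call; the three explicit steps are spelled out here only to make the pattern visible.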

Total LOC per author across all files


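library(plyr)  # provides adply, ddply and summarise

count_of_lines_by_author_in_repo <- function(repo_path) {
  repo_tree <- enumerate_files_in_repo(repo_path)
  # Apply the per-file count to every file and stack the results
  lines_by_file <- adply(repo_tree, 1, count_of_lines_by_author_in_file,
                         repo_path=repo_path, .progress = "text")
  # Sum the per-file counts for each author
  ddply(lines_by_file, .(author), summarise, line_count=sum(line_count))
}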

Total LOC per author across all files


> lba <- count_of_lines_by_author_in_repo("gnupg")
  |=============================================================| 100%

> lba
                                 author line_count
1                           Werner Koch     542336
2                            Repo Admin      51894
3                       Jonas Borgström         71
4                         Justus Winter      48338
5                            David Shaw       2482
6                   Daniel Kahn Gillmor      25051
7                          Jim Meyering         45
8                      Marcus Brinkmann       7887
9                          NIIBE Yutaka      13004
...

Calculating author contributions


calculate_author_contribution <- function(author_lines) {
  # Rank authors from most to fewest lines
  sc <- author_lines[order(author_lines$line_count, decreasing=TRUE),]
  # Running totals of lines and authors, as counts and as shares of the total
  sc$cum_line_count <- cumsum(sc$line_count)
  sc$cum_author_count <- seq_len(nrow(sc))
  sc$cum_line_percent <- sc$cum_line_count/max(sc$cum_line_count)
  sc$cum_author_percent <- sc$cum_author_count/max(sc$cum_author_count)
  return(sc)
}

Calculating author contributions

author               line count  cumulative line count  cumulative author count  cumulative line percent  cumulative author percent
Werner Koch              542336                 542336                        1                     0.75                       0.02
Repo Admin                51894                 594230                        2                     0.82                       0.04
Justus Winter             48338                 642568                        3                     0.89                       0.07
Daniel Kahn Gillmor       25051                 667619                        4                     0.92                       0.09
Neal H. Walfield          18900                 686519                        5                     0.95                       0.11
NIIBE Yutaka              13004                 699523                        6                     0.96                       0.13
Marcus Brinkmann           7887                 707410                        7                     0.98                       0.16
...                         ...                    ...                      ...                      ...                        ...

Calculating the Bus Factor (finally)


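calculate_bus_factor <- function(ac, critical_threshold=0.5) {
  # Count the top authors whose cumulative share stays below the threshold
  critical_contributions <- ac[ac$cum_line_percent < critical_threshold, ]
  nrow(critical_contributions)
}


> gnupg_ac <- calculate_author_contribution(lba)
> calculate_bus_factor(gnupg_ac)
[1] 0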
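
The result of 0 follows from the strict `<` comparison: the top author alone already exceeds the 50% threshold, so no row qualifies. If you would rather count the smallest number of top authors whose combined share reaches the threshold, a variant could look like the sketch below. The function name `min_authors_to_reach` and the toy counts are assumptions for illustration, not part of the talk.

```r
# Hypothetical variant: smallest number of top contributors whose combined
# share of lines reaches the threshold.
min_authors_to_reach <- function(line_counts, critical_threshold = 0.5) {
  shares <- cumsum(sort(line_counts, decreasing = TRUE)) / sum(line_counts)
  which(shares >= critical_threshold)[1]
}

dominated <- c(75, 10, 8, 7)      # one author holds 75% of the lines
balanced  <- c(25, 25, 25, 25)    # four equal contributors

min_authors_to_reach(dominated)   # 1
min_authors_to_reach(balanced)    # 2
```

Under this reading a project dominated by one person scores 1 rather than 0; which convention is more useful is a judgment call.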

The Grammar of Graphics

Wilkinson - Grammar of Graphics

Stop Making Pie Charts!

The Grammar of Graphics


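library(ggplot2)
library(scales)  # percent labels
library(xkcd)    # provides theme_xkcd
# ac is the data frame returned by calculate_author_contribution()
# A trailing + continues the expression; a leading + would end it early
ggplot(ac, aes(cum_author_count, cum_line_percent)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept=0.5, colour="darkgrey") +
  expand_limits(y=0) +  # x=0 cannot be shown on a log scale
  scale_y_continuous("Lines of Code", labels=percent) +
  scale_x_log10("Number of Authors", breaks=c(1,10,20,30,40,50)) +
  labs(title="Cumulative Author Contributions to GNUPG") +
  theme_minimal() +
  theme_xkcd()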

Heartbleed vulnerability in OpenSSL

Calculate your Bus Factor

github.com/robsteranium/bus-factor
robin@infonomics.ltd.uk
@robsteranium