← Back to Blog

Gitlog in Sqlite

HappyFunCorp gitlog in sqlite

Oct 2, 2020

SQL is great


Published August 28, 2020 #git, #sqlite


askgit is a great way to look at information inside of a repository. However it currently doesn’t support looking at the files inside of the commit itself — it gives you a view of the repository at the time of the commit, but not the patch itself. Since I don’t know enough about golang and sqlite virtual tables, let’s just create a sqlite3 database from the logfile.


Get the gitlog


We are going to use our favorite test repo, ruby-git because it’s so deliciously meta.


cd /tmp
git clone https://github.com/ruby-git/ruby-git
cd /tmp/ruby-git

And then, let’s pull out a list of commits, that includes


%hThe hash%aeAuthor email%aIAuthor date%sSummary--numstatFiles with changes in the commit


git log --pretty=format:'|%h|%ae|%an|%aI|%s' --numstat > /tmp/commits.log

We are going to load this file into a sqlite database that we’ll use to do some analysis.


Parse and load


First we need a Gemfile for the sqlite3 gem:


source 'https://rubygems.org'

gem 'sqlite3'

And then we can write a script that loads everything up.



  1. Create the database and tables

  2. Go through commits.log file one line at a time

  3. If the line starts with | parse out the fields and add them to the commits table.

  4. If the line isn’t blank (and isn’t already a |) then parse out the file changes and add them to the commit_files table.

  5. If the commit is already in the database, then abort.


require 'sqlite3'

  def create_database filename
db = SQLite3::Database.new filename

    rows = db.execute <<-SQL1
CREATE TABLE IF NOT EXISTS commits (
id TEXT UNIQUE,
summary TEXT,
author_name TEXT,
author_email TEXT,
author_when DATETIME
);
SQL1

    rows = db.execute <<-SQL2
CREATE TABLE IF NOT EXISTS commit_files (
id TEXT,
name TEXT,
added INT,
deleted INT
);
SQL2

    db
end

  def add_commit db, id, email, name, date, summary
ret = db.execute("INSERT INTO commits (id, summary, author_name, author_email, author_when)
VALUES (?, ?, ?, ?, ?)", [id, summary, name, email, date ] )
end

  def add_file_commit db, id, file, added, deleted
ret = db.execute("INSERT INTO commit_files (id, name, added, deleted)
VALUES (?, ?, ?, ?)", [id, file, added, deleted] )

  end

  def parse_file db, file
commit = nil
File.open( file ).each_line do |line|
line.chomp!
if line[0] == '|'
md = /\|(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*)/.match( line )
commit = md[1]

        puts line

        begin
add_commit db, md[1], md[2], md[3], md[4], md[5]
rescue SQLite3::ConstraintException
puts "Found existing commit, exiting"
exit
end
elsif line.length != 0
md = /([\d|-]*)\s*([\d|-]*)\s*(.*)/.match( line )
add_file_commit db, commit, md[3], md[1], md[2]
end
end
end

  db = create_database "test.db"

  parse_file db, "/tmp/commits.log"

|4bef5ab|[email protected]|James Couball|2020-04-25T14:40:51-07:00|Release v1.7.0
Found existing commit, exiting

Sanity check the data


We can run some of the queries that we did before.


sqlite3 test.db "select count(*) from commits;"
sqlite3 test.db "select min(author_when) from commits;"
sqlite3 test.db "select * from commits where author_when = (select min(author_when) from commits);" --csv

402
2007-11-07T12:54:26-08:00
f5baa11,"beginning of Ruby/Git project","scott Chacon",[email protected],2007-11-07T12:54:26-08:00

We can also look to see who has contributed the most commits in a certain time frame, like say after 2015.


sqlite3 test.db "select count(*), author_name from commits 
where author_when >= '2015-01-01'
group by author_name
having count(*) > 1
order by count(*) desc " -column -header

count(*)    author_name     
---------- ----------------
44 Roberto Decurnex
9 James Couball
9 Per Lundberg
5 Vern Burton
2 Joshua Liebowitz
2 Yuta Harima

Looking into file change patterns


All of this we could have done with askgit which maps directly to the repository without having to do a seperate processing step. Let's see what interesting information we can find looking inside the commits themselves.


Most touched file


This shows how many commits reference a specific file, so we can see which is the most active.


sqlite3 test.db "select count(*), name from commits, commit_files
where commits.id = commit_files.id
and commits.author_when >= '2016-01-01'
group by name order by count(*) desc limit 20" -header

count(*)|name
22|lib/git/lib.rb
9|README.md
8|lib/git/version.rb
7|.travis.yml
7|git.gemspec
4|CHANGELOG.md
4|lib/git/diff.rb
4|tests/units/test_lib.rb
3|PULL_REQUEST_TEMPLATE.md
3|VERSION
3|tests/test_helper.rb
2|.github/stale.yml
2|CHANGELOG
2|CONTRIBUTING.md
2|MAINTAINERS.md
2|lib/git/branch.rb
2|tests/units/test_archive.rb
2|tests/units/test_bare.rb
2|tests/units/test_config.rb
2|tests/units/test_remotes.rb

Most changed files in a time frame


Lets look to see which files have changed the most in a certain time frame. This gives you an idea of where the activity of the project has been focused.


sqlite3 test.db "select count(*) as commits, sum(added) + sum(deleted) as lines_touched, name from commits, commit_files
where commits.id = commit_files.id
and commits.author_when >= '2016-01-01'
group by name
order by lines_touched desc
limit 20" -header

commits|lines_touched|name
22|430|lib/git/lib.rb
2|222|CONTRIBUTING.md
1|196|lib/git/status.rb
1|169|tests/files/encoding/dot_git/hooks/pre-rebase.sample
1|144|tests/units/test_merge_base.rb
1|128|tests/files/encoding/dot_git/hooks/update.sample
4|115|tests/units/test_lib.rb
1|114|tests/files/encoding/dot_git/hooks/fsmonitor-watchman.sample
4|70|lib/git/diff.rb
1|65|tests/units/test_status.rb
1|61|tests/units/test_diff_non_default_encoding.rb
1|57|RELEASING.md
1|53|tests/files/encoding/dot_git/hooks/pre-push.sample
1|52|tests/units/test_object.rb
7|51|.travis.yml
1|49|tests/files/encoding/dot_git/hooks/pre-commit.sample
1|44|tests/units/test_init.rb
2|43|tests/units/test_remotes.rb
1|42|tests/files/encoding/dot_git/hooks/prepare-commit-msg.sample
2|42|tests/units/test_config.rb

Finding commits that changed a specific file


This is a fairly mature library that doesn’t require that many changes. But those changes seem to be focused in lib/git/lib.rb. What do we think was going on?


sqlite3 test.db "select summary, author_name, author_when from commits, commit_files
where commits.id = commit_files.id
and commit_files.name = 'lib/git/lib.rb'
and commits.author_when >= '2016-01-01'
order by author_when desc" -column -header

summary                                              author_name  author_when              
--------------------------------------------------- ----------- -------------------------
Ruby version compatibility conflict solution (#453) TIT 2020-04-05T20:33:35+03:00
Add no verify for commit with documentation (#454) Agora Secur 2020-04-05T17:33:04-05:00
Remove extraneous '--' from `git stash save -- mess Antonio Ter 2020-04-05T17:06:53-03:00
Git::Lib#normalize_encoding early return fix (#461) James Bunch 2020-04-05T14:15:30-07:00
Fix issue with color escape codes after recent upda Marcel Hoye 2020-02-10T22:29:14+01:00
Fix describe command's dirty, abbrev, candidates, a Harald 2020-02-06T02:13:29+01:00
Implementation and tests required to ensure that co James Couba 2019-12-11T10:04:03-08:00
Support merge-base (#370) Evgenii Pec 2018-10-02T10:03:44-04:00
Add support for unshallow (#377) Stephen Pau 2018-08-24T18:06:12-05:00
using File as a block so that it tears down once ev Vern Burton 2018-08-01T09:16:42-05:00
Support 'push --delete' (#325) Kody 2018-07-28T17:33:52+01:00
commit with custom author date (#374) Matias Garc 2018-07-12T16:49:52-03:00
Check if branch contains commit (#174) Kelly Stann 2018-07-12T06:19:21-04:00
config_get: Fix incorrect block name (#373) Joshua Lieb 2018-06-25T11:46:53-07:00
Allow fetch operation to receive a `ref` param (#36 Joshua Lieb 2018-06-25T11:45:38-07:00
Fix space prefix in tag message (#316) Denis Defre 2018-06-20T14:02:19+02:00
Enable set url on remotes (#338) Tom Potts 2018-05-03T13:48:03+01:00
Enable mirror option for git clone and push (#342) Guilherme M 2018-05-03T09:37:17-03:00
Fix UTF-8 message errors (#327) Alexander M 2018-04-30T14:47:26+03:00
Fix ls-files for git 2.16 (#350) Rafael Regg 2018-03-28T08:44:59-03:00
Updating String#encode usage to prevent silly type Roberto Dec 2016-02-25T12:31:33-03:00
Fix the encoding issue - 'split': invalid byte sequ David Varta 2016-02-16T15:42:14+00:00

Change coupling by commit


Let’s look to see which files are changed together, which will give us a sense of which files are coupled together.


First we’ll create query_and_pivot.rb that we'll use to do our query and do the count pivot stuff to parse the output.


require 'sqlite3'

  # Query the database and pivot the second result (filename) around the
# first result (either date or commit id) into a cochange structure
def query_and_loop
db = SQLite3::Database.new "test.db"

    cochanges = yield db

    file_set = []
prev_id = nil

    ret = {}

    cochanges.each do |id,name|
if prev_id != id
add_set ret, file_set
file_set = []
end
file_set << name unless file_set.include? name
prev_id = id
end
add_set ret, file_set
collect_and_sort ret
end

  # Add each set into the cochange structure.  This tracks how many
# times each file was seen with another file, and the number of times
# the file was seen overall.
def add_set cochange, set
set.each do |file|
cochange[file] ||= {}
cochange[file][:count] ||= 0
cochange[file][:count] += 1
set.each do |cofile|
if file != cofile
cochange[file][cofile] ||= 0
cochange[file][cofile] += 1
end
end
end
end

  # Turn the cochange sent into an array that is sorted by the
# cooralation
def collect_and_sort cochange
ret = []

    cochange.each do |file, stats|
commits = stats[:count]
stats.each do |cofile, cocount|
coorelation = cocount / commits.to_f
if cofile != :count && coorelation > 0.3 && cocount > 2
ret << { file: file,
commits: commits,
cofile: cofile,
coorelation: coorelation,
cocount: cocount }
end
end
end

    ret.sort { |a,b| b[:coorelation] <=> a[:coorelation] }
end

  # Prints the results into a table that org-mode can easily reformat.
def print_cochange cochange
puts "| file | commits | cofile | coorelation | count |"
puts "|-|-|-|-|-|"

    cochange.each do |file|
puts "|#{file[:file]}|#{file[:commits]}|#{file[:cofile]}|#{(file[:coorelation]*100).to_i}%|#{file[:cocount]}|"
end
end

And now we’ll create a cochange_by_commit.rb script that does our query.


require './query_and_pivot.rb'

  cochanges = query_and_loop do |db|
db.execute "
select id, name from commit_files where id in
(select commits.id from commits, commit_files
where commits.id = commit_files.id and author_when >= ?
)", ['2011-01-01']
end

  print_cochange cochanges

| file                          | commits | cofile                     | coorelation | count |
|-------------------------------+---------+----------------------------+-------------+-------|
| CHANGELOG.md | 4 | lib/git/version.rb | 100% | 4 |
| lib/git/config.rb | 4 | tests/units/test_config.rb | 100% | 4 |
| Gemfile.lock | 3 | git.gemspec | 100% | 3 |
| tests/units/test_config.rb | 5 | lib/git/config.rb | 80% | 4 |
| lib/git.rb | 4 | git.gemspec | 75% | 3 |
| tests/units/test_remotes.rb | 8 | lib/git/lib.rb | 75% | 6 |
| tests/units/test_tags.rb | 4 | lib/git/lib.rb | 75% | 3 |
| tests/units/test_index_ops.rb | 8 | lib/git/lib.rb | 75% | 6 |
| VERSION | 7 | git.gemspec | 71% | 5 |
| tests/units/test_logger.rb | 6 | lib/git/lib.rb | 66% | 4 |
| lib/git/diff.rb | 8 | lib/git/lib.rb | 62% | 5 |
| tests/units/test_remotes.rb | 8 | lib/git/base.rb | 62% | 5 |
| lib/git/base.rb | 31 | lib/git/lib.rb | 61% | 19 |
| VERSION | 7 | lib/git/version.rb | 57% | 4 |
| tests/units/test_init.rb | 7 | lib/git/lib.rb | 57% | 4 |
| tests/units/test_init.rb | 7 | lib/git/base.rb | 57% | 4 |
| tests/units/test_lib.rb | 12 | lib/git/lib.rb | 50% | 6 |
| tests/units/test_index_ops.rb | 8 | lib/git/base.rb | 50% | 4 |
| tests/test_helper.rb | 9 | lib/git/lib.rb | 44% | 4 |
| lib/git/version.rb | 12 | git.gemspec | 41% | 5 |
| lib/git/version.rb | 12 | CHANGELOG.md | 33% | 4 |
| lib/git/version.rb | 12 | VERSION | 33% | 4 |
| tests/test_helper.rb | 9 | tests/units/test_lib.rb | 33% | 3 |

I’m limiting the output to only show files that have changed more than 2 times, mainly to get the data down to something manageable to show on this point. The first two columns show the file name and the number of commits that it’s in (for the selected time period). The next few columns show the file that changed at the same time as the first time, and the % of times that it was in the commit. So a correlation of 1.0 indicates they change everytime together. For 0.75 it changes 3 out of 4 times that the main file changes.


Most of the things here make sense. Unit tests change at a high coupling rate as the files that they appear to test, but you also can see that lib/git/lib.rb and lib/git/base.rb change a lot with unit tests that are testing other parts of the system. Meaning, the unit test code isn't super modular with the classes that it's testing, and there's probably a bunch of stuff in these base and lib modules that have spread responsibilities.


Why isn’t VERSION totally correlated to lib/git/version.rb ?


Looking above we can see that changes to VERSION are correlated to lib/git/version.rb only 57% of the time. Weird. Let's see if we can figure out what those other changes are.


We are going to use the sql EXCEPT keyword to first get a list of all of the commit ids that reference VERSION, and substract that from the list of ids that reference lib/git/version.rb to get the commits where they aren't co-occurring.


sqlite3 test.db "select id from commits where author_when >= '2011-01-01' 
and id in
(select id from commit_files where name = 'VERSION'
EXCEPT
select id from commit_files where name = 'lib/git/version.rb')"

6d5bacd
6db4fdc
cc6d6ef

Now if we do git show 6d5bacd we can see that, in fact, the change was that VERSION was removed from the repo since it was restructured. Not a super interesting finding, but it's an anomaly so it's fun to check out.


Active days


Let’s see which days of this project were the most active.


sqlite3 test.db "select count(*), date(author_when) from commits 
group by date(author_when) order by count(*) desc limit 10" -header -column

count(*)    date(author_when)
---------- -----------------
16 2013-04-11
14 2015-01-05
12 2007-11-16
12 2007-11-19
12 2014-01-13
11 2015-01-12
10 2009-02-12
9 2013-08-16
8 2009-08-01
8 2013-04-29

Just glancing at this you can see that there’s a flurry of activity that happened over wide gaps of time. Closing in one of the events you can get a sense of what happened — someone tackled a bunch of backlog work on the project and merged quite a few pull requests:


sqlite3 test.db "select author_name, summary from commits where date(author_when) = '2015-01-05'"

Roberto Decurnex|Updating checkout tests to cover `-b` option
Roberto Decurnex|Fixing :new_branch usage on checkout
Roberto Decurnex|Merge pull request #37 from JangoSteve/ls_remote
Roberto Decurnex|Merge pull request #161 from xavier-calland/fetch_prune
Roberto Decurnex|Updating `clone` RDoc Sorting clone options closes #178
Roberto Decurnex|Merge pull request #132 from arnvald/test-branch-create-does-not-switch-branch
Roberto Decurnex|Updating clone --b to clone --branch (just to make it a little more verbose)
Roberto Decurnex|Merge branch 'test_unit_needs_specificity' of https://github.com/kwstannard/ruby-git into kwstannard-test_unit_needs_specificity
Roberto Decurnex|Merge branch 'diff_fix_current_vs_head' of https://github.com/francisluong/ruby-git into francisluong-diff_fix_current_vs_head
Roberto Decurnex|Merge branch 'NotDaveLane-master'
Roberto Decurnex|Adding Git.clone test for :branch option
Roberto Decurnex|Merge branch 'master' of https://github.com/NotDaveLane/ruby-git into NotDaveLane-master
Roberto Decurnex|Merge branch 'master' of github.com:schacon/ruby-git
Roberto Decurnex|Updating ssh_key -> git_ssh

If you were in charge of testing things, you’d definately want to be a bit more thorough than normal.


Change coupling by time


Commits aren’t necessarily the only unit of work on a team. Depending upon the process and work style, commits could be small and frequent (I’m sure with comments like argggh and work dammnit) or they could be well structured units of work that can be thought of as logically discrete units.


For those repositories that aren’t like that, let’s group changes by time and see if that gives us any interesting insights in to which files where modified together.


require './query_and_pivot.rb'

  cochanges = query_and_loop do |db|
db.execute "
select date(author_when), name from commits, commit_files
where commits.id = commit_files.id
and author_when >= ?", ['2011-01-01']
end

  print_cochange cochanges

| file                          | commits | cofile                        | coorelation | count |
|-------------------------------+---------+-------------------------------+-------------+-------|
| CHANGELOG.md | 4 | lib/git/version.rb | 100% | 4 |
| lib/git/config.rb | 4 | tests/units/test_config.rb | 100% | 4 |
| tests/units/test_remotes.rb | 6 | lib/git/lib.rb | 100% | 6 |
| tests/units/test_tags.rb | 3 | lib/git/lib.rb | 100% | 3 |
| tests/units/test_init.rb | 6 | lib/git/lib.rb | 100% | 6 |
| lib/git/object.rb | 4 | lib/git/lib.rb | 100% | 4 |
| lib/git/object.rb | 4 | lib/git/base.rb | 100% | 4 |
| tests/units/test_base.rb | 4 | lib/git/lib.rb | 100% | 4 |
| tests/units/test_index_ops.rb | 6 | lib/git/lib.rb | 100% | 6 |
| lib/git/path.rb | 4 | lib/git/base.rb | 100% | 4 |
| VERSION | 6 | lib/git/lib.rb | 83% | 5 |
| tests/units/test_remotes.rb | 6 | lib/git/base.rb | 83% | 5 |
| lib/git/base.rb | 18 | lib/git/lib.rb | 83% | 15 |
| tests/units/test_init.rb | 6 | lib/git/base.rb | 83% | 5 |
| tests/units/test_config.rb | 5 | lib/git/config.rb | 80% | 4 |
| lib/git.rb | 4 | git.gemspec | 75% | 3 |
| lib/git/diff.rb | 8 | lib/git/lib.rb | 75% | 6 |
| lib/git/config.rb | 4 | README.md | 75% | 3 |
| tests/units/test_base.rb | 4 | .travis.yml | 75% | 3 |
| lib/git/path.rb | 4 | .travis.yml | 75% | 3 |
| lib/git/path.rb | 4 | lib/git/lib.rb | 75% | 3 |
| lib/git/path.rb | 4 | tests/units/test_index_ops.rb | 75% | 3 |
| .travis.yml | 14 | lib/git/lib.rb | 71% | 10 |
| tests/units/test_logger.rb | 6 | lib/git/lib.rb | 66% | 4 |
| VERSION | 6 | git.gemspec | 66% | 4 |
| tests/units/test_index_ops.rb | 6 | .travis.yml | 66% | 4 |
| tests/units/test_index_ops.rb | 6 | lib/git/base.rb | 66% | 4 |
| tests/units/test_lib.rb | 11 | lib/git/lib.rb | 63% | 7 |

I truncated the table since there are a lot more things that are correlated now that we’ve grouped by date. A lot of things make sense here, but again it seems like files are changed together — a lot which could be nothing but makes you wonder about modularity of files.


lib/git/base.rb and lib/git/lib.rb are both changed a lot (18 times) and frequently changed together 83% of the time.


Perhaps things have been split up prematurely if multiple files always change at the same time? Or is this normal in terms of how object oriented encapulsation works? Based upon the generic all encompasing names (base and lib) I wonder if you could guess why something would go into one file or another.


Knowledge and maturity



  1. Knowledge of a file is related to how much someone touched it

  2. Knowledge decays over time

  3. Active files means that something is changing

  4. People come and go on projects



  • Normalizing names


One thing to note in this data is that scott Chacon is treated differently than Scott Chacon, similar to Roberto Decurnex and robertodecurnex. One simple way to normalize this is to update your index database to add the alias and mappings there.


These need to be identified by hand and retweaked depending upon the particular repo you are looking at.


sqlite3 test.db "update commits set author_name = 'Scott Chacon' where author_name = 'scott Chacon'"
sqlite3 test.db "update commits set author_name = 'Roberto Decurnex' where author_name = 'robertodecurnex'"

Authorship of a file


Let’s look at who “knows” about a file based upon how much they’ve authored it. I’m going to pick on lib/git/base.rb since that seems to be an exciting hotspot.


sqlite3 test.db "select name, author_name, count(commits.id) as commits, date(max(author_when)) as last_touched
from commits, commit_files
where commits.id = commit_files.id
and name = 'lib/git/base.rb'
group by name, author_name order by count(commits.id) desc" -column -header

name             author_name   commits     last_touched
--------------- ------------ ---------- ------------
lib/git/base.rb Scott Chacon 30 2008-05-27
lib/git/base.rb Roberto Decu 23 2015-01-12
lib/git/base.rb Jorge Bernal 3 2008-05-06
lib/git/base.rb Joe Moore 2 2013-04-29
lib/git/base.rb Bryan Larsen 1 2009-10-13
lib/git/base.rb Cameron Wals 1 2014-01-28
lib/git/base.rb Daniel Mendl 1 2009-02-10
lib/git/base.rb Harald Sitte 1 2015-01-12
lib/git/base.rb James Rosen 1 2008-12-21
lib/git/base.rb Jonathan Rud 1 2008-12-20
lib/git/base.rb Joshua Peek 1 2008-03-15
lib/git/base.rb Ken Pratt 1 2008-08-13
lib/git/base.rb Michael Mall 1 2013-05-23
lib/git/base.rb TJ Biddle 1 2013-06-14
lib/git/base.rb Tom Potts 1 2018-05-03
lib/git/base.rb Yuya.Nishida 1 2014-07-07

Looks like Scott Chacon and Jorge Bernal moved on to other things in 2008, and Roberto did a bunch of work in 2015 but the file has been untouched until Tom Potts did something in 2018.


Note that Roberto last touched the specific file lib/git/base.rb on 2015-01-12 though he was last active in the repo for another year.


sqlite3 test.db "select author_name, date(max(author_when)) as last_active 
from commits where author_name = 'Roberto Decurnex'" -column -header

author_name       last_active
---------------- -----------
Roberto Decurnex 2016-02-25

Authorship by weight


We can also look to see how many lines of code were added by someone, to go a little deeper. Maybe they only made 1 commit, but it changed every last thing?


sqlite3 test.db "select author_name, date(max(author_when)) as last_touched, sum(added) as added
from commits, commit_files where commits.id = commit_files.id and name = 'lib/git/base.rb'
group by author_name order by added desc" -column -header

author_name   last_touched  added     
------------ ------------ ----------
Scott Chacon 2008-05-27 520
Roberto Decu 2015-01-12 192
Jorge Bernal 2008-05-06 23
James Rosen 2008-12-21 18
Joe Moore 2013-04-29 13
Tom Potts 2018-05-03 11
TJ Biddle 2013-06-14 9
Daniel Mendl 2009-02-10 6
Joshua Peek 2008-03-15 4
Bryan Larsen 2009-10-13 2
Cameron Wals 2014-01-28 2
Jonathan Rud 2008-12-20 2
Ken Pratt 2008-08-13 2
Michael Mall 2013-05-23 2
Yuya.Nishida 2014-07-07 2
Harald Sitte 2015-01-12 1

We do see that Tom Potts moved up a bit in the list.


Who knows about that file?


If you had a question about lib/git/base.rb who would you ask? Scott Chacon wrote most of it, but he hasn't been around in a while. Roberto Decurnex was the last person to make substantial changes to it, but that was back in 2015. Even if he was still active on the project, which he doesn't appear to be, do we think he'd remember any of the details? Does the most recent contributor make sense to talk to, even though Tom Potts only made a small change in 2018?


We know that the file is signifigant because it used to change a lot with a bunch of other files. Maybe the file is super clean and looking at it means that all of its resposibilities are self-evident — I haven’t opened it up yet — but my sense is that it probably has a lot of stuff in there, and that knowledge will have to be recreated the next time that anyone opens it up to make a change.


Stable or dead?


Instead of focusing on the things that have changed, lets take a look at the things that haven’t. Right now we don’t have a good way to see if a file is still in latest revision, so we can put that into the database as well and join off of it.


require 'sqlite3'

  db = SQLite3::Database.new "test.db"

  db.execute <<-SQL
create table if not exists current_files (
name TEXT );
SQL

  db.execute "delete from current_files;"

  count = 0
`cd /tmp/ruby-git;git ls-files`.each_line do |file|
file.chomp!

    db.execute "insert into current_files (name) VALUES (?)", [file]

count += 1
end

  puts "#{count} current files"

596 current files

Now that we have that list, we can join off of it to get when things were last modified and by whom.


I’m going to filter out the test files since there's a lot of sample data there, not that test code isn't important.


sqlite3 test.db "select current_files.name, 
date(max(author_when)) as last_touched,
author_name as author
from commits, commit_files, current_files
where commits.id = commit_files.id
and commit_files.name = current_files.name
and current_files.name not like 'test%'
group by current_files.name order by max(author_when) asc" -column -header

name                          last_touched  author      
---------------------------- ------------ ------------
lib/git/working_directory.rb 2007-11-08 Scott Chacon
lib/git/index.rb 2007-11-16 Scott Chacon
LICENSE 2008-05-26 Scott Chacon
lib/git/stash.rb 2008-12-21 James Rosen
lib/git/branches.rb 2013-04-10 Roberto Decu
lib/git/repository.rb 2013-04-10 Roberto Decu
.gitignore 2013-04-11 Roberto Decu
lib/git/log.rb 2013-08-17 Roberto Decu
Gemfile 2014-01-13 Roberto Decu
lib/git/remote.rb 2014-01-28 Cameron Wals
lib/git/path.rb 2014-07-07 Yuya.Nishida
lib/git/author.rb 2015-01-12 Roberto Decu
lib/git/object.rb 2015-01-12 Roberto Decu
lib/git/status.rb 2018-03-07 Vern Burton
lib/git/base.rb 2018-05-03 Tom Potts
Rakefile 2018-05-16 Per Lundberg
ISSUE_TEMPLATE.md 2018-06-20 Vern Burton
lib/git/stashes.rb 2018-08-01 Vern Burton
.travis.yml 2018-08-22 Vern Burton
lib/git/base/factory.rb 2018-10-02 Evgenii Pech
lib/git/config.rb 2019-09-20 Salim Afiune
git.gemspec 2019-12-11 James Coubal
lib/git/diff.rb 2019-12-11 James Coubal
.github/stale.yml 2020-01-19 Per Lundberg
lib/git.rb 2020-01-22 cyclotron3k
lib/git/branch.rb 2020-01-23 James Coubal
CONTRIBUTING.md 2020-01-25 James Coubal
MAINTAINERS.md 2020-01-25 James Coubal
RELEASING.md 2020-01-25 James Coubal
PULL_REQUEST_TEMPLATE.md 2020-02-04 Yuta Harima
README.md 2020-04-05 Alex Mayer
lib/git/lib.rb 2020-04-05 TIT
CHANGELOG.md 2020-04-25 James Coubal
lib/git/version.rb 2020-04-25 James Coubal

So from this we can see which parts of the project are pretty much stable from early days, working_directory.rb probably up through object.rb and which things have been changing for this stable project. In the beginning of the year there was a bunch of work for things related to project adminstration, like PULL_REQUEST_TEMPLATE.mb and RELEASING.md.


Thoughts


Coding creates artifacts that change over time, and we have access to this data in the repositories that we share. Having everything accessable in SQL helps with complicated joins and analysis. I’ve scripted this in Ruby and we are looking at a Ruby project, but nothing that we’ve done has looked inside of the project itself to identify what are areas of note and of interest. It’s been purely on the shape of how things are changing, what is changing together, and meta data analysis.


Now that we’ve gotten a taste of what this looked like here — and I’m sure there’s much more we can look into — there are a few other sources of meta data that we should look at:



  1. Issues/tickets and how they related to the code

  2. Mailing lists? Chat rooms? other discussions around the repository

  3. Projects that rely upon this as a dependency, and how/if those users are putting in tickets or bug fixes into the code

  4. Looking into the gemspec and Gemfile to examine this project's dependencies to see what's going on there.


But this is already long enough.


References



  1. Your Code as a Crime Scene: Use Forensic Techniques to Arrest Defects, Bottlenecks, and Bad Design in Your Programs

  2. Software Design X-Rays: Fix Technical Debt with Behavioral Code Analysis

← Back to Blog