skud ([personal profile] skud) wrote 2011-11-14 08:01 pm

Github, transparency, and the OTW Archive project

Well, I thought I wasn’t going to get involved in this, but apparently I am. This is not an OTW elections post, but it is a how-the-archive-project-is-run post, and a what-the-hell-went-wrong-with-that-release-the-other-day post.

First up, let me state my background and biases. I’m an open source software developer specialising in complex, database-backed web applications, with 10-15 years’ experience as such. I first developed a website of that kind in about 1996, and my most recent full-time job working with the "LAMP stack" (Linux, Apache, MySQL, and Perl) was as one of the lead developers on, one of the biggest and most highly trafficked websites in Australia. I left there to work on Freebase, a site organising vast quantities of structured information (including information about TV shows, movies, books, their actors, characters, fictional locations, episodes, release dates, etc) and providing access to it via a free API. Along the way, I've been involved in a great many open source projects, at times in a leadership role. So I have a lot of experience in web development, open source tools and processes, and open/volunteer projects to organise lots of information. Sound like the Archive? Yeah, pretty much.

In 2008 when the OTW and AO3 were kicking off, I briefly considered running for chair of AD&T. However, I was moving countries at the time (Australia to the US) and I heard that Naomi Novik wanted to lead it, so I left it to her. Sometime in 2008-2009, I was briefly on the AD&T mailing list but I got tired of it and left after a while. Ever since then, I’ve occasionally cornered AD&T staffers and let them know that I’m interested in the AO3, would like to code on it, but that I find their processes heavyweight and I’m not interested in going through them just to submit a patch or whatever. I recognise (and have told them) that the processes, meetings, training, structure, etc are useful to many people, especially those who don’t have experience with the tools and need to be taught from the ground up, but that as an experienced coder and open source contributor, I personally found them to be a barrier to my involvement.

Large open source projects have developed a range of tools and techniques over the years, and I’m going to take a little diversion into them so you understand where I’m coming from. For the record, these tools and techniques aren’t perfect, but they’re commonplace, so they’re worth knowing about.

Most open source projects have the following:

  • mailing list (long-form discussions)
  • IRC channel (ambient presence, quick turnaround help, casual chat)
  • wiki or other collaborative documentation (developer how-tos, coding styleguides, tutorials, etc)
  • issue tracker (known bugs, desired features, etc)
  • version control system (code storage, workflow management)

Generally, all of these are publicly accessible and anyone can read/join/participate in whichever of them best suit their needs. For example, check the websites for Inkscape (look under "Developers" in the sidebar), Fedora Linux (look in bottom nav area), or indeed Dreamwidth itself (click through to wiki for more developer info). The way the information’s laid out varies, but generally, you’ll find pointers to those things (mailing list, irc, docs, issue tracker, and source) fairly readily.

Now I look at the AO3 and what do I find? The FAQ doesn’t tell me anything about how to become a developer, neither under technical questions, tutorials, nor general. The "Known Issues" link is a static page with no link to the issue tracker. (If you happen to find it some other way and try to submit a bug report, you're told to use the support tool, or login if you're a project member.)

Finally, the last link in the footer is to the latest version of the AO3 code on github. This page includes no visible documentation. However, the main project page has a README file that talks about what the OTW Archive software is and how to get involved -- by filling in a volunteer form on the OTW website. There’s also a link to the Archive roadmap, which says "The Archive of Our Own is in beta as of October, 2008" and lists a bunch of release milestones but doesn’t mention their current status (nor do they mention any future directions regarding fanart, vids, etc., which I know have been under discussion.) If I had come to that project page from the OTW homepage rather than indirectly via the AO3 footer, I would have found no link whatsoever to the source code on github, nor any other developer resources.

This is what I mean when I say that the AO3/AD&T volunteer system is inaccessible to me.

Here’s a real world example. There’s a small feature I would like to implement on the AO3 website. Specifically, I think it would be useful to have author, work title, and fandom listed on the "adult content" warning page, because sometimes I open a bunch of tabs not realising I’m not logged in, then find a bunch of identical-looking pages, and get all confused. As an experienced web developer, I know this is well within my abilities, even though I’m not particularly familiar with Ruby on Rails. Here’s what I’d like to do:

  • Check the public issue tracker and see if this is a known task. If it is, and nobody’s actively working on it, comment on the task saying, "Hey, I’ll do this."
  • If it’s not visible in the issue tracker, I’d probably jump on IRC or the mailing list to introduce myself and say "I’d like to do this, okay?" IRC for preference -- I already have the client open most of the time -- but I’ll join the mailing list if I foresee myself getting deeply involved in development work.
  • Check out the codebase and hack on it. I would use the project’s wiki to understand the layout of the code, how to get the app up and running, how to test it, any expected coding style guidelines, etc.
  • If I get stuck, ask on IRC or the mailing list.
  • Submit a patch, probably via the distributed version control system, but perhaps via some other mechanism.
  • Have the patch reviewed and either have it accepted, or be told I need to tweak it slightly because I missed something. I would expect that if the patch were going to be outright rejected (i.e. it’s fundamentally undesirable) that I would have found this out when I asked on IRC or the mailing list, earlier.
  • A while later, see my patch included in a release.
  • Go on my way, happy that I have contributed something (and can figure out what fic it is I opened in that tab, yay!)

Here’s my understanding of how it works with AO3, at least until recently:

  • Stare at website. Poke around convoluted links. Eventually find volunteer form. (Or maybe you already know how OTW works so you went there immediately.)
  • Fill in volunteer form. Wait an unknown period of time.
  • Get asked to join a committee and/or undergo training (not clear on what exactly is involved these days, but something like this I’m guessing).
  • ???
  • ???

So, this is why I haven’t been involved in AO3 coding so far. As far as I can tell, the AO3 project either doesn’t have some of the things I need to be able to participate in the way I want, has them but doesn’t provide public access to them, or has them but doesn’t make them findable. All in all, it just makes me sigh and say "too hard".

Hopefully that’s useful background. Now I’d like to talk about version control systems. I’ve mentioned them in passing, above, but they’re really important so I’m going to go off into a bit of a tangent explaining them. If you know about them already, you might want to skim this bit.

I’m going to start out with an example most fanfic authors will be familiar with. Let’s say you’re working on a fic, and you send it to your beta. They’re reading through it, and at the same time (because you can’t help yourself), you’re still editing on your master version. At some point, you’ll probably get confused about who’s looking at what, who has the latest version, and so on. If you have cowriters, this gets even worse. Yes?

So, working on code is the same -- or even worse, because losing track of the latest version can make all kinds of stuff break, even at a distance.

Source code is made up of files, basically just text files. You can work on them in a text editor, or in a fancy code editing application, but fundamentally it’s all just text files. So let’s imagine you have two people working on the same file. And let’s imagine that the file’s stored on a shared fileserver somewhere, so you both have access to it. In order for them not to make a mess of things, the two people are going to have to be very careful about who’s working on what.

Person A: Hey, I want to work on file X.
Person B: Go ahead, I’m not using it.
Person A: *works on file X*
Person A: OK, I’m done!
Person B: Cool, I wanted to change something in there, so I’m going to grab it now.
Person A: Go ahead!

So far so good, but sometimes you get all tangled up.

Person A: Whoa, it says you have this file open already.
Person B: Oops, I forgot to save it! OK, I’ve saved it. You can go ahead now.
Person A: But wait, if I save mine now, won’t that overwrite yours?
Person B: … Oh. Um.

Ever heard of the "talking stick"? Sometimes in some forms of meetings, only the person with the talking stick is allowed to speak. When they’re done they hand the stick to someone else. Some coding teams use a similar mechanism. For instance, one team lost in the dawn of history (except that their practice bestowed a silly name on the current leader of the Perl project) had a stuffed toy pumpkin. Only the person with the stuffed toy pumpkin on their desk was allowed to edit the source code at any given time.

Luckily for teams not in physical proximity, you can get the computer to do this for you. It’s called "file locking". One of the first version control systems I ever used, RCS, worked like this. To work on a file you would type co -l filename to check out the file (the -l asks for a lock), as in checking it out from a library. This locks it so nobody else can edit it. Then when you’re done editing you’d type ci filename to check it back in. When you checked in the file, it would ask you for a checkin summary, where you’d type a short description of what you’d done, e.g. "Fixed an off-by-one bug in the foo() subroutine".

While you were working on a file, nobody else could check it out. RCS’s "lock" was like the talking stick or the pumpkin.

As a bonus, RCS let you do some other handy things as well. You could run rcsdiff ("diff" as in "difference") to see what changes you’d made since you checked the file out, comparing it to the previous version. You could also type rlog to get a log of changes to a given file, based on previous check-in messages. If you had multiple people working on a file, it would tell you who had done what, and when.
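If it helps to see the "talking stick" idea as code, here is a minimal Python sketch of lock-based checkout. It is a toy model of the idea, not how RCS itself is implemented; the class and method names (LockingStore, check_out, check_in) are made up for illustration.

```python
import difflib


class LockingStore:
    """Toy model of lock-based version control: one file, one editor at a time."""

    def __init__(self, text):
        self.versions = [text]      # every checked-in version, oldest first
        self.log = []               # (author, message) for each check-in
        self.locked_by = None       # who currently holds the "talking stick"

    def check_out(self, author):
        if self.locked_by is not None:
            raise RuntimeError(f"locked by {self.locked_by}")
        self.locked_by = author
        return self.versions[-1]

    def check_in(self, author, new_text, message):
        if self.locked_by != author:
            raise RuntimeError(f"{author} does not hold the lock")
        self.versions.append(new_text)
        self.log.append((author, message))
        self.locked_by = None       # hand the stick back

    def diff(self, new_text):
        """Line-by-line differences against the latest checked-in version."""
        old = self.versions[-1].splitlines()
        return list(difflib.unified_diff(old, new_text.splitlines(), lineterm=""))


store = LockingStore("chapter one\nchapter two\n")
draft = store.check_out("A")

try:
    store.check_out("B")            # B has to wait: A holds the lock
    b_blocked = False
except RuntimeError:
    b_blocked = True

edited = draft.replace("two", "2")
changes = store.diff(edited)        # what A changed since checking out
store.check_in("A", edited, "Renumbered chapter two")
b_copy = store.check_out("B")       # now B can take the file
```

The log list plays the part of the change log, and diff plays the part of comparing your working copy to the previous version.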

So far so good, but it was still a problem that only one person could work on a file at a time. What if it was an enormously long file, and Person A wanted to work on the first section, while Person B wanted to work on something at the very end? They wouldn’t be stepping on each other, so why couldn’t they work on it at the same time? And, of course, there was the eternal problem of Person A checking out a file and keeping it for ages and ages, like an endlessly-renewed library book, while Person B was twiddling their thumbs and couldn’t work on anything.

Enter the Concurrent Versions System, aka CVS. This was what I really cut my teeth coding on. It was great (in its day, i.e. the late 90s) because multiple people could work on stuff at the same time. You could check out the code, work on it, check it in again, and merge your work with the other contributors’ work. Mostly this would happen seamlessly, though occasionally you’d have to step in and fix things up if two people were working on the same part of the same file. At least it gave you tools to understand what was going on and do this without too much trouble.
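To make "merging" a little more concrete, here is a deliberately simplified Python sketch. It compares two edited copies against the version they both started from, line by line. Real tools also handle inserted and deleted lines; this one assumes every version has the same number of lines, and the function name merge_lines is my own invention.

```python
def merge_lines(base, mine, theirs):
    """Merge two edited copies of the same file, line by line.

    Simplification: all three versions must have the same number of lines,
    so this handles in-place edits only, not insertions or deletions.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles same-length versions")
    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:              # both agree (or neither changed it)
            merged.append(m)
        elif m == b:            # only they changed this line
            merged.append(t)
        elif t == b:            # only I changed this line
            merged.append(m)
        else:                   # both changed it differently: needs a human
            merged.append(m)
            conflicts.append(number)
    return merged, conflicts


base = ["intro", "middle", "ending"]
mine = ["INTRO", "middle", "ending"]       # Person A edits the start
theirs = ["intro", "middle", "ENDING"]     # Person B edits the end

merged, conflicts = merge_lines(base, mine, theirs)

# When both people edit the same line, the merge reports it instead of guessing.
clash, clash_conflicts = merge_lines(
    base,
    ["intro", "A's middle", "ending"],
    ["intro", "B's middle", "ending"],
)
```

The first merge goes through cleanly because the two edits touch different lines; the second reports line 2 as a conflict, which is the "step in and fix things up" case.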

CVS wasn’t perfect though. Its main problem was that if someone went off to work independently for a long while, they couldn’t check in their intermediate steps. If their hard drive died, or something went wrong while they were working by themselves, they’d lose everything. They couldn’t collaborate with a smaller team off on that sub-project; they had to work alone. And, when they did come to check things in, it would be an enormous wodge of code with a checkin message like "Changed everything totally, wheee!" which wasn’t much use for other project participants who wanted to understand the step-by-step process.

The fix for most of these problems is called branching. The main development of the code you’re working on is the "trunk". You can work on the trunk, or if you want to go off on a sub-project for a while, you can create a "branch" and work independently. The version control system keeps track of all your independent work, letting you check in changes on the branch and make comments along the way. When you reach a good point, you can merge your branch back into the trunk, which picks up the history of everything you've done. Subversion (aka SVN) is a version control system similar to CVS, but with better support for branching.
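Here is a rough Python sketch of what "keeping track of a branch" might look like underneath. It is a toy model, not how SVN actually stores anything: each commit remembers its parent(s), a branch is just a name pointing at its latest commit, and a merge is a commit with two parents. The Repo class and the "skins" branch name are purely illustrative.

```python
class Repo:
    """Toy history: each commit records its message and its parent commit(s)."""

    def __init__(self):
        self.commits = {}                   # id -> (message, parent ids)
        self.branches = {"trunk": None}     # branch name -> latest commit id

    def commit(self, branch, message, extra_parent=None):
        parents = [p for p in (self.branches[branch], extra_parent) if p is not None]
        commit_id = len(self.commits) + 1
        self.commits[commit_id] = (message, parents)
        self.branches[branch] = commit_id
        return commit_id

    def branch(self, new_name, from_branch="trunk"):
        # A new branch starts at whatever the source branch currently points to.
        self.branches[new_name] = self.branches[from_branch]

    def merge(self, source, into="trunk"):
        # A merge is a commit with two parents, so both histories are kept.
        return self.commit(into, f"Merge {source} into {into}", self.branches[source])

    def history(self, branch):
        """All commit messages reachable from a branch, oldest first."""
        seen, stack = set(), [self.branches[branch]]
        while stack:
            cid = stack.pop()
            if cid is not None and cid not in seen:
                seen.add(cid)
                stack.extend(self.commits[cid][1])
        return [self.commits[cid][0] for cid in sorted(seen)]


repo = Repo()
repo.commit("trunk", "Initial import")
repo.branch("skins")
repo.commit("skins", "Add skin model")
repo.commit("skins", "Add skin editor")
repo.commit("trunk", "Fix typo on trunk")
before = repo.history("trunk")      # trunk knows nothing about the branch yet
repo.merge("skins")
after = repo.history("trunk")       # trunk now includes every step from the branch
```

The point to notice is that after the merge, the trunk's history contains each individual step taken on the branch, with its own message, rather than one enormous lump.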

CVS and SVN required a centralised server to host the project. If you didn’t have network access, you couldn’t check in. If your central server died, you’d lose all the history of everyone’s changes. It was a very hierarchical, centralised approach, and it certainly had some drawbacks.

In the last, oh, five years or so we’ve seen a number of new version control systems come into prominence that address this centralisation problem. They’re called distributed version control systems (DVCS). In a DVCS, there’s no real "trunk" or "branch", no central point of failure. Anyone can branch from anywhere and merge to anywhere. As a bonus, you can work offline without talking to the central server, then when you want to, you can connect to anyone else’s repository and merge with them. The Wikipedia article (linked above) provides a pretty good overview of the benefits. Common DVCSs include Git (used by the AO3 project) and Mercurial (used by the Dreamwidth project).

Although you don’t have to have a centralised server with DVCSs, it’s common for most open source projects to have some repository that they can point at and say "Here’s our code" and, more importantly, "If you want to work on our project, you can take a copy of the code from here, and submit patches back to it." This repository isn't really any more "officially central" than the ones on each developer's personal computer, but it is more convenient to point people to just one known place where your code can be found.

Github is a service that provides hosting for these public repositories, for free, to open source projects. There are a bunch of other sites that provide similar hosting, but Github wins over them because it also offers some snazzy web interfaces and social features, which means that it’s become a fantastic tool for making an open source project’s development more transparent, and encouraging new contributions. Github, as it happens, is also where the AO3 code is currently hosted. I'll get to that later. For now, here are some interesting github pages that should give you a flavour of what the site is like.

That last link is the one I really wanted to get to. Take a look at it now. Here’s a screenshot just in case time’s passed and you’re reading this far in the future and things have changed:

network diagram showing one main line for the master development branch, then branch forking off and then merging

That’s one evening’s development on a very simple project. What happened was:

  • I created a new repository and uploaded a few files. Hover over the first black dot (in the live version, not the image above) and you'll see.
  • I made a few small changes. These are indicated by the black dots to the left. (You read from left to right.) If you hover over any of those first few black dots, you can see my name and the commit message. If you click on the dot, it will take you to a separate page for the commit, like this one where I first uploaded the files to the writtenkitten project.
  • Then another user, "myfreeweb" (Grigory), came along and forked the project. (Fork, branch, more or less the same thing.) Incidentally, he didn’t do this without talking to me first -- he mentioned on Twitter that he’d like to improve the code, so I put it on github with this particularly in mind.
  • Grigory worked on the code for a while, making several commits to his own branch (the blue dots).
  • Grigory then sent me this pull request, which meant I got an email saying, "Grigory’s done all this work, and would like you to pull it across and merge it in to your branch."
  • After a little back-and-forth (which you can read in the pull request linked above), I accepted the pull request and merged his code in to mine. You can see this in the diagram where the blue line rejoins the black line.
  • I then made a couple more changes, which show as more black dots.
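The same sequence of steps can be sketched in Python, if you find code easier to follow than a diagram. This is a toy model only: it is not Git's data model or Github's API, the Repository class is invented for the example, and the commit messages are placeholders rather than the real ones.

```python
class Repository:
    """Toy repository: an owner plus an ordered list of (author, message) commits."""

    def __init__(self, owner):
        self.owner = owner
        self.commits = []

    def commit(self, message):
        self.commits.append((self.owner, message))

    def fork(self, new_owner):
        # A fork starts as a full, independent copy of the history.
        forked = Repository(new_owner)
        forked.commits = list(self.commits)
        return forked

    def pull_from(self, other):
        """Bring across commits the other repository has and this one lacks."""
        incoming = [c for c in other.commits if c not in self.commits]
        self.commits.extend(incoming)
        self.commit(f"Merge pull request from {other.owner}")
        return incoming


# Placeholder messages; the real ones are on the project's commit pages.
upstream = Repository("skud")
upstream.commit("Upload initial files")
upstream.commit("Small change")

fork = upstream.fork("myfreeweb")           # Grigory forks the project
fork.commit("Improvement 1")                # ...and works on his own copy
fork.commit("Improvement 2")

incoming = upstream.pull_from(fork)         # pull request accepted and merged
upstream.commit("Another small change")     # work continues on the main line
```

Nothing Grigory did on his fork could affect the original until the pull was accepted, which is what makes it safe to let anyone fork.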

So, that was epic, but if you’ve made it this far, you now understand about version control systems, github, and how they offer collaboration tools for distributed teams, and transparency into the changes that are being made to a codebase.

Now I’d like to actually talk about the AO3 project on github. You can see it at (As an aside, it’s a pity that it uses the name "otwarchive", because it makes it hard to search for it if you know it as "ao3" or "archiveofourown". But I digress.)

Some links of interest:

As an outsider, this actually gives me some pretty interesting insight into the project, including the release that occurred the other day. (If you missed it, you should know that the release was pretty buggy, and that led to a representative of the OTW board apologising for it and one of the coders who worked on the skins system, [personal profile] lim, resigning.)

Now, I have to admit that I am not a heavy github user, and that most of my development work has been done in CVS and SVN, but I am pretty familiar with the process of branching, merging, and release management. Here’s what I would have expected to see in the AO3 project.

  • A large branch for each major project being undertaken. That is, I would have expected to see one for yuletide-related work, one for skins-related work, one for search performance improvements, etc.
  • One person, or perhaps a small handful, working on each branch. For big projects, I would expect to see one main coder and a couple of others pitching in. There might be smaller branches (twigs? saplings?) off these project-level branches.
  • These branches should merge into the "master" branch at regular intervals, and many of them would converge on the master branch as a major release approached.
  • A healthy project will also, typically, have a bunch of branches that go nowhere, as people make a branch to experiment with something, and never merge it back in.

Here’s a project that follows the pattern I’ve described above: Rails. (Rails (aka Ruby on Rails) is the software that underlies the AO3, in fact.) Their network graph is enormous, but here’s a screencap of part of it.

large network diagram, showing many branches merging into master

This shows a section of the graph from about a week before the latest release (release date August 30th, this shows August 23rd). What you can see here is a lot of branches, many of them merging back into the master branch (black, at the top) as the release date approaches.

So, here’s the AO3 project about a week before the release that went out on November 11th:

otwarchive network graph, showing various branches merging into the master

Again, you can see it live on and scroll back to early November, then hover over the dots to see individual commits. If you click on a dot you’ll see a whole page describing a commit.

So, what I would expect to see here is a big branch for the skins work that’s going on, and another for any Yuletide-related preparation, and another for any other set of work that’s being done. The skins and yuletide stuff, being fairly major projects, should be fairly long-lasting branches. But they’re not there. The longest-lasting project branches I can see are both Cesy's, working on prompt meme stuff (bright pink) and collections (yellow), with several commits on each branch before merging into the master.

Instead, I see username "shalott" (errr, I’m not sure whether I’m allowed to connect that to her wallet name, but I’m assuming everyone knows who that is?) creating very short lived branches and making a bunch of changes like:

Fixes for many bugs, improvements to signup and tag set nominations, fixed logo wooo (42 changed files with 286 additions and 141 deletions).

Another slew of fixes and all of lim's amazing preloaded skins \o/ (136 changed files with 1,184 additions and 313 deletions)

A few more updates to the site CSS, a few more small image resources, and the stack of amazing demo skins lim has written and a rake task to load them. (17 changed files with 1,074 additions and 2 deletions)

Many, many fixes and continuing improvements as well as a host of jointly collected image resources. YAY for lightgetsin for amazing screenreader feedback \o/ (118 changed files with 267 additions and 248 deletions)

In other words, shalott is aggregating big piles of work, sometimes from multiple sources, and committing them all in a lump -- sometimes over a thousand lines of code in one commit, with a vague commit message, then rapidly merging the results into the master branch. These merges often happened in quick succession, with no apparent testing or revision before they were done. (Cesy's branches show more accurately what happens when a branch is tested before merge -- a few commits for the main work, then some fix-ups and tweaks to unit tests, and the like, before the final merge.)
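To put numbers on "big piles of work", here is a small Python sketch that takes the figures from the four commit summaries quoted above and flags any commit that looks too large to review in one go. The thresholds (30 files, 400 lines) are my own arbitrary assumptions for illustration, not any kind of standard.

```python
from dataclasses import dataclass


@dataclass
class CommitStat:
    summary: str
    files: int
    additions: int
    deletions: int

    @property
    def lines_touched(self):
        return self.additions + self.deletions


# Figures copied from the four commit summaries quoted above.
commits = [
    CommitStat("Fixes for many bugs", 42, 286, 141),
    CommitStat("Another slew of fixes", 136, 1184, 313),
    CommitStat("A few more updates to the site CSS", 17, 1074, 2),
    CommitStat("Many, many fixes", 118, 267, 248),
]

# Hypothetical review thresholds, chosen only for illustration.
MAX_FILES = 30
MAX_LINES = 400


def needs_splitting(stat):
    """True if a commit touches too many files or lines to review comfortably."""
    return stat.files > MAX_FILES or stat.lines_touched > MAX_LINES


flagged = [c for c in commits if needs_splitting(c)]
for c in flagged:
    print(f"{c.summary}: {c.files} files, {c.lines_touched} lines touched")
```

Under those assumed limits, all four of the quoted commits get flagged, three of them on file count alone.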

This does not seem, to me, to be a well managed project. This is a project where the project lead is acting as a gatekeeper, committing huge swathes of code (sometimes on behalf of third parties) with inadequate documentation, and allowing extremely poor branch hygiene (skins project mixed up with other changes, for example) to infect the main branch, leading to a buggy release. This should not have happened, and, I suspect, would not have happened if the OTW’s technical leaders had had, or had sought the advice of people who had, experience with distributed open source software development projects and the tools they typically use.

(Worst of all, I think that this poor management is directly responsible for [personal profile] lim resigning. As a relatively inexperienced coder, she should have been shown how to use a branch for her work. If she wasn't able to for some reason, and needed the project lead to proxy her work, the project lead should have made a branch for it. Without this in place, her code was all mixed up with the stuff that had to be done for this Yuletide release, and so she was forced to rush and work 14 hour days to get it in. That sort of overwork just leads to more errors. None of this would have happened if the project lead had insisted on proper branching for each sub-project -- a simple process that any open source software team lead should do as a matter of course. And lim thinks it's her own fault, which is the worst thing of all.)

The AO3 software development project seems to desperately need people who have experience with version control systems, with test driven development ([personal profile] jennyst has written about this a bit), and with release and deployment techniques for large-scale web apps. It also needs people who understand how to design systems that perform at scale, skilled user experience and UI designers, and a whole slew of other specialised skills.

Now I think it is fantastic that the AO3 welcomes and trains up new developers. Seriously, I think it is one of the best things about the project, and I would love it if the open source world at large (indeed, the whole tech community/industry) would do more of it. But it needs to be balanced by having people who know what they’re doing. It needs technical leads who have more experience with large-scale web app development than a single web app that was known not to scale. Most of all it needs project managers who know how to manage a project of this scale, following best practices and so forth.

The thing is, experienced software developers -- I’m one, and I know dozens of others who are probably reading this, many of whom I’ve spoken to about this recently -- don’t want to touch the AO3 project with a bargepole. They can’t figure out how to get involved without being treated like newbs, or how to get their experience and skill recognised. And many of them, who already know how to code and use version control and all of that, would like to contribute just casually, adding a patch here or there, rather than having to commit as an "official" volunteer. And, from what I hear, those who do go through the whole rigmarole are then... not listened to, and leave pretty promptly.

I don’t know whether this is fixable or not, but I hope it is. I have some hope that the move to github, which allows anyone to create a branch to work on, will actually allow the more experienced coders (like me) to get involved at a level we’re comfortable with. And hopefully, if there are more of us around, we’ll get listened to a bit more, and all that experience and skill can actually be put to use by the organisation, rather than brushed off.

Anyway, if you’re an experienced coder, you now know where the AO3 code is, and you know what to do with it. Here’s my fork.

[personal profile] jennyst 2011-11-15 01:28 pm (UTC)
We're in need of better incident management as well as release management. We don't have any documented process for managing an incident on production for AO3. I have a post I started drafting a while ago about service operations, IM, 1st line, 2nd line, 3rd line support, how they are different and why having AD&T chair being also IM and 3rd line lead is a recipe for disaster, particularly if they're already tired from being release manager and deployment manager. Someone may be great at all five jobs and capable of doing any of them, but you can't do all of them at the same time in a crisis. In 2009 and 2010, the AD&T chair avoided coding, so that split some of it up, but it was still an issue - I wrote my notes on why we urgently needed a separate, defined role of release manager in late 2010. My previous role in my day job involved deployment management, release management and 3rd line team management, and mixing those three was bad enough, but that had a much smaller volume of change than AO3.

The team have been doing an amazing job in very difficult circumstances - with a small team, people have worked really hard to do many roles at once, and I have nothing but admiration for everyone who's gone before and all the people I've worked with for the past two years. People are awesome, and I really admire those who've done the impossible for so long. But this is not sustainable in the long term.

[personal profile] blueraccoon 2011-11-15 07:14 pm (UTC)
Again, depending on election results, I'd be willing to sit down with someone and work out a plausible incident management process for things like code rollout and such, if you think it's warranted. If you want to talk more about it, you can PM me, or email me at blueraccoon at gmail dot com.

(Also, if you need someone to vouch for me, you can always ask [personal profile] sanders about her iWife.)