Mirrored from Infotropism. You can comment there or here.

A few times on the Growstuff mailing list or IRC channel, someone’s excitedly suggested that we should import data from another CC-licensed data set. Each time, I say, “Trust me, that’s pretty complicated,” but I’ve never actually sat down and explained the full gory details of why.

The following is something I wrote up for our wiki so that I could point people at it next time the subject comes up. I thought it might be interesting to a wider audience, too, so that’s why I’m posting it here.

Importing data is hard.

This is a bit of a rant by Skud, who used to work on Freebase, a large open-licensed data repository which imported data in bulk from a range of sources, including Wikipedia, Netflix, the Open Library Project, and many more. She’s had a lot of experience in this area, and learnt a lot about the weird complications of mass data imports.

The simple case

They have a database. You have a database. Your fields are the same. Their API is easy to use and their license is compatible.

  1. map their fields to ours, eg. their “name” is our “system_name”
  2. import data
  3. PROFIT!
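
As a rough sketch of step 1, assuming records arrive as plain dictionaries: only the “name” to “system_name” pairing comes from the example above, and the other field names are made up for illustration.

```python
# Hypothetical mapping from the other database's field names to ours.
# Only "name" -> "system_name" comes from the text; the rest is invented.
FIELD_MAP = {
    "name": "system_name",
    "latin_name": "scientific_name",
}


def map_record(their_record, field_map=FIELD_MAP):
    """Rename the fields we know about and drop everything else."""
    return {
        ours: their_record[theirs]
        for theirs, ours in field_map.items()
        if theirs in their_record
    }


their_record = {
    "name": "tomato",
    "latin_name": "Solanum lycopersicum",
    "internal_id": 42,  # no equivalent on our side, so it is dropped
}
our_record = map_record(their_record)
```

Dropping unmapped fields silently is a design choice for the sketch; a real import would probably want to log them.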

Their fields aren’t the same

What if the fields aren’t quite equivalent? For instance, let’s say they have measurements in imperial and we use metric. We’ll need to have ways to convert them. That’s actually a really simple example. Import incompatibilities are more often at a semantic/ontological level. Growstuff has the idea of “crops” and “varieties” but what if the other database only has “plants” with no distinction? Or what if they have crops and varieties but draw the line somewhere slightly different to where we do? These sorts of incompatibilities are more common than not, and massively complicate any import effort.
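
The unit conversion really is the easy part. A minimal sketch, assuming a hypothetical “spacing_in” field in inches on their side and a “spacing_cm” field on ours (neither name is from a real schema):

```python
CM_PER_INCH = 2.54  # exact by definition


def inches_to_cm(inches):
    """Convert inches to centimetres, rounded to one decimal place."""
    return round(inches * CM_PER_INCH, 1)


def convert_spacing(their_record):
    """Return a copy with the hypothetical 'spacing_in' replaced by 'spacing_cm'."""
    converted = dict(their_record)
    if "spacing_in" in converted:
        converted["spacing_cm"] = inches_to_cm(converted.pop("spacing_in"))
    return converted
```

Nothing this mechanical exists for the semantic mismatches: no function can decide on our behalf where a “crop” ends and a “variety” begins.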

Some of their data is bogus

Nothing against that other database: some of everyone’s data is bogus! But we need to check it. What “bogus” means will vary from place to place, but it might be spam entries, duplicate records, simple errors, or it might be cruft from their own broken imports. We need to look carefully at every import and make sure we’re skipping as much of this as possible. And this is largely a manual process, since the bogosity will never be the same twice. You can do this by sampling, of course, but you still need to look at something on the order of hundreds of records, and know what you’re looking for. Could you spot a mixed-up scientific name on a randomly chosen herb? I couldn’t.
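
Pulling the sample is the only part of this that automates well. A small sketch, assuming the records are already loaded into a list; the default of 200 is just a stand-in for “hundreds”:

```python
import random


def sample_for_review(records, sample_size=200, seed=None):
    """Pick a random subset of records for a human to check by hand.

    A fixed seed makes the sample reproducible, so two reviewers can be
    given the same records.
    """
    if len(records) <= sample_size:
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, sample_size)
```

The hard part, knowing what to look for in each sampled record, stays manual.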

The stub problem

Let’s say we want to import from a database of plant life that lists 10,000 edible plants and their nutritional content. Growstuff has 300 crops at present. We import everything! Now we have 9,700 pages with nothing but nutritional data. Nobody on Growstuff is using them, they have no pictures, they have no planting data, they have no discussions (except maybe spam comments that nobody cleans up because nobody notices). Our “newest crops” page, usually a source of interest, is now just a wasteland of grey placeholder images.

Should we have imported all 10,000 plants, or just the nutritional data of the 300 we already have? Or something in between? The answer is usually “something in between” — you might want that data if and only if you can get other partial data from other imports to make it more interesting.

The best way to do this is to import the 300 and make a note of the 9,700. Then later, you can cross-correlate the notes you’ve made from various data imports and re-import those that have, say, at least 3 useful data sources and a picture. But that’s pretty complicated. (Also, see the discussion of repeated imports, below.)
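
To give a feel for the bookkeeping, here is a sketch of “import the ones we have, note the rest”. The record shape (a “name” key, an optional “picture_url” key) and the source labels are assumptions for illustration, not Growstuff’s actual schema.

```python
from collections import defaultdict


def split_import(incoming, existing_names):
    """Separate records for crops we already have from ones we only note."""
    to_import, to_note = [], []
    for record in incoming:
        if record["name"] in existing_names:
            to_import.append(record)
        else:
            to_note.append(record)
    return to_import, to_note


class DeferredNotes:
    """Remembers which sources offered data for crops we don't have yet."""

    def __init__(self):
        self.sources = defaultdict(set)  # crop name -> set of source labels
        self.has_picture = set()         # crop names with at least one picture

    def note(self, record, source):
        self.sources[record["name"]].add(source)
        if record.get("picture_url"):
            self.has_picture.add(record["name"])

    def ready(self, min_sources=3):
        """Crops noted by enough sources, with a picture, worth re-importing."""
        return sorted(
            name
            for name, sources in self.sources.items()
            if len(sources) >= min_sources and name in self.has_picture
        )
```

Even this toy version needs persistent storage for the notes and a way to match crop names across sources, which is where the “pretty complicated” comes in.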

Don’t forget the license

Let’s assume that their data is licensed compatibly — that means CC-BY-SA or CC-BY in our case, since we’re CC-BY-SA and none of the other clauses (ND, NC) are compatible with us. (Ignoring CC-0 and public domain stuff for now — those don’t need attribution at all.)

So by importing, we have to credit them. Now we need some way to represent that in the database. If we do this at the object level, it’s fairly simple: each thing in the database (crop, etc) has many licensors, each of which includes a name for the work (eg. “Katie’s Plants”), a license (eg. CC-BY), a licensor name (eg. Katie Smith), and a URL to link to the original data.

Now we have to display them on the page. Where? Probably at the bottom somewhere: “Some information on this page came from: Katie’s Plants (that would be a link) — CC-BY Katie Smith; SuperPlantDB under CC-BY-SA SuperPlants Inc; etc.”
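
A sketch of the object-level version, using the four pieces of information listed above. The class and function names are invented, and the URLs in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Licensor:
    work_name: str      # e.g. "Katie's Plants"
    license: str        # e.g. "CC-BY"
    licensor_name: str  # e.g. "Katie Smith"
    url: str            # link back to the original data


def attribution_footer(licensors):
    """Build the credit line shown at the bottom of a crop page."""
    if not licensors:
        return ""
    credits = "; ".join(
        f"{item.work_name} ({item.url}), {item.license} {item.licensor_name}"
        for item in licensors
    )
    return f"Some information on this page came from: {credits}"


page_licensors = [
    Licensor("Katie's Plants", "CC-BY", "Katie Smith", "https://katiesplants.example"),
    Licensor("SuperPlantDB", "CC-BY-SA", "SuperPlants Inc", "https://superplantdb.example"),
]
footer = attribution_footer(page_licensors)
```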

The license chaining problem

Now imagine that the data on those sites came from other sites. For instance, let’s say Katie’s Plants previously did an import from Freebase.com, and SuperPlantDB did one from Wikipedia. We not only need to credit Katie’s Plants and SuperPlantDB but also those places.

Some questions to consider:

  • Are those second-degree sources, their licenses, licensors, etc available via the API? When Freebase imported images from Wikimedia Commons, we encountered this problem, because the license metadata had to be scraped from inconsistently-formatted HTML. Getting this wrong leads to complaints from licensors whose licenses we’re violating.
  • Do we know what part of the data on Katie’s Plants was sourced from Freebase? Maybe it was the international names, but we’re importing medicinal uses and not touching that part of her data. Does Katie’s license notice express this? Probably not — there’s no requirement in the CC licenses for the attribution to be at the field level, and our own attribution notices definitely don’t operate at this level of detail. Because we don’t have the detail, this means we end up with attribution inflation: pretty soon, every page on Growstuff has a hundred attributions at the bottom of every page.
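
To see how quickly the list grows, here is a sketch that walks the chain of sources. It assumes we somehow already know who imported from whom (the “upstream” dictionary), which is exactly the information that is usually hard to get.

```python
def chained_attributions(source, upstream):
    """Return source plus everything it imported from, directly or indirectly.

    upstream maps a source name to the list of sources it imported from.
    Already-visited sources are skipped, so circular imports don't loop forever.
    """
    seen = []
    stack = [source]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(upstream.get(current, []))
    return seen


upstream = {
    "Katie's Plants": ["Freebase"],
    "SuperPlantDB": ["Wikipedia"],
}
credits_for_katie = chained_attributions("Katie's Plants", upstream)
```

Without field-level detail, every page that touches Katie’s data has to carry the whole list, whether or not the Freebase-derived fields were imported.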

Sure, we could just choose not to chain licenses, or to do it in some restricted way… but the moral high road here is to respect everyone’s license and attribution, and besides, if you only attribute some contributors, where do you decide to draw the line?

The infectious NC clause problem

This is a subset of license chaining problems. Let’s say Growstuff (a commercial entity using CC-BY-SA) imports from Katie’s Plants (a non-profit entity using CC-BY), which in turn imports from Hippie Herbs (a non-profit entity using CC-BY-NC; note the “non-commercial” clause).

Katie’s fine — she imports from Hippie Herbs’ data with impunity because she’s non-profit. She attributes them on her site, and Hippie Herbs is happy. She doesn’t have to use the same license as them because they don’t have a “SA” (Share Alike) clause.

Now Growstuff comes along and wants to import data from Katie’s Plants. Katie’s Plants is CC-BY which is compatible with Growstuff… but what about the data that originally came from Hippie Herbs? We’re commercial, so we’re not meant to use it.

But how do we tell what’s what? Katie probably doesn’t attribute HH at the level of individual bits of data, so we can’t extract just the ok-for-commercial-use bits.

Basically, if you believe in license chaining (and as I said, it’s definitely the moral high road to take, so I think we should) then you have to be constantly vigilant for the taint of NC-licensed data anywhere in the sprawling tree of ancestors to your data.
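
In code, “constant vigilance” might look something like the sketch below. It assumes we have a map of who imported from whom and a license string for each source, neither of which is usually available in practice.

```python
def nc_tainted_by(source, upstream, licenses):
    """Return every source in the ancestry (including source itself)
    whose license carries an NC clause.

    upstream maps a source to the sources it imported from; licenses maps
    a source to a license string such as "CC-BY-NC".
    """
    tainted, seen, stack = [], set(), [source]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if "NC" in licenses.get(current, "").split("-"):
            tainted.append(current)
        stack.extend(upstream.get(current, []))
    return tainted


upstream = {"Katie's Plants": ["Hippie Herbs"]}
licenses = {"Katie's Plants": "CC-BY", "Hippie Herbs": "CC-BY-NC"}
problems = nc_tainted_by("Katie's Plants", upstream, licenses)
```

Note that this sketch treats a source with an unknown license as clean, which is the optimistic choice; a cautious importer would flag it for review instead.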

What if we already have some data? (the merge case)

The simple case described above is fine for a green-field import with no existing data. But let’s say we’re importing data into an area where we already have some contributions from Growstuff members.

  1. map the fields as before
  2. for each piece of data imported, compare with Growstuff
    1. is Growstuff’s field empty? IMPORT!
    2. are the two the same? no-op!
    3. do they differ?
  3. If they differ, do we trust our own community or the import source? Or do we need to adjudicate?
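
The comparison in step 2 can be sketched as follows, assuming both sides have already been mapped to the same field names. The “skip” outcome for fields the import leaves blank is an addition of this sketch, not one of the steps above.

```python
def merge_action(ours, theirs):
    """Decide what to do with one imported field value."""
    if theirs in (None, ""):
        return "skip"        # nothing to import (assumption, not in the list above)
    if ours in (None, ""):
        return "import"      # our field is empty
    if ours == theirs:
        return "no-op"       # both sides agree
    return "adjudicate"      # they differ, so someone has to decide


def plan_merge(our_record, their_record):
    """Map each imported field to the action it needs."""
    return {
        field: merge_action(our_record.get(field), value)
        for field, value in their_record.items()
    }
```

The code is the cheap part; everything labelled “adjudicate” still needs a person or a policy behind it.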

Let’s say we decide to adjudicate. We now need to build an app to let people vote on which one is “correct” — probably best of three or something like that. Freebase did this (multiple times) and I was involved in some of it. We called them “data games” and had leaderboards for who’d voted the most. We couldn’t get enough throughput, though, and sometimes by the time something had been adjudicated, another community member had edited the field on our site, thus invalidating the whole thing. We ended up paying people in developing countries to churn through these votes for us (we used ODesk, but you could use Amazon’s Mechanical Turk or whatever). However, they needed training, and weren’t cheap — even after all the work of setting up the voting queue, there was still considerable expense.

Do we let people edit the data after import?

This came up quite often with Freebase because sometimes they would import from “authoritative sources” who licensed their work specially to Freebase but didn’t generally have a CC license or an open community editing process. For instance, the time when I was talking to some people from the BBC, and one (an older dude) said, “If we gave you our programme data, we wouldn’t want anyone to edit it because we are the experts on our programmes.” This was pretty silly of course (another, younger BBC dude immediately turned to him and said “Ha ha ha, I’ve got two words for you: Doctor Who.”), but sadly these situations are common when you’re dealing with closed/non-community-based/“authoritative” data sources who don’t understand the power of crowdsourcing information.

But even when dealing with compatibly CC-licensed sources with open developer communities, there can still be some problems around the “authority” of the data and how it’s attributed.

Take the case where the Katie’s Plants community has spent heaps of time editing their data and is very proud of it. We import it to Growstuff, then our community looks at it, decides that bits of it are wrong, and changes them.

Do we leave the license link to Katie’s Plants intact? Most likely yes, because our data has theirs in its DNA, so to speak. But what if we essentially deleted all the data from there? This might happen if, for example, we’d imported a picture from Wikimedia Commons then found that the picture was incorrect or inappropriate, so we blew it away. Now we should probably remove the license note. But how do you tell when data has been completely removed as opposed to modified or built upon?

In the Katie’s Plants example, what if Katie’s high quality medicinal plant information gets mixed up with ($DEITY forbid!) low-quality data from less experienced Growstuff members or from yet another import? What implications does this have for Katie’s site and their reputation? Under the license we’re allowed to mess it up because there is no “No Derivatives” (ND) clause, but socially/culturally they’ll be pretty unhappy if we do, and we can expect some backlash.

Repeat imports

Great news! Katie got a government grant and some fantastic press coverage, and her database has expanded enormously. We want to re-run the import. But now consider this case:

  • Katie’s plants, original: “Tomato – red”
  • Growstuff, original: “Tomato – red, yellow, green, black”

When we first imported, we put it to adjudication and found that Growstuff’s data was better, so we went with that.

Now we re-import, and Katie’s data has changed:

  • Katie’s plants, changed: “Tomato – red, yellow, green, striped”

So of course we put it through adjudication again. The correct answer is probably a union of the two sets.

Now, Katie’s database is growing fast, and so is Growstuff. We want to do a regular import from there — perhaps monthly. But somehow along the way, we’ve ended up with different ideas of tomato colour. Every month, their data is different to ours, and we have to keep re-adjudicating the same question: what colour/s are tomatoes? Boring. Our community is tired of playing the voting game, and/or it’s costing us money with our Mechanical Turk people.

So we decide to implement a check: if nothing’s changed on either side since the last adjudication, leave it. But now we have to implement change tracking, not just on Growstuff, but on Katie’s Plants as well. We need to keep a history of changes for every site we import. This is in addition to the infrastructure we’ve had to build to automatically run imports at regular time intervals.
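
The check itself can be small. The sketch below fingerprints both sides’ values and compares them with the pair stored at the last adjudication; the function names and the idea of storing fingerprints rather than full histories are assumptions of this sketch.

```python
import hashlib
import json


def fingerprint(value):
    """Stable hash of a JSON-serialisable value."""
    canonical = json.dumps(value, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_adjudication(ours, theirs, last_seen):
    """Decide whether a field has to go back to the voting queue.

    last_seen is the (ours, theirs) fingerprint pair recorded when the field
    was last adjudicated, or None if it never has been.
    """
    if ours == theirs:
        return False
    return (fingerprint(ours), fingerprint(theirs)) != last_seen


ours = ["black", "green", "red", "yellow"]
theirs = ["red"]
stored = (fingerprint(ours), fingerprint(theirs))  # saved after the first vote
```

Storing a fingerprint per field per source is the easy bit. The expensive parts are keeping those records for every site we import from and running the imports on a schedule.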

How do we make our data available in return?

Obviously we have an API for people to access our data under CC-BY-SA. But keep in mind the license-chaining effect: if anyone uses data from Growstuff, they will also be constrained by the licenses of all the data sources we import. We will need to make that license information available in the API alongside our data, and make sure all our API docs and related materials explain the necessity of license chaining.

Take a look at Freebase’s Attribution Policy. They use CC-BY, but because of attribution chaining, they can’t just say that: they need a whole page with a wizard to help people figure out how to attribute something on the site. It’s incomplete, too: Freebase decided that they would only require license chaining for “content” as opposed to “facts” (a complicated issue in itself), which means images and Wikipedia-based descriptions. They don’t require chained license information for other data sources. This is dubious in terms of the legality and culture of how Creative Commons works. There are no really firm guidelines on this, but in my opinion the most moral/ethical stance is to always chain your attributions, and Freebase has chosen otherwise. In the past, this has caused some concern from the owners of other data sources that were imported to Freebase. Even Wikipedians have complained that Freebase doesn’t enforce their Wikipedia attributions strongly enough. This sort of thing can lead to reputation problems, if not legal ones.

Just the facts, ma’am

One final complication. Various courts have ruled that “facts” aren’t copyrightable. For instance, the fact that the crop “Corn” has the scientific name “Zea mays” can’t be copyrighted. Even if you have thousands of these facts all together, they can’t be copyrighted, because they’re not a “creative work”. They’re just a statement of fact.

This actually throws the whole idea of CC-licensing collections of data into doubt. And yet we have nothing better, so we do it anyway.

Some data projects have come up with various justifications for this. For instance, Freebase says that the arrangement of the facts is a creative work — that what’s CC-licensed is their schema. That’s pretty creative in itself! The thing is, none of this has really been tested. And so most open data projects have some kind of Terms of Service which explains what they think the CC-license is for and how it’s meant to be used. These generally say, “By accessing our data via our website or API, you agree to behave as if this CC license applied to it (even if there’s not a very strong legal basis for that outside this TOS).”

The original idea of CC licenses was to stop people having to write their own terms and conditions of use for their work, and standardise in such a way that people could easily re-use creative content. Yet for data projects, we end up having to make up our own TOS just to apply a CC license, and we’re back where we started — having to peer at a bunch of legalese and figure out what the hell it means.

Of course once you get into the complexities of license chaining described above, you now also have TOS chaining — if Growstuff uses Katie’s data under their TOS, and Katie uses Hippie Herbs’ under their TOS, is Growstuff now subject to Hippie Herbs’ TOS? No idea! I am not a lawyer! I don’t want to be one! I just want to make a website about growing food!

Conclusion

Importing data is hard! That doesn’t mean we shouldn’t do it, but we should go into it with an awareness of the potential potholes, and carefully weigh up whether importing something is the best choice for us at any given time.

Final note

Katie’s Plants, Hippie Herbs, and SuperPlantDB are all made-up examples. Any resemblance to actual open data projects is coincidental. Freebase, Wikimedia Commons, and the BBC are real, though.

Mirrored from Infotropism. You can comment there or here.

EDITED TO ADD: Please see this followup post, and subscribe to the saveaussiemusic mailing list if you’re interested in this project.

So I’ve been thinking about this project for a while, and it doesn’t have a name, but I wanted to tell you about it anyway. At least I have my startup-style it’s X for Y pitch: it’s like textfiles.com for Australian indie music.

Tweet by mendel: @Skud "Like what for Australian indie music?" "Like the Web Archive of BBS era text files, for Australian indie music." "The web what?" :D

Yeah, well, let me explain.

For background, I’d better start by saying I was pretty terminally uncool, music-wise, in the 80s and early 90s. My family weren’t big on following popular music, I lived somewhere with no decent record stores, records were priced out of my range, and even at school the kids I hung with weren’t hip enough to make mix tapes of anything much but Top 40 stuff. Despite this, I somehow got exposed to a certain amount of Australian indie and alternative music. I say “somehow” because I honestly don’t know where I heard most of this stuff. I guess 3XY and EON-FM, early on. Later, I listened to a lot of Triple J, and watched Rage.

These days, of course, I get most of my musical knowledge and exposure from the Interwebs, and the availability of digital downloads and information about musicians is really helping me backfill a lot of the older Australian music I wish I’d known better at the time.

Like, for example, The Go-Betweens, a Brisbane indie band that I was only faintly aware of until a few years ago, when Grant McLennan died and many of my friends online were expressing sadness at his passing. Of course I quickly figured out that they were part of the soundtrack of my childhood and teens, I just didn’t know them.

The Go-Betweens were pretty well known, and it’s not hard to find their albums, but a lot of equally important Australian music from the 70s to 90s is no longer readily obtainable. Much of it’s not available for (legal) digital download. In many cases CDs are out of print, or there may never have been a CD release, and the only version is vinyl mouldering in someone’s garage. Even information about older Australian music is hard to find: now-defunct labels and publications don’t have websites, and bands that would otherwise pass Wikipedia’s notability guidelines often don’t have articles because it’s so hard to find sources/citations. Only a handful of hobbyist websites and generous-hearted bloggers are keeping vast swathes of our musical heritage alive.

So why did this happen? Well, obscure music is always hard to find. That’s what makes it obscure. But in Australia even a bunch of pretty well known stuff, stuff I grew up on in my no-hipster-cred-whatsoever suburban youth, is rare as hen’s teeth now. For some reason, music that was released on the Mushroom and Festival labels was particularly likely to have this problem. So I asked around, and learnt that those labels, which had released some of the best music of my adolescence, had been consumed first by News Corp and then by Warner, who didn’t care enough to keep the back-catalogs available. I don’t even know how many smaller labels were caught up in this, but I’m guessing plenty.

(The good news is that this seems to be clearing up a little now. More stuff seems to be available in iTunes since last time I checked, and I hear that Warner recently sold back Flying Nun Records (NZ) to the original owners. So there is hope.)

So here’s what I want to do. I’d like to start a project for people — techies, music nerds, archivists, whoever — to come together and work on projects to preserve and disseminate (information about) Australian music, in as free and open a manner as possible: open source code, creative commons licenses, non-commercial and optimised for maximum sharing and reuse.

First project (something I’ve been meaning to do anyway) is to extract pertinent facts about artists, albums, and labels from a variety of online sources (such as, for example, the archived website of The Encyclopedia of Australian Rock and Pop) and use it to update MusicBrainz (and from there, hundreds of sites and apps that use MusicBrainz’s data).

Then I’d like to make sure that any Australian musical acts that are sufficiently notable have Wikipedia entries. In many cases this will mean grovelling through pre-Internet dead trees publications, but I’m going to be in Australia and probably unemployed through the summer and I hear that libraries have air conditioning and Internet access these days, so that actually sounds quite pleasant. Along the way, I hope to make a resource list for other Australians who’d like to do the same thing: which libraries have useful collections of music periodicals? Who’s got zines or clippings they’ll scan if you contact them? What online archives already exist for you to trawl through? That sort of thing.

Those two projects are pretty simple, but they’re important because free, open-licensed online resources will be the foundation for later projects. I don’t even know what these later projects are, yet; I just know that having the information out there will make them easier.

So, I’ll take a shot at MusicBrainz and Wikipedia regardless of whether anyone else is interested. I suspect that lots of people are interested, though, and that with a sufficient number and variety of participants there are a lot of other, more ambitious things we could try.

So I’m looking for coders, open data nerds, Wikimedians, librarians and archivists, scholars, music journalists, zinesters, fans, broadcasters, copyright law experts, free culture advocates, and past and present musicians, producers, promoters, and label folks who might be interested in this project. I’m planning to set up a mailing list and/or website for it, so leave a comment below with your email address (which will be hidden, not shown to the public) and I’ll let you know when there’s something to join.

Also, still looking for a name. Ideas welcome.


Image credit: the image used on the front page of infotrope.net to link to this post is a collage of clips from Party Fears, a Perth music zine from the 80s-90s now archived online by its creator, David Gerard.
