Data is often dubbed the new gold, but no label could be more wrong. It makes more sense to think about data as diamonds: highly valuable, but before they are of any use, they need intensive polishing. OpenRefine, the latest incarnation of Google Refine, is specifically designed to help you with this job. Until recently, getting started with OpenRefine was rather hard because the amount of functionality can overwhelm you. This prompted Max De Wilde and me to write a book that will turn you into an OpenRefine expert.
When touring with the Free Your Metadata team, we learned there was an immense need for a tool that helps people deal with datasets: something as easy as an Excel spreadsheet, yet as powerful as an Access database. Few people knew such a tool exists: OpenRefine (or Google Refine as it was called back then). We saw many jaws drop as we demonstrated live how OpenRefine handles common data tasks that used to require either a lot of manual work or an IT expert.
OpenRefine is all about large data. (I intentionally don’t say “big”, because nobody is actually sure what that means.) With large datasets, I’m thinking of collection items from a museum, a list of all Olympic medal winners, your 25,000-song music collection, and so on: things that are too large to manage manually. The first feature of OpenRefine is that it allows you to view and analyze your data in flexible ways through facets and filters. Instead of having to inspect everything row by row, OpenRefine shows you the structure in your data.
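To make the idea of a facet concrete, here is a minimal Python sketch of what a text facet and its matching filter roughly amount to. This is not OpenRefine's own code: the `text_facet` and `filter_rows` helpers and the sample rows are invented purely for illustration.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value occurs in one column."""
    return Counter(row[column] for row in rows)


def filter_rows(rows, column, value):
    """Keep only the rows whose column equals the selected facet value."""
    return [row for row in rows if row[column] == value]


# Made-up sample rows, not real medal data.
rows = [
    {"athlete": "Athlete A", "medal": "Gold"},
    {"athlete": "Athlete B", "medal": "GOLD"},
    {"athlete": "Athlete C", "medal": "G"},
    {"athlete": "Athlete D", "medal": "Silver"},
    {"athlete": "Athlete E", "medal": "Gold"},
]

facet = text_facet(rows, "medal")
for value, count in facet.most_common():
    print(f"{value}: {count}")

# Selecting one facet value narrows the view to the matching rows.
suspicious = filter_rows(rows, "medal", "G")
```

A summary like this shows at a glance that the medal column holds several spellings of the same award, which is the kind of structure a facet surfaces without a row-by-row inspection.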
Next, it lets you detect and clean up those places that are slightly inconsistent. For instance, different winners’ medals might be listed as “G”, “Gold” or “GOLD”, even though all of them identify the same award. Such mistakes can be repaired easily with the cluster functionality. As OpenRefine includes various clustering methods, even variants like “Mike Phelps” and “Phelps, Michael” can be corrected automatically into “Michael Phelps”.
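As a rough illustration of how one family of clustering methods can work, the following Python sketch groups values by a normalized key, loosely modeled on the "fingerprint" key-collision idea. The `fingerprint` and `cluster` functions are my own simplification, not OpenRefine's implementation, and the exact normalization steps are an assumption.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: lowercase, ASCII-only, no punctuation, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a key; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


values = [
    "Michael Phelps",
    "Phelps, Michael",
    "michael  phelps",
    "Gold",
    "GOLD",
    "G",
    "Mike Phelps",
]

clusters = cluster(values)
for group in clusters:
    print(group)
```

In this simplified version, "Phelps, Michael" and "Gold"/"GOLD" collapse into clusters, while "G" and "Mike Phelps" do not share a key with anything. Variants like those would need a different method (for example a distance-based one) or a manual merge.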
Thanks to OpenRefine’s extensibility, you can turn your data into Linked Data. The athletes in your list can be transformed into links to their Freebase or Wikipedia pages. This connects your dataset to others, so people (and software) can find related data with a simple click. The good thing is: you don’t have to be an expert. OpenRefine does most of the hard work for you.
OpenRefine is and will always be free and open source, so you can download it at no cost.
In mid-September 2013, version 2.6 will be released with tons of improvements. It’s also the first version with the “OpenRefine” label, as the previous version was still called “Google Refine”. On the same day, Using OpenRefine will be published, but you can already pre-order it now. Covered in this book:
- importing data in various formats
- exploring datasets in a matter of seconds
- applying basic and advanced cell transformations
- dealing with multi-valued cells
- creating instantaneous links between datasets
- filtering and partitioning your data easily with regular expressions
- using named-entity extraction on full-text fields to automatically identify topics
- performing advanced data operations with GREL
David Huynh, the original creator of OpenRefine, has written the foreword to this book, telling the story of how OpenRefine came to be.
And the cool thing is: you don’t have to bring your own dataset. Throughout the entire book, you can follow along with the Powerhouse Museum collection, an exciting example that will teach you all the tricks you always wanted to master.