Cleaning Data with OpenRefine

Seth van Hooland, Ruben Verborgh, and Max De Wilde

Don’t take your data at face value. That is the key message of this tutorial, which focuses on how scholars can diagnose and act upon the accuracy of data. In this lesson, you will learn the principles and practice of data cleaning, as well as how OpenRefine can be used to perform four essential tasks that will help you to clean your data: 1. remove duplicate records; 2. separate multiple values contained in the same field; 3. analyse the distribution of values throughout a data set; 4. group together different representations of the same reality. These steps are illustrated with the help of a series of exercises based on a collection of metadata from the Powerhouse Museum, demonstrating how (semi-)automated methods can help you correct the errors in your data.

Published in 2013 in The Programming Historian.


Seth van Hooland, Ruben Verborgh, and Max De Wilde. 2013. Cleaning Data with OpenRefine. In The Programming Historian, Adam Crymble, Patrick Burns and Nora McGregor (eds.).
