Overview of openrefine, a natural language processing tool used in DX and other applications

Machine Learning Artificial Intelligence Semantic Web Search Technology Web Technology DataBase Technology Ontology Technology Workflow & Services Digital Transformation UI and Data Visualization Natural Language Processing Noise Removal and Missing Value Interpolation Algorithm Navigation of this blog

Openrefine: Data cleansing tool

In general, when performing machine learning or statistical processing, if the data is corrupted, inaccurate, or irrelevant to the purpose of the processing (so-called garbage data), the results will be inaccurate and the purpose will not be achieved. In order to prevent this from happening, a method called data cleansing is used to process the data accurately and cleanly. This data cleansing process is necessary for the pre-processing of machine learning and the post-processing of natural language processing data that we have done previously.

In the case of string-based data such as natural language processing, this data cleansing process is a normalization process to find and normalize duplicates, misspellings, etc., which represent the same meaning but have slightly different notations and are not considered to be the same. Common examples include the difference between double-byte characters and half-byte characters, the presence or absence of spaces and punctuation marks, errors in variants of people’s names, the division and merging of family names, the notation of corporate names (e.g., the difference between a stock company and a corporation), and the notation of addresses and telephone numbers. In this way, notation rules can be set for each of them, and corrections and deletions can be made.

One of the tools for these data cleansing is “openrefine”. This is an open source tool that was owned by google under the name googlerefine, which was transferred to an open source project in 2012 and renamed OpenRefine, However, (1) the input and output data varies widely, including TSV, CSV, EXcel, HTML, XML, JSON, etc. (2) Since Clojure and other codes can be used in internal processing, data conversion can be performed more flexibly. (3) It has features such as clustering based on the edit distance of strings. You can download the appropriate tool for your OS from the download page (v3.4.1 as of February 2021). for mac, download openrefine-mac-3.4.1.dmg and move the application from the generated disk image. When the installation is complete, a diamond-shaped app icon will be generated. Clicking on it will automatically launch the browser and start the software. (If the browser does not start up, it will start up on localhost:3333)

To input a file, select a file from the “Select File” button and press “NEXT” to analyze the input data and get a preview screen of the data as shown below.

If the data processing is correct here (data delimitation, head data), pressing the “Create Project” button will take you to the final processing screen as shown below.

For detailed instructions, please refer to “Using OpenRefine” published by Packt. The table of contents includes 1. Diving Into OpenRefine, 2. Analyzing and Fixing Data, 3. You don’t need to read those books. Even if you don’t read those books, the cleansing process itself is not complicated, so you can see how to use it by touching various buttons. Some examples are shown below.

One useful use that Excel does not have is the “facet” screen that is processed and displayed by “column”, “Facet”, and “Text facet”. This not only merges the duplicated characters in the text and displays them side by side, but also collects and displays strings that are close to each other by pressing the “Cluster” button and specifying the method of editing distance. Since these functions can be used not only for strings but also for numbers, etc., you can use this function to perform data cleansing efficiently.

You can also use “column”, “edit cells”, and “transform” to flexibly process individual cells in python or clojure, making it easy to convert data types (numeric, date, time, etc.) and normalize them.

It is capable of various data processing other than those introduced above, and is a tool that allows you to easily perform cleansing processes for machine learning and natural language processing without bothering to write code.