How to use OpenNLP, an open source toolkit for natural language processing

OpenNLP


2023.03.13 2021.04.29



Apache OpenNLP is an open source toolkit maintained under the Apache Software Foundation. It is a set of supervised learning tools for natural language processing and covers almost all the basic tasks: Language Detector, Sentence Detector, Tokenizer, Name Finder, Document Categorizer, Part-of-Speech Tagger, Lemmatizer, Chunker, Parser, and others.

Older versions did not support Japanese, but Japanese is officially supported as of version 1.9.0.

Language Detector automatically identifies which language a natural language text is written in. It covers 103 languages, including Japanese, English, German, French, Russian, Arabic, Chinese, and Korean.

Sentence Detector splits a text into sentences. It can be trained with the sentence detector training tool.

Tokenizer splits text into tokens such as words, punctuation marks, and numbers, and it can be trained in the same way as the Sentence Detector. A similar tool in the Apache ecosystem is kuromoji, a morphological analysis tool that is also used as a plugin for Lucene.
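As a rough illustration of what "splitting into words, punctuation marks, and numbers" means, here is a minimal sketch in plain Java. It is not OpenNLP's tokenizer and uses no OpenNLP classes; the class name ToyTokenizer and the regular expression are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy rule-based tokenizer (hypothetical, NOT the OpenNLP implementation).
public class ToyTokenizer {

  // A token is a run of letters, a run of digits,
  // or a single character that is neither letter, digit, nor whitespace.
  private static final Pattern TOKEN =
      Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      tokens.add(m.group());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Words, numbers, and punctuation come out as separate tokens.
    System.out.println(tokenize("Mr. Smith paid 42 dollars."));
  }
}
```

A rule like this works only when words are already separated by spaces. Unsegmented Japanese text would come out as one long run of letters, which is why a trained tokenizer or a morphological analyzer such as kuromoji is needed there.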

Name Finder, also known as a proper name extractor, extracts proper nouns such as the names of people, places, and organizations from natural language text, together with their attributes (proper noun types).

Example of proper name extraction

The proper noun types can be chosen to suit the application, from common ones such as names of people to names of diseases, cuisines, and events, but you must prepare the training corpus for them yourself.
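To make the corpus requirement more concrete: OpenNLP's name finder training data is written one tokenized sentence per line, with entities marked inline by <START:type> and <END> tags. The sketch below is plain Java with no OpenNLP dependency and only shows how such a line maps to tokens and typed token ranges. The class NerAnnotationSketch, its methods, and the "default" fallback type are hypothetical, and the annotated line is a hand-made example based on the sample sentence used later in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for one annotated training line (illustration only).
public class NerAnnotationSketch {

  // One entity: tokens [start, end) with a type, end exclusive.
  static final class Entity {
    final int start;
    final int end;
    final String type;

    Entity(int start, int end, String type) {
      this.start = start;
      this.end = end;
      this.type = type;
    }

    @Override
    public String toString() {
      return type + "[" + start + "," + end + ")";
    }
  }

  private static boolean isStartTag(String t) {
    return t.startsWith("<START") && t.endsWith(">");
  }

  // The plain tokens of the line, with the annotation tags removed.
  static List<String> tokens(String line) {
    List<String> out = new ArrayList<>();
    for (String t : line.trim().split("\\s+")) {
      if (!isStartTag(t) && !t.equals("<END>")) {
        out.add(t);
      }
    }
    return out;
  }

  // The annotated entities, expressed as token index ranges.
  static List<Entity> entities(String line) {
    List<Entity> out = new ArrayList<>();
    int index = 0;   // index of the next plain token
    int start = -1;  // start of the currently open entity, or -1
    String type = null;
    for (String t : line.trim().split("\\s+")) {
      if (isStartTag(t)) {
        start = index;
        int colon = t.indexOf(':');
        // "default" is a placeholder chosen for this sketch when no type is given.
        type = colon >= 0 ? t.substring(colon + 1, t.length() - 1) : "default";
      } else if (t.equals("<END>")) {
        if (start >= 0) {
          out.add(new Entity(start, index, type));
        }
        start = -1;
      } else {
        index++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String line = "<START:organization> エンゼルス <END> ・ "
        + "<START:person> 大谷 翔 平 <END> 投手 が 戦列 復帰 。";
    System.out.println(tokens(line));
    System.out.println(entities(line));
  }
}
```

A custom type such as a disease or cuisine name would be annotated the same way, simply with a different label after START.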

Document Categorizer automatically assigns classification labels to documents written in natural language. For example, a website that accepts user-submitted documents can automatically label each post as "sports," "entertainment," "politics," or "economics."
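The categorizer also has to be trained on labeled examples. OpenNLP's document categorizer training data is written one document per line, with the category first and the text after it, separated by whitespace. The following sketch is plain Java without OpenNLP; the class DoccatLineSketch and its methods are hypothetical, and the sample lines are invented for illustration. It only shows how such lines split into label and text and how to check the label distribution before training.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for inspecting categorizer training lines (illustration only).
public class DoccatLineSketch {

  // Split "category text ..." into { category, text }.
  static String[] parse(String line) {
    String[] parts = line.trim().split("\\s+", 2);
    if (parts.length < 2) {
      throw new IllegalArgumentException("Expected '<category> <text>': " + line);
    }
    return parts;
  }

  // Count how many training documents each category has.
  static Map<String, Integer> countByCategory(List<String> lines) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      counts.merge(parse(line)[0], 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Invented sample data, not taken from any real corpus.
    List<String> lines = Arrays.asList(
        "sports The home team won the final game .",
        "politics The new bill was passed today .",
        "sports The pitcher returned from the injured list .");
    System.out.println(countByCategory(lines));
  }
}
```

Checking the counts per label is a cheap way to spot categories with too few examples before spending time on training.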

The simplest way to use OpenNLP is to download the OpenNLP 1.9.0 binary distribution, apache-opennlp-1.9.0-bin.tar.gz, from the following site:

Download - Apache OpenNLP


Once downloaded, extract the archive, move into the directory it creates, and run the bin/opennlp script from the command line, passing it the name of the tool to run and a model file.

Alternatively, you can call OpenNLP as a library from Java or another JVM language (Clojure, Scala, etc.).

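import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NamedEntitySample {

  public static void main(String[] args) throws Exception {
    // The input must already be tokenized; here the tokens are separated by spaces.
    String SRC = "故障 者 リスト 入り し て い た エンゼルス ・ 大谷 翔 平 投手 （ ２ ３ ） が 戦列 復帰 。";
    String[] sentence = SRC.split("\\s+");

    // 1. Load the name finder model (ja-ner.bin is assumed to be in the working directory).
    try (InputStream modelIn = new FileInputStream("ja-ner.bin")) {
      TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
      // 2. Create the name finder from the model.
      NameFinderME nameFinder = new NameFinderME(model);
      // 3. Find the names; each Span holds token indices (end exclusive) and a type.
      Span[] spans = nameFinder.find(sentence);
      for (Span span : spans) {
        System.out.printf("Span(%d,%d,%s)=\"%s\"%n", span.getStart(), span.getEnd(), span.getType(), str(sentence, span));
      }
    }
  }

  // Concatenate the tokens covered by the span.
  static String str(String[] sentence, Span span) {
    StringBuilder sb = new StringBuilder();
    for (int i = span.getStart(); i < span.getEnd(); i++) {
      sb.append(sentence[i]);
    }
    return sb.toString();
  }
}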
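The start and end values of a span are token positions, and the end is exclusive, which is why the str helper loops while i is less than span.getEnd(). The sketch below shows the same index arithmetic in plain Java without OpenNLP. The class SpanTextSketch and its cover method are hypothetical, and the index values are picked by hand for illustration, not produced by a model.

```java
import java.util.Arrays;

// Hypothetical helper showing how [start, end) token ranges map back to text.
public class SpanTextSketch {

  // Join tokens[start] .. tokens[end - 1] with the given separator.
  static String cover(String[] tokens, int start, int end, String separator) {
    return String.join(separator, Arrays.copyOfRange(tokens, start, end));
  }

  public static void main(String[] args) {
    String[] ja = "エンゼルス ・ 大谷 翔 平 投手".split("\\s+");
    // Japanese tokens are joined without a separator.
    System.out.println(cover(ja, 2, 5, ""));

    String[] en = "Pierre Vinken , 61 years old".split("\\s+");
    // For space-delimited languages, a space restores the original spelling.
    System.out.println(cover(en, 0, 2, " "));
  }
}
```

Concatenating without a separator, as str does, suits Japanese; for languages written with spaces, a separator argument like this avoids gluing the words of a name together.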

The proper name extraction and document classification functions can be combined with search technology.

Deux Ex Machina

Specialized in AI system design and decision-making architecture.
Focused on externalizing decision logic using Ontology, DSL, and Behavior Trees, and building multi-agent systems.