Natural Language Processing with Clojure


I would like to discuss natural language processing using Clojure.

First of all, if all of your processing is in English, you can use clojure-opennlp, a wrapper for Apache OpenNLP. Add [clojure-opennlp "0.5.0"] to the :dependencies part of the project.clj file in the folder generated by "lein new my-app" (any app name will do). Then, in the core.clj file in the src folder, add

(use 'clojure.pprint) 
(use 'opennlp.nlp)
(use 'opennlp.treebank)

to make the OpenNLP library available.

Next, download the model files from http://opennlp.sourceforge.net/models-1.5, copy them into a models folder, and load them in the code.

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

Now the functions are ready to use. For example, tokenization (splitting a sentence into words) with the tokenize function defined above looks like this:

(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
;; => ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"]

The other functions are used in the same way: "name-find" finds person names (named entity recognition, NER), "pos-tag" assigns parts of speech, and "chunker" groups words into phrases (chunks).

For Japanese, morphological analysis can be done with the JapaneseAnalyzer from Lucene, a search engine library. Here we use the Java library directly. As above, add [org.apache.lucene/lucene-analyzers-kuromoji "5.0.0"] to the :dependencies part of the project.clj file, and add the following to the ns declaration in the core.clj file

(:import
  (java.io StringReader)
  (org.apache.lucene.analysis.ja JapaneseAnalyzer JapaneseTokenizer)
  (org.apache.lucene.analysis.ja.tokenattributes PartOfSpeechAttribute)
  (org.apache.lucene.analysis.tokenattributes CharTermAttribute OffsetAttribute)
  (org.apache.lucene.analysis.util CharArraySet))

The following function then performs morphological analysis.

(defn morphological-analysis
  "Returns a sequence of [surface-form part-of-speech] pairs for src."
  [src]
  (let [analyzer (JapaneseAnalyzer. nil
                                    JapaneseTokenizer/DEFAULT_MODE
                                    CharArraySet/EMPTY_SET
                                    #{})
        rdr (StringReader. src)]
    (with-open [ts (.tokenStream analyzer "field" rdr)]
      (let [^OffsetAttribute offsetAtt (.addAttribute ts OffsetAttribute)
            ^PartOfSpeechAttribute posAtt (.addAttribute ts PartOfSpeechAttribute)
            _ (.reset ts)
            surface #(subs src (.startOffset offsetAtt) (.endOffset offsetAtt))
            pos #(.getPartOfSpeech posAtt)
            tokens (->> #(if (.incrementToken ts)
                           [(surface) (pos)]
                           nil)
                        repeatedly
                        (take-while identity)
                        doall)
            _ (.end ts)]
        tokens))))

Calling this function as (morphological-analysis "text sentence") returns the morphological analysis of the input sentence as a sequence of [surface form, part of speech] pairs.
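As a small illustration of working with that result, the sketch below keeps only the nouns. It assumes the part-of-speech strings begin with the top-level category (for example 名詞 for nouns). The sample pairs are hand-written for illustration and are not actual analyzer output, and the names sample-tokens, noun? and extract-nouns are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hand-written sample in the [surface part-of-speech] shape returned by
;; morphological-analysis; this is NOT actual analyzer output.
(def sample-tokens
  [["猫" "名詞-一般"]
   ["が" "助詞-格助詞-一般"]
   ["魚" "名詞-一般"]
   ["を" "助詞-格助詞-一般"]
   ["食べる" "動詞-自立"]])

(defn noun?
  "Truthy when the part-of-speech string starts with 名詞 (noun).
  Assumes the top-level category comes first; tolerates a nil tag."
  [[_ pos]]
  (some-> pos (str/starts-with? "名詞")))

(defn extract-nouns
  "Returns the surface forms of the noun tokens, in their original order."
  [tokens]
  (mapv first (filter noun? tokens)))

(extract-nouns sample-tokens)
;; => ["猫" "魚"]
```

The same pattern should work on the real return value of morphological-analysis, as long as the tags follow the assumed prefix convention.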

You can also use JUMAN/KNP to perform dependency parsing. There is a Clojure wrapper for JUMAN/KNP, but here, as a less conventional approach, we call the command-line tools from the program. First, install juman and knp with Homebrew (brew install juman, brew install knp), and use [me.raynes/conch "0.8.0"] as the library for running terminal commands. The code looks like the following.

(:require [me.raynes.conch :as sh])

(defn juman-parse [s]
  (sh/with-programs [juman]
    (juman {:in s})))

The words extracted in this way can then be used for machine learning, after removing unnecessary words (stop-word removal) and converting the remaining words into vectors (for example, one-hot vectors).
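The two preprocessing steps can be sketched in plain Clojure as follows. This is a minimal illustration, not part of any of the libraries above: the stop-word list is a tiny made-up example, and remove-stop-words, build-vocabulary and one-hot are hypothetical helper names.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stop-word list; a real one would be much longer.
(def stop-words #{"a" "to" "his" "on" "the"})

(defn remove-stop-words
  "Drops tokens whose lower-cased form is in stop-words."
  [tokens]
  (remove #(contains? stop-words (str/lower-case %)) tokens))

(defn build-vocabulary
  "Maps each distinct token to an index, in order of first appearance."
  [tokens]
  (zipmap (distinct tokens) (range)))

(defn one-hot
  "Returns a vector of vocabulary size with a 1 at the token's index,
  or all zeros when the token is not in the vocabulary."
  [vocab token]
  (let [zeros (vec (repeat (count vocab) 0))]
    (if-let [i (vocab token)]
      (assoc zeros i 1)
      zeros)))

(def tokens ["Mr." "Smith" "gave" "a" "car" "to" "his" "son" "on" "Friday"])
(def content-words (remove-stop-words tokens))
(def vocab (build-vocabulary content-words))

(one-hot vocab "car")
;; => [0 0 0 1 0 0]
```

One-hot vectors grow with the vocabulary size, so for larger corpora a sparse representation is usually preferable. The dense vector is used here only to keep the idea visible.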

In the next article, I would like to discuss machine learning techniques.
