Natural Language Processing with Clojure

I would like to discuss natural language processing using Clojure.

First, if all of your processing is in English, you can use clojure-opennlp, a wrapper for Apache OpenNLP. Add [clojure-opennlp "0.5.0"] to the :dependencies section of the project.clj file in the folder generated by "lein new my-app" (any app name will do). Then add the following to the core.clj file in the src folder

(use 'clojure.pprint)
(use 'opennlp.nlp)
(use 'opennlp.treebank)

to make the OpenNLP functions available. Next, download the model files, copy them into a models folder in the project, and load them in the code.

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

Now the functions are ready to use. For example, tokenization (splitting a sentence into words) looks like this:

(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
;; ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"]

The other functions work in the same way: "name-find" finds names through named entity recognition (NER), "pos-tag" assigns parts of speech, and "chunker" groups words into phrases.

For Japanese, morphological analysis can be done with the JapaneseAnalyzer from Lucene, the search engine library. Here we use the Java library directly. As above, add [org.apache.lucene/lucene-analyzers-kuromoji "5.0.0"] to the :dependencies section of the project.clj file, and add the following to the core.clj file

(import '(java.io StringReader)
        '(org.apache.lucene.analysis.ja JapaneseAnalyzer JapaneseTokenizer)
        '(org.apache.lucene.analysis.ja.tokenattributes PartOfSpeechAttribute)
        '(org.apache.lucene.analysis.tokenattributes CharTermAttribute OffsetAttribute)
        '(org.apache.lucene.analysis.util CharArraySet))

The following function performs morphological analysis.

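(defn morphological-analysis
  [src]
  ;; The arguments after nil were cut off in the original listing; the default
  ;; mode, an empty stop-word set, and no stop tags are assumed here.
  (let [analyzer (JapaneseAnalyzer. nil
                                    JapaneseTokenizer/DEFAULT_MODE
                                    CharArraySet/EMPTY_SET
                                    #{})
        rdr (StringReader. src)]
    (with-open [ts (.tokenStream analyzer "field" rdr)]
      (let [^OffsetAttribute offsetAtt (.addAttribute ts OffsetAttribute)
            ^PartOfSpeechAttribute posAtt (.addAttribute ts PartOfSpeechAttribute)
            _ (.reset ts)
            ;; surface form and part of speech of the current token
            surface #(subs src (.startOffset offsetAtt) (.endOffset offsetAtt))
            pos #(.getPartOfSpeech posAtt)
            ;; collect [surface pos] pairs until the stream is exhausted
            tokens (->> #(if (.incrementToken ts)
                           [(surface) (pos)]
                           nil)
                        repeatedly
                        (take-while identity)
                        doall)
            _ (.end ts)]
        tokens))))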

Calling this function as (morphological-analysis "text sentence") returns the result of the morphological analysis of the input sentence as a sequence of [surface part-of-speech] pairs.
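As a minimal sketch of how such a result might be used, the snippet below keeps only the tokens whose part-of-speech tag starts with a given prefix. The sample data and the helper name surfaces-with-pos are hypothetical, and the tags are illustrative rather than captured analyzer output.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample in the [surface part-of-speech] shape;
;; the tags are illustrative, not real analyzer output.
(def sample-tokens
  [["猫" "名詞-一般"]
   ["が" "助詞-格助詞-一般"]
   ["魚" "名詞-一般"]
   ["を" "助詞-格助詞-一般"]
   ["食べる" "動詞-自立"]])

(defn surfaces-with-pos
  "Returns the surface forms whose part-of-speech tag starts with pos-prefix."
  [pos-prefix tokens]
  (for [[surface pos] tokens
        :when (str/starts-with? pos pos-prefix)]
    surface))

;; Keep only the nouns.
(surfaces-with-pos "名詞" sample-tokens)
;; => ("猫" "魚")
```

Filtering by prefix rather than by exact tag keeps every sub-category of a part of speech in one pass.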

You can also use JUMAN/KNP to perform dependency parsing. There is a Clojure wrapper for JUMAN/KNP, but here, as a less conventional approach, we call the command-line tools from the program. First, install juman and knp with Homebrew (brew install juman, brew install knp), and use [me.raynes/conch "0.8.0"] as the library for running terminal commands. The code looks like the following.

(:require [me.raynes.conch :as sh])

(defn juman-parse [s]
  (sh/with-programs [juman]
    (juman {:in s})))
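The text returned by the command still has to be split into words. The following is a rough sketch that assumes the output is a single string with one space-separated morpheme per line (surface, reading, base form, part of speech, then further fields), ends with an EOS line, and marks alternative analyses with a leading "@". The sample string and the name parse-juman-output are hypothetical, so check them against the output of your own installation.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, abbreviated JUMAN-style output used only for illustration.
(def sample-output
  (str "猫 ねこ 猫 名詞 6 普通名詞 1 * 0 * 0 NIL\n"
       "が が が 助詞 9 格助詞 1 * 0 * 0 NIL\n"
       "EOS\n"))

(defn parse-juman-output
  "Turns each morpheme line into a map, skipping blank lines, the EOS marker,
  and lines starting with @ (assumed to be alternative analyses)."
  [out]
  (->> (str/split-lines out)
       (remove #(or (str/blank? %)
                    (= "EOS" %)
                    (str/starts-with? % "@")))
       (map #(let [[surface reading base pos] (str/split % #" ")]
               {:surface surface :reading reading :base base :pos pos}))))

(map :surface (parse-juman-output sample-output))
;; => ("猫" "が")
```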

The words extracted in this way can then be prepared for machine learning by removing unnecessary words (stop-word removal) and converting the remaining words to vectors (for example, one-hot vectors).
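These two steps can be sketched in plain Clojure as follows. The stop-word list and the function names (remove-stop-words, build-vocabulary, one-hot) are made up for this illustration; a real project would use a larger stop-word list suited to its data.

```clojure
;; Hypothetical stop-word list, for illustration only.
(def stop-words #{"a" "to" "his" "on"})

(defn remove-stop-words
  "Drops every token that appears in the stop-word set."
  [tokens]
  (remove stop-words tokens))

(defn build-vocabulary
  "Maps each distinct word to an index, in order of first appearance."
  [tokens]
  (zipmap (distinct tokens) (range)))

(defn one-hot
  "Returns a vector with 1 at the word's index and 0 elsewhere.
  A word missing from the vocabulary yields an all-zero vector."
  [vocab word]
  (let [idx (vocab word)]
    (mapv #(if (= % idx) 1 0) (range (count vocab)))))

(def tokens ["Mr." "Smith" "gave" "a" "car" "to" "his" "son" "on" "Friday"])

(def content-words (remove-stop-words tokens))
;; => ("Mr." "Smith" "gave" "car" "son" "Friday")

(def vocab (build-vocabulary content-words))

(one-hot vocab "car")
;; => [0 0 0 1 0 0]
```

One-hot vectors grow with the vocabulary size, so this representation suits small vocabularies.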

In the next article, I would like to discuss machine learning techniques.


