Sentence classification using liblinear and natural language processing


In a previous article, I discussed classification using k-means. This time, I would like to discuss sentence classification using liblinear combined with natural language processing.

liblinear is an open-source linear SVM (support vector machine) library developed at National Taiwan University. An SVM is a pattern recognition model based on supervised learning and can be applied to both classification and regression. liblinear applies this SVM to a variety of tasks and can perform fast linear classification even on data with millions of instances and features.

In this article, I used clj-liblinear, a Clojure wrapper for liblinear, in combination with the natural language processing tool kuromoji to perform sentence classification.

To use clj-liblinear, add the following to :dependencies in the project.clj file

[clj-liblinear "0.1.0"]

The sample code provided with clj-liblinear is as follows.

(use '[clj-liblinear.core :only [train predict]]
     '[clojure.string :only [split lower-case]])

(def facetweets [{:class 0 :text "grr i am so angry at my iphone"}
                 {:class 0 :text "this new movie is terrible"}
                 {:class 0 :text "disappointed that my maximum attention span is 10 seconds"}
                 {:class 0 :text "damn the weather sucks"}

                 {:class 1 :text "sitting in the park in the sun is awesome"}
                 {:class 1 :text "eating a burrito life is super good"}
                 {:class 1 :text "i love weather like this"}
                 {:class 1 :text "great new album from my favorite band"}])

(let [bags-of-words (map #(-> % :text (split #" ") set) facetweets)
      model         (train bags-of-words (map :class facetweets))]
  
  (map #(predict model (into #{} (split % #" ")))
       ["damn it all to hell!"
        "i love everyone"
        "my iphone is super awesome"
        "the weather is terrible this sucks"]))

;; => (0 1 1 0)

The facetweets definition is simply data that pairs each sentence with a class; bags of words and a model are built from this data, the model is trained, and the result is used to classify the input sentences. In the sample the input is in English, but Japanese sentences can be handled in exactly the same way by splitting them into tokens and joining the tokens with spaces.

Several kuromoji modules are available; here we used the Java module that is used in the text-analysis part of the open-source search engine Apache Lucene.

In :dependencies of project.clj, register the module from Maven as follows.

[org.apache.lucene/lucene-analyzers-kuromoji "5.0.0"] 
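
For reference, a minimal project.clj declaring both dependencies might look like the following sketch (the project name and Clojure version are placeholders, not taken from the original article).

;; Minimal project.clj sketch; the project name and versions are only examples
(defproject nlp-classify "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [clj-liblinear "0.1.0"]
                 [org.apache.lucene/lucene-analyzers-kuromoji "5.0.0"]])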

To use this in your Clojure code, specify it with :import as shown below.

(:import (java.io File FileInputStream InputStreamReader BufferedReader StringReader
            BufferedWriter OutputStreamWriter FileOutputStream)
           (org.apache.lucene.analysis.ja JapaneseAnalyzer JapaneseTokenizer)
           (org.apache.lucene.analysis.ja.tokenattributes PartOfSpeechAttribute)
           (org.apache.lucene.analysis.tokenattributes CharTermAttribute OffsetAttribute)
           (org.apache.lucene.analysis.util CharArraySet))
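
The later code also uses clojure.string through the alias str and clj-liblinear.core through the alias liblinear, so a complete ns declaration might look like the following sketch (the namespace name is a placeholder).

(ns nlp-classify.core
  ;; aliases assumed by the code below; the namespace name is only an example
  (:require [clojure.string :as str]
            [clj-liblinear.core :as liblinear])
  (:import (java.io File FileInputStream InputStreamReader BufferedReader StringReader
                    BufferedWriter OutputStreamWriter FileOutputStream)
           (org.apache.lucene.analysis.ja JapaneseAnalyzer JapaneseTokenizer)
           (org.apache.lucene.analysis.ja.tokenattributes PartOfSpeechAttribute)
           (org.apache.lucene.analysis.tokenattributes CharTermAttribute OffsetAttribute)
           (org.apache.lucene.analysis.util CharArraySet)))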

Here, not only the Lucene kuromoji classes but also the java.io classes are imported so that strings and streams can be handled. Next, define functions that break a sentence down into tokens and rejoin them with spaces inserted between them, as in the following code.

;; Morphological analysis function
(defn morphological-analysis
  [src]
  (let [analyzer (JapaneseAnalyzer. nil
                                    JapaneseTokenizer/DEFAULT_MODE
                                    CharArraySet/EMPTY_SET
                                    #{})
        rdr (StringReader. src)]
    (with-open [ts (.tokenStream analyzer "field" rdr)]
      (let [^OffsetAttribute       offsetAtt (.addAttribute ts OffsetAttribute)
            ^PartOfSpeechAttribute posAtt    (.addAttribute ts PartOfSpeechAttribute)
            _       (.reset ts)
            surface #(subs src (.startOffset offsetAtt) (.endOffset offsetAtt))
            pos     #(.getPartOfSpeech posAtt)
            tokens  (->> #(if (.incrementToken ts)
                            [(surface) (pos)]
                            nil)
                         repeatedly
                         (take-while identity)
                         doall)
            _       (.end ts)]
        tokens))))

;; Transform the morphological analysis result into (part-of-speech, word) pairs
(defn simple-morph [n] (map #(list (first (str/split (% 1) #"-")) (% 0)) (morphological-analysis n)))

;; Function to extract nouns and verbs (possible tags: 名詞/noun, 動詞/verb, 助詞/particle, 形容詞/adjective, 記号/symbol, 助動詞/auxiliary verb)
(defn tokutei-token [n]
  (keep
   #(cond
      (= "名詞" (first %)) (second %)
      (= "動詞" (first %)) (second %)
      :else nil)(simple-morph n)))
(defn tokutei-token2 [n] (str/join " " (tokutei-token n)))

;; Extract all words (no part-of-speech filtering)
(defn token-all [n](map #(second %)(simple-morph n)))
(defn token-all2 [n] (str/join " " (token-all n)))
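
As a rough usage sketch (the exact token boundaries and part-of-speech tags depend on the kuromoji dictionary and mode, so the output below is only illustrative):

;; Illustrative REPL session; the actual segmentation may differ slightly
;; (simple-morph "今日の天気は何ですか")
;; => (("名詞" "今日") ("助詞" "の") ("名詞" "天気") ("助詞" "は") ("名詞" "何") ("助動詞" "です") ("助詞" "か"))
;; (token-all2 "今日の天気は何ですか")
;; => "今日 の 天気 は 何 です か"
;; (tokutei-token2 "今日の天気は何ですか")
;; => "今日 天気 何"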

In addition to a version that simply uses all tokens, a version of the function that filters by part of speech has been defined. Combining these with the clj-liblinear functions gives the classification code below.

;; Training data
(def facetweets [{:class 0 :text (token-all2 "今日のタスクは何ですか?")}
                 {:class 0 :text (token-all2 "何の仕事がありますか?")}
                 {:class 0 :text (token-all2 "何をするべきですか?")}
                 {:class 0 :text (token-all2 "今日は何が残っていますか?")}

                 {:class 3 :text (token-all2 "おはようございます")}
                 {:class 3 :text (token-all2 "こんにちは")}
                 {:class 3 :text (token-all2 "今日の天気は何ですか?")}
                 {:class 3 :text (token-all2 "ご機嫌いかが?")}

                 {:class 4 :text (token-all2 "はい")}
                 {:class 4 :text (token-all2 "yes")}
                 {:class 4 :text (token-all2 "お願いします")}
                 {:class 4 :text (token-all2 "YES")}
                 {:class 4 :text (token-all2 "そうですね")}

                 ])
;; Classification function
(defn categorize [input]
  (let [bags-of-words (map #(-> % :text (str/split #" ") set) facetweets)
        model         (liblinear/train bags-of-words (map :class facetweets))]
    (map #(liblinear/predict model (into #{} (str/split % #" ")))
         [(token-all2 input)])))

;;(categorize "こんにちは")   ;=> (3.0)

If we use the database introduced in the previous article to store and retrieve the training data, we can build a machine learning tool that updates its data in real time.
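
As a minimal sketch of that idea, assuming the training data is simply held in an atom (the atom could be replaced by reads and writes against the datastore from the previous article), the model can be rebuilt whenever new examples are registered. The names below are illustrative, not part of the original code.

;; Minimal sketch: hold the training data in an atom so examples can be added at run time
(def training-data (atom facetweets))

;; Register a new labelled sentence
(defn add-example! [class text]
  (swap! training-data conj {:class class :text (token-all2 text)}))

;; Rebuild the model from the current training data
(defn build-model []
  (let [data @training-data
        bags (map #(-> % :text (str/split #" ") set) data)]
    (liblinear/train bags (map :class data))))

;; Classify an input sentence with a given model
(defn categorize-with [model input]
  (liblinear/predict model (into #{} (str/split (token-all2 input) #" "))))

With this, (categorize-with (build-model) "こんにちは") behaves like categorize above, but the model only needs to be rebuilt when the training data actually changes.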
