k-means and Clojure

Continuing from the previous article, I would like to discuss machine learning using Clojure. This time the topic is k-means, an unsupervised learning algorithm.

The library used is rm-hull/clustering.

Add the following entry to :dependencies.

[rm-hull/clustering "0.2.0"]

The official sample code is as follows.

(require '[clustering.core.k-means :as k-means])
(require '[clj-time.core :refer [after? date-time interval in-days]])
(require '[clj-time.format :refer [unparse formatters]])
(require '[clj-time.coerce :refer [to-long from-long]])

(def test-dataset
  (hash-set
    (date-time 2013 7 21)
    (date-time 2013 7 25)
    (date-time 2013 7 14)
    (date-time 2013 7 31)
    (date-time 2013 7 1)
    (date-time 2013 8 3)

    (date-time 2012 12 26)
    (date-time 2012 12 28)
    (date-time 2013 1 16)

    (date-time 2012 6 2)
    (date-time 2012 6 7)
    (date-time 2012 6 6)
    (date-time 2012 6 9)
    (date-time 2012 5 28)))

(defn distance [dt-a dt-b]
  (if (after? dt-a dt-b)
    (distance dt-b dt-a)
    (in-days (interval dt-a dt-b))))

(defn average [dates]
  (from-long
    (/ (reduce + (map to-long dates)) (count dates))))

(def means (k-means/init-means 3 test-dataset))

(def groups (k-means/cluster distance average test-dataset means 0))

(count groups)
; => 3

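;; fmt is not defined in the snippet above; a yyyy-MM-dd formatter
;; built from the imported unparse and formatters matches the output shown.
(def fmt (partial unparse (formatters :date)))

(map fmt (sort (groups 0)))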
; => ("2012-12-26" "2012-12-28" "2013-01-16")

(map fmt (sort (groups 1)))
; => ("2013-07-01" "2013-07-14" "2013-07-21" "2013-07-25" "2013-07-31" "2013-08-03")

(map fmt (sort (groups 2)))
; => ("2012-05-28" "2012-06-02" "2012-06-06" "2012-06-07" "2012-06-09")
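To make the roles of the distance and average arguments more concrete, here is a minimal, library-free sketch of the k-means idea on plain numbers. It is not the implementation used by the clustering library; all function names (abs-diff, mean, nearest-mean, k-means-step, k-means-sketch) are made up for illustration, and the sketch simply drops a mean if no point is assigned to it.

```clojure
;; Illustrative sketch only: not the clustering library's code.

(defn abs-diff
  "Distance between two numbers."
  [a b]
  (if (> a b) (- a b) (- b a)))

(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (count xs)))

(defn nearest-mean
  "Returns the element of `means` closest to `x` under `distance`."
  [distance means x]
  (apply min-key #(distance x %) means))

(defn k-means-step
  "One iteration: assign each point to its nearest mean, then average each group."
  [distance average points means]
  (->> points
       (group-by #(nearest-mean distance means %))
       vals
       (map average)))

(defn k-means-sketch
  "Repeats k-means-step until the means stop changing, then returns the groups."
  [distance average points initial-means]
  (loop [means (set initial-means)]
    (let [next-means (set (k-means-step distance average points means))]
      (if (= means next-means)
        (vals (group-by #(nearest-mean distance means %) points))
        (recur next-means)))))

(def groups-sketch
  (k-means-sketch abs-diff mean [1 2 3 10 11 12 50 51 52] [1 10 50]))
;; groups-sketch holds [1 2 3], [10 11 12] and [50 51 52] (group order is not guaranteed)
```

In the date example above, distance (days between two dates) and average (mean of the epoch milliseconds) play the same roles as abs-diff and mean do here.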

The sample clusters dates, but here we will try to apply the same approach to the results of natural language processing.

For morphological analysis, we will use Kuromoji, as in the previous article.

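;; Imports needed by the function below. CharArraySet is in
;; org.apache.lucene.analysis.util in Lucene 5.x/6.x and in
;; org.apache.lucene.analysis from 7.0 on; adjust to the version in use.
(import '(java.io StringReader)
        '(org.apache.lucene.analysis.ja JapaneseAnalyzer JapaneseTokenizer)
        '(org.apache.lucene.analysis.ja.tokenattributes PartOfSpeechAttribute)
        '(org.apache.lucene.analysis.tokenattributes OffsetAttribute)
        '(org.apache.lucene.analysis CharArraySet))

;; Morphological analysis function: returns a sequence of [surface part-of-speech] pairs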
(defn morphological-analysis
  [src]
  (let [analyzer (JapaneseAnalyzer. nil
                                    JapaneseTokenizer/DEFAULT_MODE
                                    CharArraySet/EMPTY_SET
                                    #{})
        rdr (StringReader. src)]
    (with-open [ts (.tokenStream analyzer "field" rdr)]
      (let [^OffsetAttribute       offsetAtt (.addAttribute ts OffsetAttribute)
            ^PartOfSpeechAttribute posAtt    (.addAttribute ts PartOfSpeechAttribute)
            _       (.reset ts)
            surface #(subs src (.startOffset offsetAtt) (.endOffset offsetAtt))
            pos     #(.getPartOfSpeech posAtt)
            tokens  (->> #(if (.incrementToken ts)
                            [(surface) (pos)]
                            nil)
                         repeatedly
                         (take-while identity)
                         doall)
            _       (.end ts)]
        tokens))))

This function decomposes the input sentence into tokens, each a [surface part-of-speech] pair. Clojure's hash-set function can then collect all the tokens into a set, the same form as the dataset in the sample above, and the data is ready for clustering.
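As a small sketch of that last step, the following collects token pairs into a set. The sample-tokens value is hypothetical stand-in data rather than real Kuromoji output (the part-of-speech labels are placeholders), and the names sample-tokens and token-dataset are made up for this example.

```clojure
;; Hypothetical result of morphological-analysis: [surface part-of-speech] pairs.
;; The part-of-speech labels are placeholders, not actual Kuromoji tags.
(def sample-tokens
  [["猫" "noun"] ["が" "particle"] ["好き" "noun"] ["猫" "noun"]])

;; hash-set takes its elements as separate arguments, so a sequence needs apply.
;; (set sample-tokens) builds the same set.
(def token-dataset (apply hash-set sample-tokens))

(count token-dataset) ; the duplicate pair collapses, leaving 3 distinct tokens
```

Note that the cluster call in the sample also takes a distance and an average function, so clustering token data would require definitions of both that make sense for tokens.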

For more information on k-means in Python, please refer to “Overview of k-means, its applications, and implementation examples”.
