Continuing from the previous article, I would like to discuss machine learning with Clojure. This time the topic is k-means, an unsupervised learning method.
The library used is rm-hull/clustering.
The :dependencies are as follows.
[rm-hull/clustering "0.2.0"]
The official sample code is as follows.
(require '[clustering.core.k-means :as k-means])
(require '[clj-time.core :refer [after? date-time interval in-days]])
(require '[clj-time.format :refer [unparse formatters]])
(require '[clj-time.coerce :refer [to-long from-long]])

;; The data to cluster: a set of date-time values.
(def test-dataset
  (hash-set
    (date-time 2013 7 21)
    (date-time 2013 7 25)
    (date-time 2013 7 14)
    (date-time 2013 7 31)
    (date-time 2013 7 1)
    (date-time 2013 8 3)
    (date-time 2012 12 26)
    (date-time 2012 12 28)
    (date-time 2013 1 16)
    (date-time 2012 6 2)
    (date-time 2012 6 7)
    (date-time 2012 6 6)
    (date-time 2012 6 9)
    (date-time 2012 5 28)))

;; Distance between two dates: the number of days between them,
;; regardless of argument order.
(defn distance [dt-a dt-b]
  (if (after? dt-a dt-b)
    (distance dt-b dt-a)
    (in-days (interval dt-a dt-b))))

;; Mean of a group of dates, computed on their epoch-millisecond values.
(defn average [dates]
  (from-long
    (/ (reduce + (map to-long dates)) (count dates))))

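(def means (k-means/init-means 3 test-dataset))
(def groups (k-means/cluster distance average test-dataset means 0))

;; fmt is used below but not defined in the listing; this definition is an
;; assumption based on the imported unparse and formatters (yyyy-MM-dd output).
(defn fmt [dt]
  (unparse (formatters :date) dt))
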
(count groups)
; => 3
(map fmt (sort (groups 0)))
; => ("2012-12-26" "2012-12-28" "2013-01-16")
(map fmt (sort (groups 1)))
; => ("2013-07-01" "2013-07-14" "2013-07-21" "2013-07-25" "2013-07-31" "2013-08-03")
(map fmt (sort (groups 2)))
; => ("2012-05-28" "2012-06-02" "2012-06-06" "2012-06-07" "2012-06-09")
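
To make the roles of the distance and average functions clearer, here is a minimal sketch of the k-means loop in plain Clojure. It is an illustration only and not the clustering library's implementation; the names nearest-mean, k-means-step, k-means-sketch, abs-distance and mean are made up for this example, and plain numbers stand in for the dates.

```clojure
;; Illustrative sketch only: NOT the implementation used by rm-hull/clustering.

(defn nearest-mean
  "Returns the element of `means` closest to `x` under `distance`."
  [distance means x]
  (apply min-key #(distance x %) means))

(defn k-means-step
  "One iteration: assign every point to its nearest mean, then replace
  each mean with the average of the points assigned to it.
  A mean that attracts no points is simply dropped in this sketch."
  [distance average means points]
  (->> points
       (group-by #(nearest-mean distance means %))
       vals
       (map average)))

(defn k-means-sketch
  "Repeats k-means-step until the means stop changing or max-iter is reached."
  [distance average init-means points max-iter]
  (loop [means init-means
         i     0]
    (let [new-means (k-means-step distance average means points)]
      (if (or (= (set new-means) (set means)) (>= i max-iter))
        new-means
        (recur new-means (inc i))))))

;; Hypothetical distance and average functions for plain numbers.
(defn abs-distance [a b] (Math/abs (double (- a b))))
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(def final-means
  (sort (k-means-sketch abs-distance mean
                        [0 5 20]
                        [1 2 3 10 11 12 30 31 32]
                        100)))
;; final-means => (2.0 11.0 31.0)
```

The date example above has the same shape: distance says how far apart two elements are, and average turns a group of elements back into a single representative element.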
The sample clusters date-time values, but here we will try combining the library with the results of natural language processing.
For morphological analysis, we will use kuromoji as before.
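;; Morphological analysis function
;; The imports below are assumed for Lucene's kuromoji analyzer. CharArraySet
;; lives in org.apache.lucene.analysis.util in Lucene 5-6 and in
;; org.apache.lucene.analysis from Lucene 7 on; adjust to your version.
(import '(java.io StringReader)
        '(org.apache.lucene.analysis.ja JapaneseAnalyzer JapaneseTokenizer)
        '(org.apache.lucene.analysis.ja.tokenattributes PartOfSpeechAttribute)
        '(org.apache.lucene.analysis.tokenattributes OffsetAttribute)
        '(org.apache.lucene.analysis.util CharArraySet))

;; Returns a sequence of [surface part-of-speech] pairs for the input string.
(defn morphological-analysis
  [src]
  (let [analyzer (JapaneseAnalyzer. nil
                                    JapaneseTokenizer/DEFAULT_MODE
                                    CharArraySet/EMPTY_SET
                                    #{})
        rdr      (StringReader. src)]
    (with-open [ts (.tokenStream analyzer "field" rdr)]
      (let [^OffsetAttribute offsetAtt (.addAttribute ts OffsetAttribute)
            ^PartOfSpeechAttribute posAtt (.addAttribute ts PartOfSpeechAttribute)
            _ (.reset ts)
            surface #(subs src (.startOffset offsetAtt) (.endOffset offsetAtt))
            pos #(.getPartOfSpeech posAtt)
            tokens (->> #(if (.incrementToken ts)
                           [(surface) (pos)]
                           nil)
                        repeatedly
                        (take-while identity)
                        doall)
            _ (.end ts)]
        tokens))))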
The input sentence is now decomposed into token data. After that, we can use Clojure's hash-set function to collect all the token data into a single hash-set, and we are ready to run the clustering.
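
As a rough sketch of that last step, the snippet below collects token data into one hash-set. The token vectors are hypothetical stand-ins in the [surface part-of-speech] shape that morphological-analysis returns (the sentences and tags are made up for illustration), and the names tokens-per-sentence and token-dataset are assumptions of this example.

```clojure
;; Hypothetical token data, one vector of [surface part-of-speech] pairs per
;; sentence. In practice these would come from morphological-analysis.
(def tokens-per-sentence
  [[["猫" "名詞-一般"] ["走る" "動詞-自立"]]
   [["犬" "名詞-一般"] ["走る" "動詞-自立"]]])

;; Flatten the per-sentence token lists and collect them into one hash-set.
;; Tokens that occur more than once collapse into a single element.
(def token-dataset
  (apply hash-set (apply concat tokens-per-sentence)))

(count token-dataset)
;; => 3
```

Note that the result is only the dataset argument. To pass it to k-means/cluster, a distance function and an average function that work on these token elements would still have to be defined, just as distance and average were defined for dates in the sample.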
For more information on k-means in Python, please refer to "Overview of k-means, its applications, and implementation examples".