tfidfの概要とClojureでの実装

Clojure

2022.04.14 2021.06.01

機械学習技術　人工知能技術　デジタルトランスフォーメーション技術　　サポートベクトルマシン　自然言語処理技術　 Clojure

自然言語処理の処理の一つにtfidfがある。このtfidfはtf()とidfを組み合わせたものとなる。

tfはterm frequencyの略での直訳すると単語の頻度となり、各文書においてその単語がどくらい出現したかを表す指標となる。このtfが大きい単語は頻度が高く出現する単語となり、その文書の特徴を示すという仮説が立てられる。tfは以下の式で表される。

\(\displaystyle tf=\frac{文書Aにおける単語Xの出現頻度}{文書Aにおける全単語の出現頻度の和} =\frac{n_{t,d}}{\sum_{x}n_{s,d}}\)

idfはinverse document frequencyの略で、直訳すると逆文書頻度となり、その単語が様々な文書に出現すると値が小さくなり、滅多に現れない場合は大きくなるものとなる。このidfが大きいと単語のレアさを表すこととなり、その文書の特徴を表すとの仮説が立てられる。idfは以下の式で表される。

\(\displaystyle idf=log(\frac{全文書数}{単語Xを含む文書数})=log\frac{N}{df(t)}+1 \)

tfidfはこれらを掛け合わせたものとなる。つまりifidf=(単語の出現頻度)*(単語のレア度)となり、その単語がよく出現するほど、その単語がレアなほど大きくなる。

ここで実際にA=[スポーツ野球野球バット]という文書と、B=[サッカーサッカーサッカースポーツゴール]という文書があった時、tf(スポーツ、文書A)*idf(スポーツ)=0.25*1=0.25、tf(野球、文書A)*id(野球)=0.5*2=1、tf(バット、文書A)*idf(バット)=0.25*2=0.5、tf(サッカー、文書B)*idf(スサッカー)=0.6*2=1.2、tf(スポーツ、文書B)*id(スポーツ)=0.2*1=0.2、tf(ゴール、文書B)*idf(バゴール)=0.2*2=0.4となる。これによりAとBに共通している「スポーツ」という単語はtfidfが小さくなり、それぞれの文書にしか存在しないでかつ沢山出現している「野球」や「サッカー」の単語のtfidfは大きい値を示す。

これらをClojureで実装したものはHolger Schauer氏の以下のものがある。

(defn freq
  "Returns a map from distinct items in coll to the number of times
  they appear. Returns a stateful transducer when no collection is provided."
  ([]
   (fn [rf]
     (let [freqm (volatile! {})]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (vswap! freqm update input (fnil inc 0))
          (rf result @freqm))))))
  ([coll]
   (into {} (freq) coll)))

(defn normalize-value [maxfreq curfreq]
  "Augment a frequency value, cf. https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2"
  ;; cf. https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2 or
  ;; http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
  (-> (* 0.6 curfreq)
      (/ maxfreq)
      (+ 0.4)))

(defn tf
  "Returns a map of term frequencies for a sequence of words.
Keyword `normalize` defaults to true, returning an augemented term frequency."
  [wordseq & {:keys [normalize] :or {normalize true}}]
  (let [tfreqs (frequencies wordseq)]
    (if-not normalize
      tfreqs
      (let [maxfreq (val (apply max-key val tfreqs))
            normalize-tf (map (fn [[term freq]]
                                   [term (normalize-value maxfreq freq)]))]
        (into {} normalize-tf tfreqs)))))

(defn tfmap-to-termvector [tf-row terms]
  "Convert tf-row into a vector of frequencies (potentially 0) for all terms in tf-row."
  (reduce (fn [tfvec term]
            (conj tfvec (get tf-row term 0)))
          [] terms))

(defn tf-from-docs [documents]
  "Returns a vector of all terms in documents and the related tf-vector for each document"
  (let [tf-rows (map tf documents)
        terms (vec (into #{} (flatten (map keys tf-rows))))]
    (vector terms
            (pmap #(tfmap-to-termvector % terms) tf-rows))))


(defn idf
  "Returns a map of the inverse document frequency for a sequence of texts (sequence of words)."
  [textseq]
  (let [alltfs (map tf textseq)
        termdoccount (reduce (fn [result tfmap]
                               (reduce (fn [resmap [term _]]
                                           (update resmap term (fnil inc 0)))
                                       result tfmap))
                             {} alltfs)
        doccount (count textseq)]
    (reduce (fn [resmap [term docswithterm]]
              (assoc resmap term
                     (Math/log (/ (+ doccount 1) ; apply smoothing!
                                  (+ docswithterm 1)))))
            {} termdoccount)))

(defn tfidf
  "Returns a sequence of the terms and the tf-idf values for a sequence of texts (sequence of words)."
  [textseq]
  (let [alltfs (pmap tf textseq)
        termdoccount (reduce (fn [result tfmap]
                               (reduce (fn [resmap [term _]]
                                           (update resmap term (fnil inc 0)))
                                       result tfmap))
                             {} alltfs)
        terms (keys termdoccount)
        doccount (count textseq)
        idf (reduce (fn [resmap [term docswithterm]]
              (assoc resmap term
                     (Math/log10 (/ doccount ; Note: no smoothing here!
                                    docswithterm))))
            {} termdoccount)
        matrix (pmap (fn [tfpdoc]
                      (map (fn [term]
                             (* (get tfpdoc term 0)
                                (get idf term 0)))
                           terms))
                    alltfs)]
    [terms matrix]))

(defn normalize-tf-xf
  "Returns a normalization of frequencies (either a single map or a collection of maps). 
Returns a transducer when no collection is provided."
  ([]
   (fn [rf]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (let [newmax (val (apply max-key val input))
                normalize-maxvalue (partial normalize-value newmax)
                normalize-termfreq (juxt key (comp normalize-maxvalue val))]
            (rf result (into {} (map normalize-termfreq input))))))))
  ([freqs]
   (cond
     (map? freqs) (into {} (normalize-tf-xf) [freqs])
     (sequential? freqs) (into {} (normalize-tf-xf) freqs)
     :else (throw (ex-info "Don't know how to normalize non-sequential / non-map like type"
                           {:data freqs})))))

(def norm-tf-xf
  "Transducer that will return normalized frequencies."
  (comp (freq) (normalize-tf-xf)))

(defn tf
  "Returns a map of term frequencies for a sequence of words.
Keyword `normalize` defaults to true, returning an augemented term frequency."
  [wordseq & {:keys [normalize] :or {normalize true}}]
  (if normalize
    (into {} norm-tf-xf wordseq)
    (freq wordseq)))

(defn tf-from-docs-xf
  "Returns a map of terms with the number of documents a term appears in and a list of related tf-vector for each document, sorted according to the terms.
Returns a stateful transducer when no collection is provided."
  ([]
   (fn [rf]
     (let [termdoccount (volatile! (sorted-map))
           tfs (volatile! [])]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (let [newtdcount ; re-compute for each term how many documents contain it
                (reduce (fn [newtdcount term]
                          (update newtdcount term (fnil inc 0))) ; inc #term, even if missing (=0)
                        @termdoccount (keys input))
                termcount (count (keys newtdcount))              ; determine |terms|
                termzeromap (into (sorted-map)                   ; build up a sorted map
                                (zipmap (keys newtdcount)        ; of terms with vectors of
                                        (repeat termcount 0)))   ; length |terms| all set to 0
                currows (map (fn [tfdoc]                         ; re-map all existing tfs
                               (vals (merge termzeromap tfdoc))) ; so that they contains all terms
                             @tfs)                               ; with 0 or the old tf value
                newrow (vals (merge termzeromap input))
                currows (conj currows newrow)]
            (vswap! tfs conj input)
            (vreset! termdoccount newtdcount)
            (rf result {:terms @termdoccount :tfs currows})))))))
  ([coll]
   (into {} (tf-from-docs-xf) (map tf coll))))

(defn idf-xf
  "Returns a map of the inverse document frequency for some documents. Expects the input to be a collection of sorted(!) maps of terms with number of documents the term appears in and a list of term frequencies.
Returns a transducer when called without a collection."
  ([]
   (fn [rf]
     (fn
       ([] (rf))
       ([result] (rf result))
       ([result input]
        (let [doccount (count (:tfs input))
              terms (map (fn [[term docswithterm]]
                           [term {:doccount docswithterm
                                  :idf (Math/log10 (/ doccount docswithterm))}])
                         (:terms input))]
          (rf result {:terms (into (sorted-map) terms) :tfs (:tfs input)}))))))
  ([coll]
   (into {} (idf-xf) coll)))

(defn tfidf-xf
  "Returns a map of the terms, the tf and  tf-idf values for a sequence of texts (sequence of words),
given an input collection of sorted(!) maps of terms/doccount/idf and tf values.
  Returns a transducer if called without a collection."
  ([]
   (fn [rf]
     (fn
       ([] (rf))
       ([result] (rf result))
       ([result input]
        (let [tfidfs
              (map (fn [tfdoc]
                     ;; make use of the fact that the tf values are placed at exactly
                     ;; the same position as their corresponding term in the term vector
                     ;; by mapping over both tf and term vector in parallel
                     (map (fn [tfvalue [term {doccount :doccount idf :idf}]]
                            (* tfvalue idf))
                          tfdoc (:terms input)))
                   (:tfs input))]
          (rf result {:terms (:terms input) :tfs (:tfs input) :tfidfs tfidfs}))))))
  ([coll]
   (into {} (tfidf-xf) coll)))

自然言語処理のベースとなる生テキストの構造化の為の各種手法まとめ | Deus Ex Machina より:

2021年7月9日 12:50 PM

[…] より粒度を細かくするには、より小さな次元を扱うために、コレスポンデンス分析で得られるような縮小された空間を使用し、同様の意味を持つ単語を同じ次元に自動的にマッピングすることができる。このような技術の有名な例は、特異値分解を使用した潜在的な意味のインデックスとして知られている(Deerwester et al. 1990)。 TFIDF (Term frequency-Inverse document fre- quency) (Robertson and Spärck Jones 1976)は、コーパス中の用語の出現頻度を考慮して、文書（bag of word）と用語との関連性を採点するために使用される、非常に一般的な尺度となる。これは通常、類似性の尺度ではなく、ある用語と文書との関連性を評価するものになる。ここでは、文書中の文字列の出現頻度とコーパス全体での頻度を比較することで、文字列に対する部分文字列の関連性を評価するために使用される。(参照:clojureでの実装) […]

返信
DX,AI,MLのベースとなる自然言語処理の概要 | Deus Ex Machina より:

2021年7月15日 4:44 AM

[…] tfidfの概要とClojureによる実装 tfidfの概要とClojureによる実装 […]

返信
AI、ML、DX等に活用可能な関数型プログラミングClojure | Deus Ex Machina より:

2021年7月15日 6:59 AM

[…] tfidfの概要とClojureによる実装 tfidfの概要とClojureによる実装 […]

返信
Clojureでの自然言語処理ツールclojuer-openNLP | Deus Ex Machina より:

2022年4月14日 6:36 PM

[…] このようにして抽出した単語から、不要な単語を抜き出したり(stop-word除去)、ベクトル化(one-hot-vector)を行うことで機械学習が行えるようになる。シンプルな機械学習例としては、検索等で用いられるtfidfを使った単語の頻度抽出等がある。 […]

返信
画像認識システムの概要と実装 | Deus Ex Machina より:

2023年11月27日 4:57 AM

[…] データをベクトル化する方法を選択する。一般的な手法には、Bag-of-Words(BoW)、”tfidfの概要とClojureでの実装“で述べているTF-IDF、”Word2Vec“で述べているWord2Vec、”BERTの概要とアルゴリズ […]

返信
自然言語処理技術を用いてテキスト情報から感情コンテキストを抽出する | Deus Ex Machina より:

2023年12月5日 3:59 AM

[…] モデルに入力するために、テキストデータから数値データを抽出する。具体的には、単語の埋め込み表現（Word Embeddings）やTF-IDF（単語の重要度を評価する手法）を使用して、テキストを数値ベクトルに変換する。単語の埋め込み表現に関しては”自然言語処理を用いた語彙学習について“を、TF-IDFに関しては”tfidfの概要とClojureでの実装“も参照のこと。 […]

返信
Sentiment Lexiconsを用いた統計的手法について | Deus Ex Machina より:

2023年12月19日 4:29 AM

[…] を組み合わせるアプローチでは、単語の重要性を考慮して感情分析を行う。”tfidfの概要とClojureでの実装“で述べているTF-IDF（Term Frequency-Inverse Document Frequency）は、単語の出現頻度 […]

返信
自動要約技術の概要とアルゴリズムおよび実装例について | Deus Ex Machina より:

2024年8月27日 4:29 AM

[…] “tfidfの概要とClojureでの実装“でも述べているTF-IDFは、文中の各単語の出現頻度（Term Frequency）と、その単語が文書集合全体での重要度（Inverse Document Frequency）を組み合わせて文の重要度を評価する手法となる。単語のTF-IDFスコアが高い文が、重要な文として選択される。 […]

返信