Overview of tfidf and its implementation in Clojure

Artificial Intelligence Technology    Machine Learning Technology    Natural Language Processing   Clojure 

One of the processes in natural language processing is tfidf. This tfidf is a combination of tf() and idf.

tf is an abbreviation for term frequency, which directly translates to the frequency of a word, and is an indicator of how often a word appears in each document. It is hypothesized that a word with a high tf is a word that appears frequently, indicating the characteristics of the document. tf is expressed by the following equation.

\(\displaystyle tf=\frac{文書Aにおける単語Xの出現頻度}{文書Aにおける全単語の出現頻度の和} =\frac{n_{t,d}}{\sum_{x}n_{s,d}}\)

The idf is an abbreviation for inverse document frequency, which means that the value becomes smaller when the word appears in various documents and larger when it rarely appears. It is hypothesized that a large value of idf indicates the rarity of the word, and thus the characteristics of the document. idf is expressed by the following equation.

\(\displaystyle idf=log(\frac{全文書数}{単語Xを含む文書数})=log\frac{N}{df(t)}+1 \)

The tfidf is the sum of these factors. In other words, ifidf=(frequency of word)*(rarity of word), where the more often the word appears and the rarer the word is, the larger the tfidf.

If we have two documents, A=[sports, baseball, baseball bat] and B=[soccer, soccer, sports goal], then tf(sports, document A)*idf(sports) = 0.25*1=0.25, tf(baseball, document A)*id(baseball) = 0.5*2=1, tf(bat, document A)*idf(baseball) Document A)*idf(bat)=0.25*2=0.5, tf(soccer, Document B)*idf(soccer)=0.6*2=1.2, tf(sports, Document B)*idf(sports)=0.2*1=0.2, tf(goal, Document B)*idf(goal)=0.2*2=0.4. As a result, the word “sports”, which is common to both documents A and B, has a small tfidf, while the words “baseball” and “soccer”, which are found only in each document and appear in many documents, have large tfidf values.

The following is Holger Schauer’s implementation of these in Clojure.

(defn freq
  "Returns a map from distinct items in coll to the number of times
  they appear. Returns a stateful transducer when no collection is provided."
  ([]
   (fn [rf]
     (let [freqm (volatile! {})]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (vswap! freqm update input (fnil inc 0))
          (rf result @freqm))))))
  ([coll]
   (into {} (freq) coll)))

(defn normalize-value [maxfreq curfreq]
  "Augment a frequency value, cf. https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2"
  ;; cf. https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2 or
  ;; http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
  (-> (* 0.6 curfreq)
      (/ maxfreq)
      (+ 0.4)))

(defn tf
  "Returns a map of term frequencies for a sequence of words.
Keyword `normalize` defaults to true, returning an augemented term frequency."
  [wordseq & {:keys [normalize] :or {normalize true}}]
  (let [tfreqs (frequencies wordseq)]
    (if-not normalize
      tfreqs
      (let [maxfreq (val (apply max-key val tfreqs))
            normalize-tf (map (fn [[term freq]]
                                   [term (normalize-value maxfreq freq)]))]
        (into {} normalize-tf tfreqs)))))

(defn tfmap-to-termvector [tf-row terms]
  "Convert tf-row into a vector of frequencies (potentially 0) for all terms in tf-row."
  (reduce (fn [tfvec term]
            (conj tfvec (get tf-row term 0)))
          [] terms))

(defn tf-from-docs [documents]
  "Returns a vector of all terms in documents and the related tf-vector for each document"
  (let [tf-rows (map tf documents)
        terms (vec (into #{} (flatten (map keys tf-rows))))]
    (vector terms
            (pmap #(tfmap-to-termvector % terms) tf-rows))))


(defn idf
  "Returns a map of the inverse document frequency for a sequence of texts (sequence of words)."
  [textseq]
  (let [alltfs (map tf textseq)
        termdoccount (reduce (fn [result tfmap]
                               (reduce (fn [resmap [term _]]
                                           (update resmap term (fnil inc 0)))
                                       result tfmap))
                             {} alltfs)
        doccount (count textseq)]
    (reduce (fn [resmap [term docswithterm]]
              (assoc resmap term
                     (Math/log (/ (+ doccount 1) ; apply smoothing!
                                  (+ docswithterm 1)))))
            {} termdoccount)))

(defn tfidf
  "Returns a sequence of the terms and the tf-idf values for a sequence of texts (sequence of words)."
  [textseq]
  (let [alltfs (pmap tf textseq)
        termdoccount (reduce (fn [result tfmap]
                               (reduce (fn [resmap [term _]]
                                           (update resmap term (fnil inc 0)))
                                       result tfmap))
                             {} alltfs)
        terms (keys termdoccount)
        doccount (count textseq)
        idf (reduce (fn [resmap [term docswithterm]]
              (assoc resmap term
                     (Math/log10 (/ doccount ; Note: no smoothing here!
                                    docswithterm))))
            {} termdoccount)
        matrix (pmap (fn [tfpdoc]
                      (map (fn [term]
                             (* (get tfpdoc term 0)
                                (get idf term 0)))
                           terms))
                    alltfs)]
    [terms matrix]))

(defn normalize-tf-xf
  "Returns a normalization of frequencies (either a single map or a collection of maps). 
Returns a transducer when no collection is provided."
  ([]
   (fn [rf]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (let [newmax (val (apply max-key val input))
                normalize-maxvalue (partial normalize-value newmax)
                normalize-termfreq (juxt key (comp normalize-maxvalue val))]
            (rf result (into {} (map normalize-termfreq input))))))))
  ([freqs]
   (cond
     (map? freqs) (into {} (normalize-tf-xf) [freqs])
     (sequential? freqs) (into {} (normalize-tf-xf) freqs)
     :else (throw (ex-info "Don't know how to normalize non-sequential / non-map like type"
                           {:data freqs})))))

(def norm-tf-xf
  "Transducer that will return normalized frequencies."
  (comp (freq) (normalize-tf-xf)))

(defn tf
  "Returns a map of term frequencies for a sequence of words.
Keyword `normalize` defaults to true, returning an augemented term frequency."
  [wordseq & {:keys [normalize] :or {normalize true}}]
  (if normalize
    (into {} norm-tf-xf wordseq)
    (freq wordseq)))

(defn tf-from-docs-xf
  "Returns a map of terms with the number of documents a term appears in and a list of related tf-vector for each document, sorted according to the terms.
Returns a stateful transducer when no collection is provided."
  ([]
   (fn [rf]
     (let [termdoccount (volatile! (sorted-map))
           tfs (volatile! [])]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (let [newtdcount ; re-compute for each term how many documents contain it
                (reduce (fn [newtdcount term]
                          (update newtdcount term (fnil inc 0))) ; inc #term, even if missing (=0)
                        @termdoccount (keys input))
                termcount (count (keys newtdcount))              ; determine |terms|
                termzeromap (into (sorted-map)                   ; build up a sorted map
                                (zipmap (keys newtdcount)        ; of terms with vectors of
                                        (repeat termcount 0)))   ; length |terms| all set to 0
                currows (map (fn [tfdoc]                         ; re-map all existing tfs
                               (vals (merge termzeromap tfdoc))) ; so that they contains all terms
                             @tfs)                               ; with 0 or the old tf value
                newrow (vals (merge termzeromap input))
                currows (conj currows newrow)]
            (vswap! tfs conj input)
            (vreset! termdoccount newtdcount)
            (rf result {:terms @termdoccount :tfs currows})))))))
  ([coll]
   (into {} (tf-from-docs-xf) (map tf coll))))

(defn idf-xf
  "Returns a map of the inverse document frequency for some documents. Expects the input to be a collection of sorted(!) maps of terms with number of documents the term appears in and a list of term frequencies.
Returns a transducer when called without a collection."
  ([]
   (fn [rf]
     (fn
       ([] (rf))
       ([result] (rf result))
       ([result input]
        (let [doccount (count (:tfs input))
              terms (map (fn [[term docswithterm]]
                           [term {:doccount docswithterm
                                  :idf (Math/log10 (/ doccount docswithterm))}])
                         (:terms input))]
          (rf result {:terms (into (sorted-map) terms) :tfs (:tfs input)}))))))
  ([coll]
   (into {} (idf-xf) coll)))

(defn tfidf-xf
  "Returns a map of the terms, the tf and  tf-idf values for a sequence of texts (sequence of words),
given an input collection of sorted(!) maps of terms/doccount/idf and tf values.
  Returns a transducer if called without a collection."
  ([]
   (fn [rf]
     (fn
       ([] (rf))
       ([result] (rf result))
       ([result input]
        (let [tfidfs
              (map (fn [tfdoc]
                     ;; make use of the fact that the tf values are placed at exactly
                     ;; the same position as their corresponding term in the term vector
                     ;; by mapping over both tf and term vector in parallel
                     (map (fn [tfvalue [term {doccount :doccount idf :idf}]]
                            (* tfvalue idf))
                          tfdoc (:terms input)))
                   (:tfs input))]
          (rf result {:terms (:terms input) :tfs (:tfs input) :tfidfs tfidfs}))))))
  ([coll]
   (into {} (tfidf-xf) coll)))

コメント

タイトルとURLをコピーしました