Clojure stopword removal


This post shows a Clojure implementation of stopword removal, a natural language processing step that strips out words that are not needed for the analysis.

When working with word vectors, a word dictionary like the following is created first:

( "5"
"①"
"④"
"30"
"②"
"③"
"2"
"⑤"
"⑥"
...)

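Remove the stopwords from the word list with the remove function as follows. The stopword file is read as a Clojure collection literal and converted to a set.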

;; Assumes data/stopword.txt holds a readable collection of words, e.g. ("a" "the").
(def ex-stopword
  (->> raw-word-list
       (remove (set (read-string (slurp "data/stopword.txt"))))))
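
This works because a Clojure set can be called as a function: it returns the element when it is a member and nil otherwise, so remove drops every word found in the set. A minimal self-contained sketch follows. The names sample-words, sample-stopwords, and kept-words are hypothetical stand-ins for raw-word-list and the contents of data/stopword.txt, which are not shown in this post.

```clojure
;; Hypothetical sample data standing in for raw-word-list and the
;; contents of data/stopword.txt.
(def sample-words ["this" "is" "a" "pen"])
(def sample-stopwords #{"is" "a"})

;; A set acts as a predicate: (sample-stopwords "is") returns "is" (truthy),
;; while (sample-stopwords "pen") returns nil (falsey).
(def kept-words
  (->> sample-words
       (remove sample-stopwords)))

(println kept-words) ; prints (this pen)
```

Converting the stopwords to a set first keeps each membership check fast, which matters when the stopword list is long.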

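Another approach reads the stopwords from a file with one word per line and filters the tokens of each sentence against the resulting set.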

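(require '[clojure.java.io :as io])

;; Read one stopword per line and return the words as a set.
(defn load-stopwords [filename]
  (with-open [r (io/reader filename)]
    (set (doall (line-seq r)))))

(def is-stopword (load-stopwords "stopwords/english"))

;; tokenize, normalize, and get-sentences are assumed to be defined elsewhere.
(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
         I never hope to see one.
         But I can tell you, anyhow.
         I'd rather see than be one.")))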

The code above is adapted from the Clojure Data Analysis Cookbook, 2nd edition.
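
The adapted snippet relies on tokenize, normalize, and get-sentences, which are not defined in this post. To try the same pipeline without them, the sketch below substitutes simple stand-ins built on clojure.string and regular expressions. The functions simple-sentences, simple-tokenize, and simple-normalize and the inline set sample-stopword? are assumptions made for illustration, not the cookbook's definitions.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-ins; not the cookbook's implementations.
(defn simple-sentences
  "Splits text into sentences at whitespace that follows ., ! or ?."
  [text]
  (str/split (str/trim text) #"(?<=[.!?])\s+"))

(defn simple-tokenize
  "Extracts runs of letters and apostrophes as tokens."
  [sentence]
  (re-seq #"[A-Za-z']+" sentence))

(defn simple-normalize
  "Lower-cases every token."
  [tokens]
  (map str/lower-case tokens))

;; A small inline stopword set, assumed here in place of a stopword file.
(def sample-stopword? #{"i" "a" "to" "but" "can" "you" "than" "be"})

(def sample-tokens
  (map #(remove sample-stopword? (simple-normalize (simple-tokenize %)))
       (simple-sentences
         "I never saw a Purple Cow.
          I never hope to see one.")))

(println sample-tokens)
;; prints ((never saw purple cow) (never hope see one))
```

Lower-casing before the stopword check matters here: the set holds "i", so the token "I" is only removed after normalization.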

The Python implementation is shown below for reference.

words = ['this', 'is', 'a', 'pen'] 
stop_words = ['is', 'a'] 
changed_words = [word for word in words if word not in stop_words] 
print(changed_words)
