Introduction to tools for natural language processing, linguistic analysis, and machine learning


Introduction to various tools for natural language processing

In the previous article, we gave an overview of natural language processing. In this article, we will discuss the various tools that are essential for putting natural language processing into practice.

For processing raw text, there are data cleansing tools such as OpenRefine, as well as similarity evaluation tools.

Other open-source software (OSS) tools are listed below by category.

1. Speech Recognition Tools

CMU Sphinx, a widely-used speech recognition program.

Juicer (a speech recognition decoder using weighted finite state transducers)

Julius, an open-source, high-performance, general-purpose, large-vocabulary continuous speech recognition engine for research and development of speech recognition systems.

2. Language Models

IRSTLM (a tool for training and storing language models)

KenLM (a memory-efficient and fast tool for storing and querying language models)

Kylm (a language model toolkit with features such as weighted finite state transducer output and character-based modeling of unknown words, implemented in Java)

RandLM (a toolkit that uses the Bloom filter, a randomized data structure, to hold large language models in a small amount of memory)

SRILM (An efficient n-gram language modeling toolkit. It includes various smoothing methods (Kneser-Ney, etc.), class language models, interpolation of multiple models, etc.)
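
To make the idea of an n-gram language model with smoothing more concrete, here is a minimal sketch in plain Python. It is not how SRILM or any of the toolkits above are implemented. It uses simple linear interpolation of bigram and unigram estimates rather than Kneser-Ney, and the function names, the toy corpus, and the weight lam=0.7 are all assumptions made for illustration.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Linearly interpolate the bigram and unigram maximum-likelihood estimates."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

# Toy corpus (hypothetical data, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_lm(corpus)

p_seen = bigram_prob("the", "cat", unigrams, bigrams)    # bigram observed in the corpus
p_unseen = bigram_prob("cat", "dog", unigrams, bigrams)  # bigram never observed
print(p_seen, p_unseen)
```

The point of the interpolation is that the unseen bigram still receives a small non-zero probability from the unigram term, which is the basic problem that all smoothing methods address.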

3. Language Processing Libraries

NLTK (a language processing library written in Python)

OpenNLP (a general language processing library written in Java)

Stanford CoreNLP (a library including NLP tools created at Stanford University)

4. Pronunciation Estimation

KyTea (a complete text analysis toolkit for word segmentation and pronunciation estimation)

mpaligner (a tool for aligning the character-pronunciation correspondences needed when learning a pronunciation estimation system)

Phonetisaurus (a WFST-based toolkit for grapheme-to-phoneme conversion, i.e. converting written forms to pronunciations)

5. Phrase Structure Analysis

Berkeley Parser (PCFG parser. With models for English, Arabic, Chinese, French, German, Bulgarian, etc.)

Charniak Parser (CFG parser for English)

Egret (a probabilistic context-free grammar (PCFG) parser that can output packed forests and n-best lists)

EVALB (a script to evaluate the results of phrase structure analysis)

Stanford Parser (A parser that performs CFG parsing and dependency parsing simultaneously. Models for English, Chinese, Arabic, French, and German are available.)
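
As an illustration of what evaluating phrase structure analysis involves, the following sketch computes labeled bracket precision, recall, and F1 between a gold tree and a predicted tree. This is a simplified version of the kind of score EVALB reports, not a reimplementation of it: the nested-tuple tree format and the function names are assumptions, and the real script applies additional conventions that are ignored here.

```python
from collections import Counter

def labeled_spans(tree, start=0, spans=None):
    """Collect (label, start, end) for each phrase node; return (end, spans)."""
    if spans is None:
        spans = Counter()
    if isinstance(tree, str):              # a word occupies one position
        return start + 1, spans
    label, children = tree[0], tree[1:]
    end = start
    for child in children:
        end, _ = labeled_spans(child, end, spans)
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if not is_preterminal:                 # part-of-speech nodes are not scored
        spans[(label, start, end)] += 1
    return end, spans

def bracket_prf(gold, predicted):
    """Labeled bracket precision, recall, and F1 of predicted against gold."""
    _, g = labeled_spans(gold)
    _, p = labeled_spans(predicted)
    matched = sum((g & p).values())        # multiset intersection of brackets
    precision = matched / sum(p.values())
    recall = matched / sum(g.values())
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# Trees as nested tuples: (label, child, child, ...); leaves are words.
gold = ("S", ("NP", ("PRP", "I")),
             ("VP", ("VBD", "saw"), ("NP", ("DT", "the"), ("NN", "dog"))))
# The predicted tree misses the object NP and attaches its words flatly to VP.
predicted = ("S", ("NP", ("PRP", "I")),
                  ("VP", ("VBD", "saw"), ("DT", "the"), ("NN", "dog")))

precision, recall, f1 = bracket_prf(gold, predicted)
print(precision, recall, f1)
```

In this toy case every predicted bracket is correct, so precision is 1.0, but one of the four gold brackets is missing, so recall is 0.75.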

6. Morphological Analysis

ChaSen (a morphological analyzer based on hidden Markov models (HMMs))

JUMAN (Morphological analyzer for Japanese. It adds a variety of semantic information in addition to parts of speech.)

KyTea (a domain-adaptable morphological analyzer that is robust to unknown words)

MeCab (a morphological analyzer that uses conditional random fields (CRFs))

Sen, a morphological analyzer written in Java.

Sudachi (a morphological analyzer from Works Applications; a comparatively recent release with up-to-date dictionaries and high speed)
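
Many morphological analyzers can be thought of as searching a lattice of dictionary words for the lowest-cost path. The sketch below shows only that search, using dynamic programming over a toy dictionary. It is an illustration under strong assumptions: the dictionary and its costs are made up, only per-word costs are used (no connection costs between adjacent words), and none of the tools above is claimed to work exactly this way.

```python
def segment(text, word_costs):
    """Return the minimum-cost segmentation of text into dictionary words."""
    INF = float("inf")
    best = [INF] * (len(text) + 1)   # best[i]: lowest cost to cover text[:i]
    back = [None] * (len(text) + 1)  # back[i]: start index of the last word
    best[0] = 0.0
    max_len = max(len(w) for w in word_costs)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_costs and best[start] + word_costs[word] < best[end]:
                best[end] = best[start] + word_costs[word]
                back[end] = start
    if best[-1] == INF:
        return None                  # no sequence of dictionary words covers the text
    words, end = [], len(text)
    while end > 0:                   # follow back-pointers from the end
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

# Hypothetical dictionary with made-up costs (lower means more plausible).
word_costs = {"東": 5, "京都": 3, "東京": 3, "都": 4, "東京都": 4, "に": 1, "住む": 3}
result = segment("東京都に住む", word_costs)
print(result)
```

Several segmentations are possible here (for example 東 + 京都 or 東京 + 都), and the search picks the one whose total cost is lowest. Real analyzers learn these costs from annotated data instead of setting them by hand.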

7. Finite-State Models

Kyfd (a decoder for text processing systems built with weighted finite state transducers (WFST))

OpenFst (A library that implements various algorithms for weighted finite state transducers (WFST). Useful for building systems that use finite state models)
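
To give a feel for what a weighted finite state transducer does, here is a small sketch that finds the lowest-cost output for an input sequence, adding arc weights along a path and keeping the minimum (the tropical semiring). The arc table format, the function name, and the toy letter-to-sound arcs are all assumptions for illustration; this does not use or imitate the OpenFst API, and it handles neither epsilon arcs nor composition.

```python
def best_transduction(arcs, start, finals, inputs):
    """Return (cost, outputs) of the cheapest accepting path, or None."""
    # hyps maps each reachable state to its best (cost, output symbols so far).
    hyps = {start: (0.0, [])}
    for symbol in inputs:
        next_hyps = {}
        for state, (cost, out) in hyps.items():
            for in_sym, out_sym, weight, dest in arcs.get(state, []):
                if in_sym != symbol:
                    continue
                cand = (cost + weight, out + [out_sym])
                # Keep only the cheapest hypothesis per destination state.
                if dest not in next_hyps or cand[0] < next_hyps[dest][0]:
                    next_hyps[dest] = cand
        hyps = next_hyps
    finished = [h for s, h in hyps.items() if s in finals]
    return min(finished, key=lambda h: h[0]) if finished else None

# Hypothetical single-state transducer: (input, output, weight, next state).
arcs = {
    0: [
        ("c", "k", 0.5, 0),   # "c" pronounced as k (cheaper)
        ("c", "s", 1.0, 0),   # "c" pronounced as s (more expensive)
        ("a", "ae", 0.2, 0),
        ("t", "t", 0.1, 0),
    ]
}
cost, output = best_transduction(arcs, start=0, finals={0}, inputs=list("cat"))
print(cost, output)
```

Libraries for finite-state models provide general versions of this kind of search, together with operations such as composition and determinization, so systems can be built by combining several small transducers.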

8. Machine Translation Alignment

Berkeley Aligner (a word alignment program that implements both supervised and unsupervised alignment)

GIZA++ (a standard word alignment tool that implements the IBM models)

pialign (A phrase alignment tool based on inversion transduction grammar (ITG). It learns a compact model while maintaining accuracy.)
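
The simplest of the IBM models used for word alignment, IBM Model 1, can be sketched in a few lines. The code below estimates word translation probabilities t(f | e) with the EM algorithm on a three-sentence toy corpus. It is a teaching sketch under assumptions, not the GIZA++ implementation: the NULL word is omitted, the function names and data are made up, and the higher IBM models are not covered.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM; pairs is a list of (e_tokens, f_tokens)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in pairs:
            for f in fs:
                # E-step: distribute one count for f over the words in es.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per e word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def align(es, fs, t):
    """Link each f word to the e word with the highest t(f | e)."""
    return [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]

# Toy parallel corpus (hypothetical data): English side, German side.
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm_model1(pairs)
links = align(["the", "house"], ["das", "Haus"], t)
print(links)
```

Even though no sentence pair says which word corresponds to which, the co-occurrence pattern across sentences is enough for EM to concentrate probability on pairs such as das/the and Buch/book.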

9. Machine Translation Decoder

cdec (a decoder that implements recent work on tree-based and forest-based machine translation)

Joshua (a decoder for syntax-based translation)

Moses (a standard machine translation decoder. It supports phrase-based and tree-based machine translation.)

Travatar, a tree-to-string decoder for translation using syntactic information.

10. Machine Translation Evaluation

METEOR (a tool for calculating the METEOR evaluation metric, which takes into account synonyms, stemming, word order, etc.)

multeval (a tool for evaluating machine translation based on multiple evaluation criteria and considering statistical significance)

RIBES (a program to evaluate the accuracy of reordering of machine translation results)
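
Evaluating reordering comes down to asking how well the word order of the system output agrees with the word order of the reference. One way to measure that is a rank correlation such as Kendall's tau. The sketch below computes a normalized version of it (the fraction of word pairs in the same relative order). It is not the RIBES program: it assumes each word occurs only once in the reference, the function name is made up, and the other terms of the full RIBES score are omitted.

```python
from itertools import combinations

def normalized_kendall_tau(reference, hypothesis):
    """Fraction of word pairs that appear in the same relative order in both."""
    ref_pos = {w: i for i, w in enumerate(reference)}
    # Reference positions of the hypothesis words, listed in hypothesis order.
    ranks = [ref_pos[w] for w in hypothesis if w in ref_pos]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 0.0
    concordant = sum(1 for a, b in pairs if a < b)
    return concordant / len(pairs)

reference = "he bought a book yesterday".split()
mostly_ordered = "yesterday he bought a book".split()   # one word moved
mostly_reversed = "book a bought he yesterday".split()  # most pairs inverted

score_same = normalized_kendall_tau(reference, reference)
score_good = normalized_kendall_tau(reference, mostly_ordered)
score_bad = normalized_kendall_tau(reference, mostly_reversed)
print(score_same, score_good, score_bad)
```

All three hypotheses contain exactly the reference words, so a pure word-overlap measure would not separate them, while the order-based score does.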

11. Machine Learning

AROW++ (an implementation of Adaptive Regularization of Weight Vectors (AROW), an online classifier that is robust to noise)

Classias (a library that implements various classifiers based on online and batch learning)

CRF++ (A toolkit for conditional random fields (CRFs) used in sequence labeling. Feature templates are easy to specify, which is useful for experimenting with various features)

CRFsuite (An implementation of conditional random fields (CRFs) that provides fast learning)

LIBLINEAR (A library that implements linear classifiers such as linear SVM and logistic regression. Training is very fast)

LIBSVM (An SVM learning tool that supports various options)

Mallet (A machine learning toolkit for natural language processing. Features include Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF), etc. Implemented in Java.)

SVM-Light (an efficient SVM library)

Weka (a machine learning library that implements a variety of learning algorithms)
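
Several of the tools above train linear classifiers, either online (one example at a time) or in batch. As a rough illustration of online learning, here is a plain perceptron over sparse features. It is deliberately the simplest algorithm of this family and is not AROW, an SVM, or logistic regression. The function names and the tiny sentiment data set are assumptions for illustration.

```python
def score(weights, features):
    """Dot product between a sparse weight vector and a sparse feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def train_perceptron(examples, epochs=10):
    """examples: list of (feature_dict, label) with label in {-1, +1}."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            if label * score(weights, features) <= 0:   # wrong or undecided
                # Move the weights toward the correct label.
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + label * v
    return weights

def predict(weights, features):
    return 1 if score(weights, features) > 0 else -1

# Hypothetical bag-of-words training data: +1 positive, -1 negative.
examples = [
    ({"good": 1, "movie": 1}, 1),
    ({"bad": 1, "movie": 1}, -1),
    ({"great": 1, "film": 1}, 1),
    ({"boring": 1, "film": 1}, -1),
]
weights = train_perceptron(examples)
pred_pos = predict(weights, {"good": 1, "film": 1})
pred_neg = predict(weights, {"boring": 1, "movie": 1})
print(weights, pred_pos, pred_neg)
```

On this data the weights for neutral words such as "movie" and "film" end up at zero, while the sentiment-bearing words keep positive or negative weights. The libraries listed above differ mainly in how they choose the size and direction of each update and in how they regularize the weights.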

12. Dependency and Case Analysis

CaboCha (a Japanese dependency parser based on cascaded chunking)

KNP (a Japanese dependency and case structure analyzer)

MaltParser (a dependency parser based on shift-reduce parsing)

MSTParser (a dependency parser based on maximum spanning trees)
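
To show what shift-reduce parsing means, the sketch below applies a given sequence of actions in an arc-standard style transition system and collects the resulting head-dependent arcs. In a real parser a trained classifier chooses each action from the current stack and buffer. Here the actions are supplied by hand, and the action names and function name are assumptions, so this shows only the mechanics and not MaltParser itself.

```python
def parse_arc_standard(words, actions):
    """Apply SHIFT / LEFT / RIGHT actions; return (head, dependent) index arcs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":        # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT":       # second-from-top becomes a dependent of top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT":      # top becomes a dependent of second-from-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs

words = ["She", "reads", "books"]
actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
arcs = parse_arc_standard(words, actions)

# Print each arc as head -> dependent using the words themselves.
for head, dependent in arcs:
    print(words[head], "->", words[dependent])
```

After the last action only the index of "reads" remains on the stack, which makes it the root of the sentence. Each word is shifted once and removed once, so the number of actions grows linearly with the sentence length.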

In the next article, we will look at how some of these tools are actually used.