
More test cases - reached 80% coverage #156


Merged: 20 commits, Nov 10, 2018
8 changes: 7 additions & 1 deletion .travis.yml
@@ -4,10 +4,16 @@
language: python
python:
- "3.6"

# workaround to make boto work on travis
# from https://github.com/travis-ci/travis-ci/issues/7940
before_install:
- sudo rm -f /etc/boto.cfg

# command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
install:
- pip install -r requirements.txt
- pip install .[icu,ner,pos,tokenize,transliterate]
- pip install .[icu,ipa,ner,thai2vec]
- pip install coveralls

os:
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -23,8 +23,8 @@ We use the famous [gitflow](http://nvie.com/posts/a-successful-git-branching-model/)
- Write tests for your new features (please see "Tests" topic below);
- Always remember that [commented code is dead
code](http://www.codinghorror.com/blog/2008/07/coding-without-comments.html);
- Name identifiers (variables, classes, functions, module names) with readable
names (`x` is always wrong);
- Name identifiers (variables, classes, functions, module names) with meaningful
and pronounceable names (`x` is always wrong);
- When manipulating strings, use [Python's new-style
formatting](http://docs.python.org/library/string.html#format-string-syntax)
(`'{} = {}'.format(a, b)` instead of `'%s = %s' % (a, b)`);
@@ -55,7 +55,7 @@ Happy hacking! (;
## newmm (onecut), mm, TCC, and Thai Soundex Code
- Korakot Chaovavanich

## Thai2Vec & ulmfit
## Thai2Vec & ULMFiT
- Charin Polpanumas

## Docs
22 changes: 8 additions & 14 deletions README-pypi.md
@@ -10,20 +10,14 @@

PyThaiNLP is a Python library for natural language processing (NLP) of the Thai language.

PyThaiNLP features include Thai word and subword segmentations, soundex, romanization, part-of-speech taggers, and spelling corrections.

## What's new in version 1.7 ?

- Deprecate Python 2 support. (Python 2 compatibility code will be completely dropped in PyThaiNLP 1.8)
- Refactor pythainlp.tokenize.pyicu for readability
- Add Thai NER model to pythainlp.ner
- thai2vec v0.2 - larger vocab, benchmarking results on Wongnai dataset
- Sentiment classifier based on ULMFit and various product review datasets
- Add ULMFit utility to PyThaiNLP
- Add Thai romanization model ThaiTransliterator
- Retrain POS-tagging model
- Improved word_tokenize (newmm, mm) and dict_word_tokenize
- Documentation added
PyThaiNLP includes Thai word tokenizers, transliterators, soundex converters, part-of-speech taggers, and spell checkers.

## What's new in version 1.8?

- New NorvigSpellChecker spell checker class, which can be initialized with a custom dictionary (see the sketch below).
- Terminate Python 2 support. Remove all Python 2 compatibility code.
- Remove old, obsolete, deprecated, and experimental code.
- See the [PyThaiNLP 1.8 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).
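
A minimal usage sketch of the new class, assuming it is importable as `pythainlp.spell.NorvigSpellChecker` and that the custom dictionary is a word-frequency mapping (the exact import path and constructor signature are not shown in this diff):

```python
from pythainlp.spell import NorvigSpellChecker

# default checker, built from the package's bundled word counts
checker = NorvigSpellChecker()
print(checker.correct("เหลน"))  # most likely correction
print(checker.spell("เหลน"))    # ranked list of candidate corrections

# hypothetical custom dictionary: word -> frequency
custom_counts = {"แมว": 100, "แมลง": 40}
custom_checker = NorvigSpellChecker(custom_dict=custom_counts)
```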

## Install

35 changes: 27 additions & 8 deletions README.md
@@ -12,9 +12,9 @@ Thai Natural Language Processing in Python.

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to `nltk` but with a focus on the Thai language.

PyThaiNLP supports Python 3.4+.
Since version 1.7, PyThaiNLP deprecates its support for Python 2. The future PyThaiNLP 1.8 will completely drop all supports for Python 2.
Python 2 users can still use PyThaiNLP 1.6.
PyThaiNLP 1.8 supports Python 3.6+. Some functions may work with older versions of Python 3, but they are not well tested and will not be supported. See the [PyThaiNLP 1.8 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).

Python 2 users can use PyThaiNLP 1.6, our latest release that was tested with Python 2.7.

**This is the documentation for the development branch (post 1.7.x). Things will break. For the stable branch's documentation, see [master](https://github.com/PyThaiNLP/pythainlp/tree/master).**

@@ -34,21 +34,40 @@

## Installation

**Using pip**
PyThaiNLP uses PyPI as its main distribution channel; see https://pypi.org/project/pythainlp/

### Stable release

Stable release
Standard installation:

```sh
$ pip install pythainlp
```

Development release
For some advanced functionality, such as word vectors, extra packages may be needed. Install them by specifying extras during pip install:

```sh
$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
$ pip install pythainlp[extra1,extra2,...]
```

Note: PyTorch is required for the ULMFit sentiment analyser; ```pip install torch``` is needed for that feature. The gensim and keras packages may also be needed for other modules that rely on these machine learning libraries.
where ```extras``` can be:
- ```artagger``` (to support artagger part-of-speech tagger)
- ```deepcut``` (to support deepcut machine-learnt tokenizer)
- ```icu``` (for ICU support in transliteration and tokenization)
- ```ipa``` (for International Phonetic Alphabet support in transliteration)
- ```ml``` (to support ULMFit models, like the one for the sentiment analyser)
- ```ner``` (for the named-entity recognizer)
- ```thai2rom``` (for machine-learnt romanization)
- ```thai2vec``` (for Thai word vector)
- ```full``` (install everything)

See ```extras``` and ```extras_require``` in [```setup.py```](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py) for details.
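
For example, a single install that pulls in ICU support and the named-entity recognizer (extras names as listed above) might look like:

```sh
$ pip install pythainlp[icu,ner]
```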

Development release:

```sh
$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
```

## Documentation

2 changes: 1 addition & 1 deletion appveyor.yml
@@ -32,7 +32,7 @@ install:
# - "set ICU_VERSION=62"
- "%PYTHON%/python.exe -m pip install --upgrade pip"
- "%PYTHON%/python.exe -m pip install %PYICU_WHEEL%"
- "%PYTHON%/python.exe -m pip install -e .[icu,ner,pos,tokenize,transliterate]"
- "%PYTHON%/python.exe -m pip install -e .[icu,ipa,ner,thai2vec]"

test_script:
- "%PYTHON%/python.exe -m pip --version"
6 changes: 3 additions & 3 deletions pythainlp/number/wordtonum.py
@@ -40,11 +40,11 @@
 
 
 def _thaiword_to_num(tokens):
-    len_tokens = len(tokens)
-
-    if len_tokens == 0:
+    if not tokens:
         return None
 
+    len_tokens = len(tokens)
+
     if len_tokens == 1:
         return _THAI_INT_MAP[tokens[0]]
 
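With the reordered guard above, an empty token list now returns None before any length computation. A quick sketch of the intended behavior, assuming a public wrapper `thaiword_to_num` in `pythainlp.number` that tokenizes Thai number words and delegates to this private helper (the wrapper is not shown in this diff):

```python
from pythainlp.number import thaiword_to_num

print(thaiword_to_num("ห้าสิบสอง"))  # "fifty-two" in Thai; expected: 52
```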
14 changes: 9 additions & 5 deletions pythainlp/sentiment/ulmfit_sent.py
@@ -15,6 +15,8 @@
 
 # from fastai.text import multiBatchRNN
 
+__all__ = ["about", "get_sentiment"]
+
 MODEL_NAME = "sent_model"
 ITOS_NAME = "itos_sent"
 
@@ -29,24 +31,26 @@ def get_path(fname):
 
 
 # load model
-model = torch.load(get_path(MODEL_NAME))
-model.eval()
+MODEL = torch.load(get_path(MODEL_NAME))
+MODEL.eval()
 
 # load itos and stoi
 itos = pickle.load(open(get_path(ITOS_NAME), "rb"))
 stoi = defaultdict(lambda: 0, {v: k for k, v in enumerate(itos)})
 
 
 # get sentiment; 1 for positive and 0 for negative
 # or score if specified return_score=True
-softmax = lambda x: np.exp(x) / np.sum(np.exp(x))
+def softmax(x):
+    return np.exp(x) / np.sum(np.exp(x))
+
 
 def get_sentiment(text, return_score=False):
     words = word_tokenize(text)
     tensor = LongTensor([stoi[word] for word in words]).view(-1, 1).cpu()
     tensor = Variable(tensor, volatile=False)
-    model.reset()
-    pred, *_ = model(tensor)
+    MODEL.reset()
+    pred, *_ = MODEL(tensor)
     result = pred.data.cpu().numpy().reshape(-1)
 
     if return_score:
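A usage sketch for the renamed module-level model, assuming PyTorch plus the pretrained `sent_model` and `itos_sent` files are available (this diff does not show how they are downloaded):

```python
from pythainlp.sentiment.ulmfit_sent import get_sentiment

# 1 for positive, 0 for negative, per the comment in this module
print(get_sentiment("อาหารอร่อยมาก"))  # "the food is delicious" -- expect 1

# raw scores instead of a hard label
print(get_sentiment("อาหารอร่อยมาก", return_score=True))
```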
21 changes: 15 additions & 6 deletions pythainlp/tag/__init__.py
@@ -20,21 +20,30 @@ def pos_tag(words, engine="unigram", corpus="orchid"):
     * pud - Parallel Universal Dependencies (PUD) treebanks
     :return: returns a list of labels regarding which part of speech it is
     """
+    if not words:
+        return []
+
     if engine == "perceptron":
-        from .perceptron import tag as _tag
+        from .perceptron import tag as tag_
     elif engine == "artagger":
 
-        def _tag(text, corpus=None):
+        def tag_(words, corpus=None):
+            if not words:
+                return []
+
             from artagger import Tagger
-            words = Tagger().tag(" ".join(text))
+            words_ = Tagger().tag(" ".join(words))
 
-            return [(word.word, word.tag) for word in words]
+            return [(word.word, word.tag) for word in words_]
 
     else:  # default, use "unigram" ("old") engine
-        from .unigram import tag as _tag
+        from .unigram import tag as tag_
 
-    return _tag(words, corpus=corpus)
+    return tag_(words, corpus=corpus)
 
 
 def pos_tag_sents(sentences, engine="unigram", corpus="orchid"):
+    if not sentences:
+        return []
+
     return [pos_tag(sent, engine=engine, corpus=corpus) for sent in sentences]
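
With the new guards, `pos_tag` and `pos_tag_sents` return an empty list for empty input instead of touching an engine. A short usage sketch, assuming the default corpora are available:

```python
from pythainlp.tag import pos_tag, pos_tag_sents

print(pos_tag([]))  # [] -- the new empty-input guard, no engine import needed
print(pos_tag(["ผม", "รัก", "คุณ"], engine="unigram", corpus="orchid"))
print(pos_tag_sents([["ผม", "รัก", "คุณ"], ["แมว", "กิน", "ปลา"]]))
```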
27 changes: 16 additions & 11 deletions pythainlp/tag/perceptron.py
@@ -7,28 +7,33 @@
 import dill
 from pythainlp.corpus import CORPUS_PATH
 
-def orchid_data():
-    data_filename = os.path.join(CORPUS_PATH, "orchid_pt_tagger.dill")
+_ORCHID_DATA_FILENAME = "orchid_pt_tagger.dill"
+_PUD_DATA_FILENAME = "ud_thai_pud_pt_tagger.dill"
+
+
+def _load_tagger(filename):
+    data_filename = os.path.join(CORPUS_PATH, filename)
     with open(data_filename, "rb") as fh:
         model = dill.load(fh)
     return model
 
 
-def pud_data():
-    data_filename = os.path.join(CORPUS_PATH, "ud_thai_pud_pt_tagger.dill")
-    with open(data_filename, "rb") as fh:
-        model = dill.load(fh)
-    return model
+_ORCHID_TAGGER = _load_tagger(_ORCHID_DATA_FILENAME)
+_PUD_TAGGER = _load_tagger(_PUD_DATA_FILENAME)
 
 
-def tag(text, corpus="pud"):
+def tag(words, corpus="pud"):
     """
     Accepts a ''list''; returns a ''list'' of pairs like [('word', 'POS tag'), ('word', 'POS tag'), ...]
     """
+    if not words:
+        return []
+
+    words = [word.strip() for word in words if word.strip()]
+
     if corpus == "orchid":
-        tagger = orchid_data()
+        tagger = _ORCHID_TAGGER
     else:  # default, use "pud" as a corpus
-        tagger = pud_data()
+        tagger = _PUD_TAGGER
 
-    return tagger.tag(text)
+    return tagger.tag(words)
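
Loading both `.dill` taggers once at import time, instead of on every `tag()` call, trades a one-time startup cost for cheap repeated calls. A direct-use sketch, assuming both tagger files are present in `CORPUS_PATH`:

```python
from pythainlp.tag.perceptron import tag

print(tag([], corpus="pud"))  # [] -- empty input short-circuits before model use
print(tag(["ผม", "รัก", "คุณ"], corpus="orchid"))
```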
17 changes: 10 additions & 7 deletions pythainlp/tag/unigram.py
@@ -15,26 +15,29 @@
 _THAI_POS_PUD_PATH = os.path.join(CORPUS_PATH, _THAI_POS_PUD_FILENAME)
 
 
-def orchid_data():
+def _orchid_tagger():
     with open(_THAI_POS_ORCHID_PATH, encoding="utf-8-sig") as f:
         model = json.load(f)
     return model
 
 
-def pud_data():
+def _pud_tagger():
     with open(_THAI_POS_PUD_PATH, "rb") as handle:
         model = dill.load(handle)
     return model
 
 
-def tag(text, corpus):
+def tag(words, corpus):
     """
     Accepts a ''list''; returns a ''list'' of pairs like [('word', 'POS tag'), ('word', 'POS tag'), ...]
     """
+    if not words:
+        return []
+
     if corpus == "orchid":
-        tagger = nltk.tag.UnigramTagger(model=orchid_data())
-        return tagger.tag(text)
+        tagger = nltk.tag.UnigramTagger(model=_orchid_tagger())
+        return tagger.tag(words)
 
     # default, use "pud" as a corpus
-    tagger = pud_data()
-    return tagger.tag(text)
+    tagger = _pud_tagger()
+    return tagger.tag(words)