
More test cases - reached 80% coverage #156


Merged: 20 commits, Nov 10, 2018
8 changes: 7 additions & 1 deletion .travis.yml
@@ -4,10 +4,16 @@
language: python
python:
- "3.6"

# workaround to make boto work on travis
# from https://github.com/travis-ci/travis-ci/issues/7940
before_install:
- sudo rm -f /etc/boto.cfg

# command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
install:
- pip install -r requirements.txt
- pip install .[icu,ner,pos,tokenize,transliterate]
- pip install .[icu,ipa,ner,thai2vec]
- pip install coveralls

os:
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -23,8 +23,8 @@ We use the famous [gitflow](http://nvie.com/posts/a-successful-git-branching-model/)
- Write tests for your new features (please see "Tests" topic below);
- Always remember that [commented code is dead
code](http://www.codinghorror.com/blog/2008/07/coding-without-comments.html);
- Name identifiers (variables, classes, functions, module names) with readable
names (`x` is always wrong);
- Name identifiers (variables, classes, functions, module names) with meaningful
and pronounceable names (`x` is always wrong);
- When manipulating strings, use [Python's new-style
formatting](http://docs.python.org/library/string.html#format-string-syntax)
(`'{} = {}'.format(a, b)` instead of `'%s = %s' % (a, b)`);
@@ -55,7 +55,7 @@ Happy hacking! (;
## newmm (onecut), mm, TCC, and Thai Soundex Code
- Korakot Chaovavanich

## Thai2Vec & ulmfit
## Thai2Vec & ULMFiT
- Charin Polpanumas

## Docs
22 changes: 8 additions & 14 deletions README-pypi.md
@@ -10,20 +10,14 @@

PyThaiNLP is a Python library for natural language processing (NLP) of the Thai language.

PyThaiNLP features include Thai word and subword segmentations, soundex, romanization, part-of-speech taggers, and spelling corrections.

## What's new in version 1.7 ?

- Deprecate Python 2 support. (Python 2 compatibility code will be completely dropped in PyThaiNLP 1.8)
- Refactor pythainlp.tokenize.pyicu for readability
- Add Thai NER model to pythainlp.ner
- thai2vec v0.2 - larger vocab, benchmarking results on Wongnai dataset
- Sentiment classifier based on ULMFit and various product review datasets
- Add ULMFit utility to PyThaiNLP
- Add Thai romanization model ThaiTransliterator
- Retrain POS-tagging model
- Improved word_tokenize (newmm, mm) and dict_word_tokenize
- Documentation added
PyThaiNLP includes Thai word tokenizers, transliterators, soundex converters, part-of-speech taggers, and spell checkers.

## What's new in version 1.8?

- New NorvigSpellChecker spell checker class, which can be initialized with a custom dictionary (see the sketch below).
- Terminate Python 2 support. Remove all Python 2 compatibility code.
- Remove old, obsolete, deprecated, and experimental code.
- See the [PyThaiNLP 1.8 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).
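
A minimal usage sketch of the new class, assuming it is importable as `pythainlp.spell.NorvigSpellChecker` and that the custom dictionary is a word-frequency mapping (the exact import path and constructor signature are not shown in this diff):

```python
from pythainlp.spell import NorvigSpellChecker

# default checker, built from the package's bundled word counts
checker = NorvigSpellChecker()
print(checker.correct("เหลน"))  # most likely correction
print(checker.spell("เหลน"))    # ranked list of candidate corrections

# hypothetical custom dictionary: word -> frequency
custom_counts = {"แมว": 100, "แมลง": 40}
custom_checker = NorvigSpellChecker(custom_dict=custom_counts)
```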

## Install

35 changes: 27 additions & 8 deletions README.md
@@ -12,9 +12,9 @@ Thai Natural Language Processing in Python.

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to `nltk` but with a focus on the Thai language.

PyThaiNLP supports Python 3.4+.
Since version 1.7, PyThaiNLP deprecates its support for Python 2. The future PyThaiNLP 1.8 will completely drop all supports for Python 2.
Python 2 users can still use PyThaiNLP 1.6.
PyThaiNLP 1.8 supports Python 3.6+. Some functions may work with older versions of Python 3, but they are not well tested and will not be supported. See the [PyThaiNLP 1.8 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).

Python 2 users can use PyThaiNLP 1.6, our latest release that was tested with Python 2.7.

**This is the documentation for the development branch (post 1.7.x). Things will break. For the stable branch's documentation, see [master](https://github.com/PyThaiNLP/pythainlp/tree/master).**

@@ -34,21 +34,40 @@

## Installation

**Using pip**
PyThaiNLP uses PyPI as its main distribution channel; see https://pypi.org/project/pythainlp/

### Stable release

Stable release
Standard installation:

```sh
$ pip install pythainlp
```

Development release
For some advanced functionality, such as word vectors, extra packages may be needed. Install them by specifying extras during pip install:

```sh
$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
$ pip install pythainlp[extra1,extra2,...]
```

Note: PyTorch is required for the ULMFit sentiment analyser; ```pip install torch``` is needed for that feature. The gensim and keras packages may also be needed for other modules that rely on these machine learning libraries.
where ```extras``` can be:
- ```artagger``` (to support artagger part-of-speech tagger)
- ```deepcut``` (to support deepcut machine-learnt tokenizer)
- ```icu``` (for ICU support in transliteration and tokenization)
- ```ipa``` (for International Phonetic Alphabet support in transliteration)
- ```ml``` (to support ULMFit models, like the one for the sentiment analyser)
- ```ner``` (for the named-entity recognizer)
- ```thai2rom``` (for machine-learnt romanization)
- ```thai2vec``` (for Thai word vector)
- ```full``` (install everything)

See ```extras``` and ```extras_require``` in [```setup.py```](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py) for details.
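
For example, a single install that pulls in ICU support and the named-entity recognizer (extras names as listed above) might look like:

```sh
$ pip install pythainlp[icu,ner]
```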

Development release:

```sh
$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
```

## Documentation

2 changes: 1 addition & 1 deletion appveyor.yml
@@ -32,7 +32,7 @@ install:
# - "set ICU_VERSION=62"
- "%PYTHON%/python.exe -m pip install --upgrade pip"
- "%PYTHON%/python.exe -m pip install %PYICU_WHEEL%"
- "%PYTHON%/python.exe -m pip install -e .[icu,ner,pos,tokenize,transliterate]"
- "%PYTHON%/python.exe -m pip install -e .[icu,ipa,ner,thai2vec]"

test_script:
- "%PYTHON%/python.exe -m pip --version"
6 changes: 3 additions & 3 deletions pythainlp/number/wordtonum.py
@@ -40,11 +40,11 @@
 
 
 def _thaiword_to_num(tokens):
-    len_tokens = len(tokens)
-
-    if len_tokens == 0:
+    if not tokens:
         return None
 
+    len_tokens = len(tokens)
+
     if len_tokens == 1:
         return _THAI_INT_MAP[tokens[0]]
 
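With the reordered guard above, an empty token list now returns None before any length computation. A quick sketch of the intended behavior, assuming a public wrapper `thaiword_to_num` in `pythainlp.number` that tokenizes Thai number words and delegates to this private helper (the wrapper is not shown in this diff):

```python
from pythainlp.number import thaiword_to_num

print(thaiword_to_num("ห้าสิบสอง"))  # "fifty-two" in Thai; expected: 52
```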
14 changes: 9 additions & 5 deletions pythainlp/sentiment/ulmfit_sent.py
@@ -15,6 +15,8 @@
 
 # from fastai.text import multiBatchRNN
 
+__all__ = ["about", "get_sentiment"]
+
 MODEL_NAME = "sent_model"
 ITOS_NAME = "itos_sent"
 
@@ -29,24 +31,26 @@ def get_path(fname):
 
 
 # load model
-model = torch.load(get_path(MODEL_NAME))
-model.eval()
+MODEL = torch.load(get_path(MODEL_NAME))
+MODEL.eval()
 
 # load itos and stoi
 itos = pickle.load(open(get_path(ITOS_NAME), "rb"))
 stoi = defaultdict(lambda: 0, {v: k for k, v in enumerate(itos)})
 
 
 # get sentiment; 1 for positive and 0 for negative
 # or score if specified return_score=True
-softmax = lambda x: np.exp(x) / np.sum(np.exp(x))
+def softmax(x):
+    return np.exp(x) / np.sum(np.exp(x))
+
 
 def get_sentiment(text, return_score=False):
     words = word_tokenize(text)
     tensor = LongTensor([stoi[word] for word in words]).view(-1, 1).cpu()
     tensor = Variable(tensor, volatile=False)
-    model.reset()
-    pred, *_ = model(tensor)
+    MODEL.reset()
+    pred, *_ = MODEL(tensor)
     result = pred.data.cpu().numpy().reshape(-1)
 
     if return_score:
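A usage sketch for the renamed module-level model, assuming PyTorch plus the pretrained `sent_model` and `itos_sent` files are available (this diff does not show how they are downloaded):

```python
from pythainlp.sentiment.ulmfit_sent import get_sentiment

# 1 for positive, 0 for negative, per the comment in this module
print(get_sentiment("อาหารอร่อยมาก"))  # "the food is delicious" -- expect 1

# raw scores instead of a hard label
print(get_sentiment("อาหารอร่อยมาก", return_score=True))
```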
21 changes: 15 additions & 6 deletions pythainlp/tag/__init__.py
@@ -20,21 +20,30 @@ def pos_tag(words, engine="unigram", corpus="orchid"):
     * pud - Parallel Universal Dependencies (PUD) treebanks
     :return: returns a list of labels regarding which part of speech it is
     """
+    if not words:
+        return []
+
     if engine == "perceptron":
-        from .perceptron import tag as _tag
+        from .perceptron import tag as tag_
     elif engine == "artagger":
 
-        def _tag(text, corpus=None):
+        def tag_(words, corpus=None):
+            if not words:
+                return []
+
             from artagger import Tagger
-            words = Tagger().tag(" ".join(text))
+            words_ = Tagger().tag(" ".join(words))
 
-            return [(word.word, word.tag) for word in words]
+            return [(word.word, word.tag) for word in words_]
 
     else:  # default, use "unigram" ("old") engine
-        from .unigram import tag as _tag
+        from .unigram import tag as tag_
 
-    return _tag(words, corpus=corpus)
+    return tag_(words, corpus=corpus)
 
 
 def pos_tag_sents(sentences, engine="unigram", corpus="orchid"):
+    if not sentences:
+        return []
+
     return [pos_tag(sent, engine=engine, corpus=corpus) for sent in sentences]
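
With the new guards, `pos_tag` and `pos_tag_sents` return an empty list for empty input instead of touching an engine. A short usage sketch, assuming the default corpora are available:

```python
from pythainlp.tag import pos_tag, pos_tag_sents

print(pos_tag([]))  # [] -- the new empty-input guard, no engine import needed
print(pos_tag(["ผม", "รัก", "คุณ"], engine="unigram", corpus="orchid"))
print(pos_tag_sents([["ผม", "รัก", "คุณ"], ["แมว", "กิน", "ปลา"]]))
```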
27 changes: 16 additions & 11 deletions pythainlp/tag/perceptron.py
@@ -7,28 +7,33 @@
 import dill
 from pythainlp.corpus import CORPUS_PATH
 
-def orchid_data():
-    data_filename = os.path.join(CORPUS_PATH, "orchid_pt_tagger.dill")
+_ORCHID_DATA_FILENAME = "orchid_pt_tagger.dill"
+_PUD_DATA_FILENAME = "ud_thai_pud_pt_tagger.dill"
+
+
+def _load_tagger(filename):
+    data_filename = os.path.join(CORPUS_PATH, filename)
     with open(data_filename, "rb") as fh:
         model = dill.load(fh)
     return model
 
 
-def pud_data():
-    data_filename = os.path.join(CORPUS_PATH, "ud_thai_pud_pt_tagger.dill")
-    with open(data_filename, "rb") as fh:
-        model = dill.load(fh)
-    return model
+_ORCHID_TAGGER = _load_tagger(_ORCHID_DATA_FILENAME)
+_PUD_TAGGER = _load_tagger(_PUD_DATA_FILENAME)
 
 
-def tag(text, corpus="pud"):
+def tag(words, corpus="pud"):
     """
     Accepts a ''list''; returns a ''list'' of pairs like [('word', 'POS tag'), ('word', 'POS tag'), ...]
     """
+    if not words:
+        return []
+
+    words = [word.strip() for word in words if word.strip()]
+
     if corpus == "orchid":
-        tagger = orchid_data()
+        tagger = _ORCHID_TAGGER
     else:  # default, use "pud" as a corpus
-        tagger = pud_data()
+        tagger = _PUD_TAGGER
 
-    return tagger.tag(text)
+    return tagger.tag(words)
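
Loading both `.dill` taggers once at import time, instead of on every `tag()` call, trades a one-time startup cost for cheap repeated calls. A direct-use sketch, assuming both tagger files are present in `CORPUS_PATH`:

```python
from pythainlp.tag.perceptron import tag

print(tag([], corpus="pud"))  # [] -- empty input short-circuits before model use
print(tag(["ผม", "รัก", "คุณ"], corpus="orchid"))
```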
17 changes: 10 additions & 7 deletions pythainlp/tag/unigram.py
@@ -15,26 +15,29 @@
 _THAI_POS_PUD_PATH = os.path.join(CORPUS_PATH, _THAI_POS_PUD_FILENAME)
 
 
-def orchid_data():
+def _orchid_tagger():
     with open(_THAI_POS_ORCHID_PATH, encoding="utf-8-sig") as f:
         model = json.load(f)
     return model
 
 
-def pud_data():
+def _pud_tagger():
     with open(_THAI_POS_PUD_PATH, "rb") as handle:
         model = dill.load(handle)
     return model
 
 
-def tag(text, corpus):
+def tag(words, corpus):
     """
     Accepts a ''list''; returns a ''list'' of pairs like [('word', 'POS tag'), ('word', 'POS tag'), ...]
     """
+    if not words:
+        return []
+
     if corpus == "orchid":
-        tagger = nltk.tag.UnigramTagger(model=orchid_data())
-        return tagger.tag(text)
+        tagger = nltk.tag.UnigramTagger(model=_orchid_tagger())
+        return tagger.tag(words)
 
     # default, use "pud" as a corpus
-    tagger = pud_data()
-    return tagger.tag(text)
+    tagger = _pud_tagger()
+    return tagger.tag(words)