Commit fed1ceb

Merge pull request #156 from bact/dev
More test cases - reached 80% coverage
2 parents 694dbf7 + af83c4d

23 files changed: +400, -111 lines

.travis.yml

+7-1
@@ -4,10 +4,16 @@
 language: python
 python:
 - "3.6"
+
+# workaround to make boto work on travis
+# from https://github.com/travis-ci/travis-ci/issues/7940
+before_install:
+- sudo rm -f /etc/boto.cfg
+
 # command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
 install:
 - pip install -r requirements.txt
-- pip install .[icu,ner,pos,tokenize,transliterate]
+- pip install .[icu,ipa,ner,thai2vec]
 - pip install coveralls
 
 os:

CONTRIBUTING.md

+3-3
@@ -23,8 +23,8 @@ We use the famous [gitflow](http://nvie.com/posts/a-successful-git-branching-mod
 - Write tests for your new features (please see "Tests" topic below);
 - Always remember that [commented code is dead
 code](http://www.codinghorror.com/blog/2008/07/coding-without-comments.html);
-- Name identifiers (variables, classes, functions, module names) with readable
-names (`x` is always wrong);
+- Name identifiers (variables, classes, functions, module names) with meaningful
+and pronounceable names (`x` is always wrong);
 - When manipulating strings, use [Python's new-style
 formatting](http://docs.python.org/library/string.html#format-string-syntax)
 (`'{} = {}'.format(a, b)` instead of `'%s = %s' % (a, b)`);
@@ -55,7 +55,7 @@ Happy hacking! (;
 ## newmm (onecut), mm, TCC, and Thai Soundex Code
 - Korakot Chaovavanich
 
-## Thai2Vec & ulmfit
+## Thai2Vec & ULMFiT
 - Charin Polpanumas
 
 ## Docs

README-pypi.md

+8-14
@@ -10,20 +10,14 @@
 
 PyThaiNLP is a Python library for natural language processing (NLP) of Thai language.
 
-PyThaiNLP features include Thai word and subword segmentations, soundex, romanization, part-of-speech taggers, and spelling corrections.
-
-## What's new in version 1.7 ?
-
-- Deprecate Python 2 support. (Python 2 compatibility code will be completely dropped in PyThaiNLP 1.8)
-- Refactor pythainlp.tokenize.pyicu for readability
-- Add Thai NER model to pythainlp.ner
-- thai2vec v0.2 - larger vocab, benchmarking results on Wongnai dataset
-- Sentiment classifier based on ULMFit and various product review datasets
-- Add ULMFit utility to PyThaiNLP
-- Add Thai romanization model ThaiTransliterator
-- Retrain POS-tagging model
-- Improved word_tokenize (newmm, mm) and dict_word_tokenize
-- Documentation added
+PyThaiNLP includes Thai word tokenizers, transliterators, soundex converters, part-of-speech taggers, and spell checkers.
+
+## What's new in version 1.8?
+
+- New NorvigSpellChecker spell checker class, which can be initialized with a custom dictionary.
+- Terminate Python 2 support. Remove all Python 2 compatibility code.
+- Remove old, obsolete, deprecated, and experimental code.
+- See [PyThaiNLP 1.8 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).
 
 ## Install
 

README.md

+27-8
@@ -12,9 +12,9 @@ Thai Natural Language Processing in Python.
 
 PyThaiNLP is a Python package for text processing and linguistic analysis, similar to `nltk` but with focus on Thai language.
 
-PyThaiNLP supports Python 3.4+.
-Since version 1.7, PyThaiNLP deprecates its support for Python 2. The future PyThaiNLP 1.8 will completely drop all supports for Python 2.
-Python 2 users can still use PyThaiNLP 1.6.
+PyThaiNLP 1.8 supports Python 3.6+. Some functions may work with older versions of Python 3, but they are not well tested and will not be supported. See [PyThaiNLP 1.8 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).
+
+Python 2 users can use PyThaiNLP 1.6, the latest release tested with Python 2.7.
 
 **This is a document for development branch (post 1.7.x). Things will break. For a document for stable branch, see [master](https://github.com/PyThaiNLP/pythainlp/tree/master).**
 
@@ -34,21 +34,40 @@ Python 2 users can still use PyThaiNLP 1.6.
 
 ## Installation
 
-**Using pip**
+PyThaiNLP uses PyPI as its main distribution channel, see https://pypi.org/project/pythainlp/
+
+### Stable release
 
-Stable release
+Standard installation:
 
 ```sh
 $ pip install pythainlp
 ```
 
-Development release
+For some advanced functionalities, like word vectors, extra packages may be needed. Install them with these options during pip install:
 
 ```sh
-$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
+$ pip install pythainlp[extra1,extra2,...]
 ```
 
-Note: PyTorch is required for ulmfit sentiment analyser. ```pip install torch``` is needed for the feature. gensim and keras packages may also needed for other modules that rely on these machine learning libraries.
+where ```extras``` can be
+- ```artagger``` (to support artagger part-of-speech tagger)
+- ```deepcut``` (to support deepcut machine-learnt tokenizer)
+- ```icu``` (for ICU support in transliteration and tokenization)
+- ```ipa``` (for International Phonetic Alphabet support in transliteration)
+- ```ml``` (to support ULMFiT models, like one for sentiment analysis)
+- ```ner``` (for named-entity recognizer)
+- ```thai2rom``` (for machine-learnt romanization)
+- ```thai2vec``` (for Thai word vectors)
+- ```full``` (install everything)
+
+See ```extras``` and ```extras_require``` in [```setup.py```](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py) for details.
+
+Development release:
+
+```sh
+$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
+```
 
 ## Documentation
 

appveyor.yml

+1-1
@@ -32,7 +32,7 @@ install:
 # - "set ICU_VERSION=62"
 - "%PYTHON%/python.exe -m pip install --upgrade pip"
 - "%PYTHON%/python.exe -m pip install %PYICU_WHEEL%"
-- "%PYTHON%/python.exe -m pip install -e .[icu,ner,pos,tokenize,transliterate]"
+- "%PYTHON%/python.exe -m pip install -e .[icu,ipa,ner,thai2vec]"
 
 test_script:
 - "%PYTHON%/python.exe -m pip --version"

pythainlp/number/wordtonum.py

+3-3
@@ -40,11 +40,11 @@
 
 
 def _thaiword_to_num(tokens):
-    len_tokens = len(tokens)
-
-    if len_tokens == 0:
+    if not tokens:
         return None
 
+    len_tokens = len(tokens)
+
     if len_tokens == 1:
         return _THAI_INT_MAP[tokens[0]]
 

pythainlp/sentiment/ulmfit_sent.py

+9-5
@@ -15,6 +15,8 @@
 
 # from fastai.text import multiBatchRNN
 
+__all__ = ["about", "get_sentiment"]
+
 MODEL_NAME = "sent_model"
 ITOS_NAME = "itos_sent"
 
@@ -29,24 +31,26 @@ def get_path(fname):
 
 
 # load model
-model = torch.load(get_path(MODEL_NAME))
-model.eval()
+MODEL = torch.load(get_path(MODEL_NAME))
+MODEL.eval()
 
 # load itos and stoi
 itos = pickle.load(open(get_path(ITOS_NAME), "rb"))
 stoi = defaultdict(lambda: 0, {v: k for k, v in enumerate(itos)})
 
+
 # get sentiment; 1 for positive and 0 for negative
 # or score if specified return_score=True
-softmax = lambda x: np.exp(x) / np.sum(np.exp(x))
+def softmax(x):
+    return np.exp(x) / np.sum(np.exp(x))
 
 
 def get_sentiment(text, return_score=False):
     words = word_tokenize(text)
     tensor = LongTensor([stoi[word] for word in words]).view(-1, 1).cpu()
     tensor = Variable(tensor, volatile=False)
-    model.reset()
-    pred, *_ = model(tensor)
+    MODEL.reset()
+    pred, *_ = MODEL(tensor)
     result = pred.data.cpu().numpy().reshape(-1)
 
     if return_score:
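The `softmax` change above replaces a lambda bound to a name with a named `def`, the form PEP 8 recommends, and one that is easier to test in isolation. The computation itself is unchanged:

```python
import numpy as np


def softmax(x):
    # Same computation as the replaced lambda: exponentiate, then normalize.
    return np.exp(x) / np.sum(np.exp(x))


probs = softmax(np.array([1.0, 2.0, 3.0]))
# probs is a valid probability distribution: non-negative, sums to 1,
# and preserves the ordering of the input scores.
```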

pythainlp/tag/__init__.py

+15-6
@@ -20,21 +20,30 @@ def pos_tag(words, engine="unigram", corpus="orchid"):
     * pud - Parallel Universal Dependencies (PUD) treebanks
     :return: returns a list of labels regarding which part of speech it is
     """
+    if not words:
+        return []
+
     if engine == "perceptron":
-        from .perceptron import tag as _tag
+        from .perceptron import tag as tag_
     elif engine == "artagger":
 
-        def _tag(text, corpus=None):
+        def tag_(words, corpus=None):
+            if not words:
+                return []
+
             from artagger import Tagger
-            words = Tagger().tag(" ".join(text))
+            words_ = Tagger().tag(" ".join(words))
 
-            return [(word.word, word.tag) for word in words]
+            return [(word.word, word.tag) for word in words_]
 
     else:  # default, use "unigram" ("old") engine
-        from .unigram import tag as _tag
+        from .unigram import tag as tag_
 
-    return _tag(words, corpus=corpus)
+    return tag_(words, corpus=corpus)
 
 
 def pos_tag_sents(sentences, engine="unigram", corpus="orchid"):
+    if not sentences:
+        return []
+
     return [pos_tag(sent, engine=engine, corpus=corpus) for sent in sentences]
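The early-return guards added to `pos_tag` and `pos_tag_sents` can be exercised with a self-contained sketch; a stub tagger stands in for the real unigram/perceptron/artagger engines, which require model files:

```python
def pos_tag(words, engine="unigram", corpus="orchid"):
    # Guard: empty or None input yields an empty result, not an error.
    if not words:
        return []
    # Stub engine for this sketch; the real engines return corpus-trained tags.
    return [(word, "NOUN") for word in words]


def pos_tag_sents(sentences, engine="unigram", corpus="orchid"):
    if not sentences:
        return []
    return [pos_tag(sent, engine=engine, corpus=corpus) for sent in sentences]
```

The guard makes `pos_tag_sents(None)` return `[]` instead of raising `TypeError` when the list comprehension tries to iterate over `None`.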

pythainlp/tag/perceptron.py

+16-11
@@ -7,28 +7,33 @@
 import dill
 from pythainlp.corpus import CORPUS_PATH
 
+_ORCHID_DATA_FILENAME = "orchid_pt_tagger.dill"
+_PUD_DATA_FILENAME = "ud_thai_pud_pt_tagger.dill"
 
-def orchid_data():
-    data_filename = os.path.join(CORPUS_PATH, "orchid_pt_tagger.dill")
+
+def _load_tagger(filename):
+    data_filename = os.path.join(CORPUS_PATH, filename)
     with open(data_filename, "rb") as fh:
         model = dill.load(fh)
     return model
 
 
-def pud_data():
-    data_filename = os.path.join(CORPUS_PATH, "ud_thai_pud_pt_tagger.dill")
-    with open(data_filename, "rb") as fh:
-        model = dill.load(fh)
-    return model
+_ORCHID_TAGGER = _load_tagger(_ORCHID_DATA_FILENAME)
+_PUD_TAGGER = _load_tagger(_PUD_DATA_FILENAME)
 
 
-def tag(text, corpus="pud"):
+def tag(words, corpus="pud"):
     """
     Takes a ``list`` of words; returns a ``list`` of (word, POS tag) pairs, e.g. [('คำ', 'ชนิดคำ'), ('คำ', 'ชนิดคำ'), ...]
     """
+    if not words:
+        return []
+
+    words = [word.strip() for word in words if word.strip()]
+
     if corpus == "orchid":
-        tagger = orchid_data()
+        tagger = _ORCHID_TAGGER
     else:  # default, use "pud" as a corpus
-        tagger = pud_data()
+        tagger = _PUD_TAGGER
 
-    return tagger.tag(text)
+    return tagger.tag(words)
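The refactor above deduplicates `orchid_data`/`pud_data` into one `_load_tagger` and loads each model once at import time, so repeated `tag()` calls no longer re-read the `.dill` files. The same once-per-file behavior can be had lazily with `functools.lru_cache`; a sketch with a hypothetical stand-in loader (the real one calls `dill.load` on a file under `CORPUS_PATH`):

```python
import functools

LOAD_COUNT = {"n": 0}  # instrumentation for the sketch only


@functools.lru_cache(maxsize=None)
def _load_tagger(filename):
    # Stand-in for: dill.load(open(os.path.join(CORPUS_PATH, filename), "rb"))
    LOAD_COUNT["n"] += 1
    return {"source": filename}


first = _load_tagger("orchid_pt_tagger.dill")
second = _load_tagger("orchid_pt_tagger.dill")  # served from cache, no reload
```

Unlike the import-time version, nothing is loaded until the first call, which keeps importing the module cheap when the tagger is never used.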

pythainlp/tag/unigram.py

+10-7
@@ -15,26 +15,29 @@
 _THAI_POS_PUD_PATH = os.path.join(CORPUS_PATH, _THAI_POS_PUD_FILENAME)
 
 
-def orchid_data():
+def _orchid_tagger():
     with open(_THAI_POS_ORCHID_PATH, encoding="utf-8-sig") as f:
         model = json.load(f)
     return model
 
 
-def pud_data():
+def _pud_tagger():
     with open(_THAI_POS_PUD_PATH, "rb") as handle:
         model = dill.load(handle)
     return model
 
 
-def tag(text, corpus):
+def tag(words, corpus):
     """
     Takes a ``list`` of words; returns a ``list`` of (word, POS tag) pairs, e.g. [('คำ', 'ชนิดคำ'), ('คำ', 'ชนิดคำ'), ...]
     """
+    if not words:
+        return []
+
     if corpus == "orchid":
-        tagger = nltk.tag.UnigramTagger(model=orchid_data())
-        return tagger.tag(text)
+        tagger = nltk.tag.UnigramTagger(model=_orchid_tagger())
+        return tagger.tag(words)
 
     # default, use "pud" as a corpus
-    tagger = pud_data()
-    return tagger.tag(text)
+    tagger = _pud_tagger()
+    return tagger.tag(words)
