README.md
+25-35
@@ -13,20 +13,19 @@
13
13
<a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
14
14
</div>
15
15
16
-
PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with a focus on the Thai language.
16
+
PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with a focus on Thai language.
> You can now contact the PyThaiNLP team or ask them any questions. <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via the command-line interface.
39
+
PyThaiNLP provides standard linguistic analysis for the Thai language and standard Thai locale utility functions.
40
+
Some of these functions are also available via the command-line interface (run `thainlp` in your shell).
41
41
42
-
<details>
43
-
<summary>List of Features</summary>
42
+
Partial list of features:
44
43
45
44
- Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`) -- comparable to constants like `string.letters`, `string.digits`, and `string.punctuation`
46
-
- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentations based on Thai Character Cluster (`subword_tokenize`)
47
-
- Thai part-of-speech tagging (`pos_tag`)
48
-
- Thai spelling suggestion and correction (`spell` and `correct`)
49
-
- Thai transliteration (`transliterate`)
50
-
- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
51
-
- Thai collation (sorted by dictionary order) (`collate`)
52
-
- Read out number to Thai words (`bahttext`, `num_to_thaiword`)
53
-
- Thai datetime formatting (`thai_strftime`)
45
+
- Linguistic unit segmentation at different levels: sentence (`sent_tokenize`), word (`word_tokenize`), and subword (`subword_tokenize`)
46
+
- Part-of-speech tagging (`pos_tag`)
47
+
- Spelling suggestion and correction (`spell` and `correct`)
48
+
- Phonetic algorithm and transliteration (`soundex` and `transliterate`)
49
+
- Collation (sorted by dictionary order) (`collate`)
50
+
- Number read out (`num_to_thaiword` and `bahttext`)
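
The character-class constants mentioned in the feature list are described as comparable to `string.digits`, which suggests they can be used like ordinary strings. The sketch below illustrates that idea using only the standard library. `THAI_DIGITS` is a local stand-in built from the Unicode code points U+0E50 to U+0E59, not the package's own constant, and `thai_to_arabic_digits` is a hypothetical helper name.

```python
import string

# Assumption: a local stand-in for a Thai digit constant, built from
# the Unicode range THAI DIGIT ZERO (U+0E50) to THAI DIGIT NINE (U+0E59).
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))


def thai_to_arabic_digits(text: str) -> str:
    """Hypothetical helper: map Thai digits to ASCII digits, leave the rest."""
    return text.translate(str.maketrans(THAI_DIGITS, string.digits))


def count_thai_digits(text: str) -> int:
    """Count characters that belong to the Thai digit class."""
    return sum(1 for ch in text if ch in THAI_DIGITS)


if __name__ == "__main__":
    sample = "ปี " + chr(0x0E52) + chr(0x0E55)  # Thai digits two and five
    print(thai_to_arabic_digits(sample))
    print(count_thai_digits(sample))
```

With the package installed, the same membership tests and translations would presumably work against `pythainlp.thai_digits` in place of the stand-in.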
- `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
82
+
- `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
83
+
- `ml` (to support ULMFiT models for classification)
84
+
- `thai2fit` (for Thai word vector)
85
+
- `thai2rom` (for machine-learnt romanization)
86
+
- `wordnet` (for Thai WordNet API)
94
87
95
88
For dependency details, look at the `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).
96
89
97
-
98
90
## Data Directory
99
91
100
92
- Some additional data, like word lists and language models, may be automatically downloaded during runtime.
101
93
- PyThaiNLP caches these data under the directory `~/pythainlp-data` by default.
102
94
- The data directory can be changed by specifying the environment variable `PYTHAINLP_DATA_DIR`.
103
95
- See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus
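
The lookup order described in the list above (environment variable first, then the default under the home directory) can be sketched as follows. This is a minimal illustration using only the standard library, not the package's actual implementation; `resolve_data_dir` is a hypothetical name, and the optional `env` parameter exists only to make the function easy to test.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Hypothetical helper: choose the data directory.

    Uses PYTHAINLP_DATA_DIR when it is set and non-empty,
    otherwise falls back to ~/pythainlp-data.
    """
    env = os.environ if env is None else env
    override = env.get("PYTHAINLP_DATA_DIR")
    if override:
        # Expand a leading "~" so values like "~/data" work as expected.
        return Path(override).expanduser()
    return Path.home() / "pythainlp-data"


if __name__ == "__main__":
    print(resolve_data_dir())
```

Setting the variable in the shell before starting Python, for example `PYTHAINLP_DATA_DIR=/srv/thai-data`, would make the sketch return that path instead of the default.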
104
96
105
-
106
97
## Command-Line Interface
107
98
108
99
Some of PyThaiNLP's functionality can be used from the command line with the `thainlp` command.
109
100
110
101
For example, to display a catalog of datasets:
102
+
111
103
```sh
112
104
thainlp data catalog
113
105
```
114
106
115
107
To show usage help:
108
+
116
109
```sh
117
110
thainlp help
118
111
```
119
112
120
-
121
113
## Licenses
122
114
123
115
| | License |
@@ -127,7 +119,6 @@ thainlp help
127
119
| Language models created by PyThaiNLP |[Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/)|
128
120
| Other corpora and models that may be included in PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md)|
129
121
130
-
131
122
## Contribute to PyThaiNLP
132
123
133
124
- Please fork and create a pull request :)
@@ -137,7 +128,6 @@ thainlp help
137
128
138
129
You can read [INTHEWILD.md](https://github.com/PyThaiNLP/pythainlp/blob/dev/INTHEWILD.md).
139
130
140
-
141
131
## Citations
142
132
143
133
If you use `PyThaiNLP` in your project or publication, please cite the library as follows: