From af7ad9a872f43c31a27135a8afe476528be6ec25 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 03:37:25 +0530 Subject: [PATCH 01/22] Update benchmarks.rst In the updated documentation for the pythainlp.benchmarks module, several improvements have been introduced to enhance clarity and comprehensibility. The primary objective was to provide a comprehensive introduction to the module, emphasizing its purpose and the services it offers. Notable changes include: Introduction: The documentation now starts with a clear introduction to the pythainlp.benchmarks module, highlighting its role in benchmarking Thai NLP tasks. Users can easily grasp the module's intended use and its focus on evaluating NLP tasks in the Thai language. Tokenization: The "Tokenization" section has been elaborated to stress the importance of word tokenization in NLP and its relevance to various applications. Users are now more informed about the significance of benchmarking tokenization methods and why this module is a valuable resource. Quality Evaluation: An entirely new subsection has been added to introduce the concept of quality evaluation in word tokenization. This section emphasizes the impact of tokenization quality on downstream NLP tasks and the necessity of assessment. A visual representation of the evaluation process has been included for better visualization. Functions: Each benchmarking function, including compute_stats, benchmark, and preprocessing, has been given a brief description. Users can now quickly understand the purpose of each function and how they can be used in practice. Usage: The "Usage" section now encourages users to refer to the official PyThaiNLP documentation for examples and guidelines on utilizing the benchmarking functions. This provides users with clear guidance on how to get started with benchmarking word tokenization in their projects. --- docs/api/benchmarks.rst | 45 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/docs/api/benchmarks.rst b/docs/api/benchmarks.rst index 418e53b6f..69093ca7d 100644 --- a/docs/api/benchmarks.rst +++ b/docs/api/benchmarks.rst @@ -14,11 +14,56 @@ Tokenization Quality ^^^^ +.. figure:: ../images/evaluation.png + :scale: 50 % + +Qualitative evaluation of word tokenization. + +.. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats +.. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark +.. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing + +.. currentmodule:: pythainlp.benchmarks + +pythainlp.benchmarks Module +=========================== + +Introduction +------------ + +The `pythainlp.benchmarks` module is a collection of utility functions designed for benchmarking tasks related to Thai Natural Language Processing (NLP). Currently, the module includes tools for word tokenization benchmarking. Please note that additional benchmarking tasks will be incorporated in the future. + +Tokenization +------------ + +Word tokenization is a fundamental task in NLP, and it plays a crucial role in various applications, such as text analysis and language processing. The `pythainlp.benchmarks` module offers a set of functions to assist in the benchmarking and evaluation of word tokenization methods. + +Quality Evaluation +^^^^^^^^^^^^^^^^^^ + +The quality of word tokenization can significantly impact the accuracy of downstream NLP tasks. To assess the quality of word tokenization, the module provides a qualitative evaluation using various metrics and techniques. + .. figure:: ../images/evaluation.png :scale: 50 % Qualitative evaluation of word tokenization. +Functions +--------- + .. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats + + This function is used to compute various statistics and metrics related to word tokenization. It allows you to assess the performance of different tokenization methods. + .. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark + + The `benchmark` function facilitates the benchmarking of word tokenization methods. It provides an organized framework for evaluating and comparing the effectiveness of different tokenization tools. + .. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing + + Preprocessing is a crucial step in NLP tasks. The `preprocessing` function assists in preparing text data for tokenization, which is essential for accurate and consistent benchmarking. + +Usage +----- + +To make use of these benchmarking functions, you can follow the provided examples and guidelines in the official PyThaiNLP documentation. These tools are invaluable for researchers, developers, and anyone interested in improving and evaluating Thai word tokenization methods. From 2a21760e13a51d9dedc74fe723fd2855a3b2fc6c Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 04:53:41 +0530 Subject: [PATCH 02/22] Update benchmarks.rst In the updated documentation for the pythainlp.benchmarks module, several improvements have been introduced to enhance clarity and comprehensibility. The primary objective was to provide a comprehensive introduction to the module, emphasizing its purpose and the services it offers. Notable changes include: Introduction: The documentation now starts with a clear introduction to the pythainlp.benchmarks module, highlighting its role in benchmarking Thai NLP tasks. Users can easily grasp the module's intended use and its focus on evaluating NLP tasks in the Thai language. Tokenization: The "Tokenization" section has been elaborated to stress the importance of word tokenization in NLP and its relevance to various applications. Users are now more informed about the significance of benchmarking tokenization methods and why this module is a valuable resource. Quality Evaluation: An entirely new subsection has been added to introduce the concept of quality evaluation in word tokenization. This section emphasizes the impact of tokenization quality on downstream NLP tasks and the necessity of assessment. A visual representation of the evaluation process has been included for better visualization. Functions: Each benchmarking function, including compute_stats, benchmark, and preprocessing, has been given a brief description. Users can now quickly understand the purpose of each function and how they can be used in practice. --- docs/api/benchmarks.rst | 25 ------------------------- 1 file changed, 25 deletions(-) diff --git a/docs/api/benchmarks.rst b/docs/api/benchmarks.rst index 69093ca7d..bf9e6047a 100644 --- a/docs/api/benchmarks.rst +++ b/docs/api/benchmarks.rst @@ -2,31 +2,6 @@ pythainlp.benchmarks ==================================== -The :class:`pythainlp.benchmarks` contains utility functions for benchmarking -tasked related to Thai NLP. At the moment, we have only for word tokenization. -Other tasks will be added soon. - -Modules -------- - -Tokenization -********* - -Quality -^^^^ -.. figure:: ../images/evaluation.png - :scale: 50 % - -Qualitative evaluation of word tokenization. - -.. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats -.. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark -.. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing - -.. currentmodule:: pythainlp.benchmarks - -pythainlp.benchmarks Module -=========================== Introduction ------------ From 669230452808c6fcfd5ba415b4910d4785107d41 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 05:06:12 +0530 Subject: [PATCH 03/22] Update augment.rst The enhanced documentation for the pythainlp.augment module brings about several notable improvements. These changes focus on providing users with a more comprehensive understanding of the module and its various components for text augmentation in the Thai language. Here's an overview of the key changes: Introduction: The documentation now starts with a clear introduction, emphasizing the importance of text augmentation in NLP and its specific relevance to the Thai language. This introduction sets the stage for the entire module, making it clear why text augmentation is a crucial task. TextAugment Class: The central TextAugment class is highlighted, and its purpose as the core component of the module is explained. Users can now understand that this class serves as the gateway to various text augmentation techniques. Class Details: Each class within the module, such as WordNetAug, Word2VecAug, FastTextAug, and BPEmbAug, is provided with a detailed description of its purpose and capabilities. This clarity allows users to determine which class is best suited for their specific text augmentation needs. Function Descriptions: The postype2wordnet function's role in mapping part-of-speech tags to WordNet-compatible POS tags is clearly explained, facilitating the integration of WordNet augmentation with Thai text. Users can better understand how to work with this function in their text augmentation tasks. Usage Guidance: The documentation emphasizes that users can refer to the official PyThaiNLP documentation for detailed usage examples and guidelines. This encourages users to explore the module's full potential for enriching and diversifying Thai text data and improving NLP models and applications. These changes make the documentation more informative and accessible, making it easier for researchers, developers, and practitioners to understand how to leverage the pythainlp.augment module effectively. With this enhanced documentation, users can confidently harness the power of text augmentation for Thai language NLP tasks. --- docs/api/augment.rst | 72 +++++++++++++++++++++++++++++++++++--------- 1 file changed, 58 insertions(+), 14 deletions(-) diff --git a/docs/api/augment.rst b/docs/api/augment.rst index 220cc21c8..bff34aed3 100644 --- a/docs/api/augment.rst +++ b/docs/api/augment.rst @@ -1,25 +1,69 @@ .. currentmodule:: pythainlp.augment -pythainlp.augment -================= +pythainlp.augment Module +======================= -The :class:`textaugment` is Thai text augment. This function for text augment task. +Introduction +------------ -Modules -------- +The `pythainlp.augment` module is a powerful toolset for text augmentation in the Thai language. Text augmentation is a process that enriches and diversifies textual data by generating alternative versions of the original text. This module is a valuable resource for improving the quality and variety of Thai language data for NLP tasks. + +TextAugment Class +----------------- + +The central component of the `pythainlp.augment` module is the `TextAugment` class. This class provides various text augmentation techniques and functions to enhance the diversity of your text data. It offers the following methods: + +.. autoclass:: pythainlp.augment.TextAugment + :members: + +WordNetAug Class +---------------- + +The `WordNetAug` class is designed to perform text augmentation using WordNet, a lexical database for English. This class enables you to augment Thai text using English synonyms, offering a unique approach to text diversification. The following methods are available within this class: + +.. autoclass:: pythainlp.augment.WordNetAug + :members: + +Word2VecAug, Thai2fitAug, LTW2VAug Classes +------------------------------------------ + +The `pythainlp.augment.word2vec` package contains multiple classes for text augmentation using Word2Vec models. These classes include `Word2VecAug`, `Thai2fitAug`, and `LTW2VAug`. Each of these classes allows you to use Word2Vec embeddings to generate text variations. Explore the methods provided by these classes to understand their capabilities. -.. autoclass:: WordNetAug - :members: -.. autofunction:: postype2wordnet .. autoclass:: pythainlp.augment.word2vec.Word2VecAug - :members: + :members: + .. autoclass:: pythainlp.augment.word2vec.Thai2fitAug - :members: + :members: + .. autoclass:: pythainlp.augment.word2vec.LTW2VAug - :members: + :members: + +FastTextAug and Thai2transformersAug Classes +-------------------------------------------- + +The `pythainlp.augment.lm` package offers classes for text augmentation using language models. These classes include `FastTextAug` and `Thai2transformersAug`. These classes allow you to use language model-based techniques to diversify text data. Explore their methods to understand their capabilities. + .. autoclass:: pythainlp.augment.lm.FastTextAug - :members: + :members: + .. autoclass:: pythainlp.augment.lm.Thai2transformersAug - :members: + :members: + +BPEmbAug Class +-------------- + +The `pythainlp.augment.word2vec.bpemb_wv` package contains the `BPEmbAug` class, which is designed for text augmentation using subword embeddings. This class is particularly useful when working with subword representations for Thai text augmentation. + .. autoclass:: pythainlp.augment.word2vec.bpemb_wv.BPEmbAug - :members: \ No newline at end of file + :members: + +Additional Functions +------------------- + +To further enhance your text augmentation tasks, the `pythainlp.augment` module offers the following functions: + +- `postype2wordnet`: This function maps part-of-speech tags to WordNet-compatible POS tags, facilitating the integration of WordNet augmentation with Thai text. + +These functions and classes provide diverse techniques for text augmentation in the Thai language, making this module a valuable asset for NLP researchers, developers, and practitioners. + +For detailed usage examples and guidelines, please refer to the official PyThaiNLP documentation. The `pythainlp.augment` module opens up new possibilities for enriching and diversifying Thai text data, leading to improved NLP models and applications. From ba322308e1de0bb6e69130995a9613c425ec452d Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 05:38:54 +0530 Subject: [PATCH 04/22] Update coref.rst MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The updated documentation for the pythainlp.coref module aims to provide a more comprehensive understanding of its purpose and utility for coreference resolution in the Thai language. Here are the key changes and their significance: Introduction: The introduction now explicitly mentions that the module is dedicated to coreference resolution for Thai, clarifying its specific purpose. This addition ensures that users quickly grasp the module's specialization and its role in addressing coreference challenges in Thai text. Coreference Resolution Function: The core of the module, the coreference_resolution function, is introduced and explained in detail. Users are informed about the task it performs – identifying expressions referring to the same entities in text. This clarity is essential for users to understand the central function of the module. Usage: The usage section provides a step-by-step guide on how to use the coreference_resolution function effectively. It includes an example to illustrate the process, making it more user-friendly. This practical guidance empowers users to start using the module immediately in their NLP tasks. Conclusion: The conclusion reiterates the module's significance, emphasizing its role in enhancing NLP systems' understanding of Thai text. It encourages users to explore the official PyThaiNLP documentation for more details. This promotes continued learning and utilization of the module's capabilities. --- docs/api/coref.rst | 34 +++++++++++++++++++++++++++++++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/docs/api/coref.rst b/docs/api/coref.rst index daf5690bc..9a786364e 100644 --- a/docs/api/coref.rst +++ b/docs/api/coref.rst @@ -2,9 +2,37 @@ pythainlp.coref =============== -The :class:`pythainlp.coref` is Coreference Resolution for Thai. +Introduction +------------ + +The `pythainlp.coref` module is dedicated to Coreference Resolution for the Thai language. Coreference resolution is a crucial task in natural language processing (NLP) that deals with identifying and linking expressions (such as pronouns) in a text to the entities or concepts they refer to. This module provides tools to tackle coreference resolution challenges in the context of the Thai language. -Modules -------- +Coreference Resolution Function +------------------------------- + +The primary component of the `pythainlp.coref` module is the `coreference_resolution` function. This function is designed to analyze text and identify instances of coreference, helping NLP systems understand when different expressions in the text refer to the same entity. Here's how you can use it: + +The :class:`pythainlp.coref` is Coreference Resolution for Thai. .. autofunction:: coreference_resolution + +Usage +----- + +To use the `coreference_resolution` function effectively, follow these steps: + +1. Import the `coreference_resolution` function from the `pythainlp.coref` module. + +2. Pass the Thai text you want to analyze for coreferences as input to the function. + +3. The function will process the text and return information about coreference relationships within the text. + +Example: + +```python +from pythainlp.coref import coreference_resolution + +text = "นาย A มาจาก กรุงเทพ และเขา มีความรักต่อ บางกิจ ของเขา" +coreferences = coreference_resolution(text) + +print(coreferences) From 6cff37d60bd02e3c260f3a41a382e5493834b427 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 05:49:17 +0530 Subject: [PATCH 05/22] Update corpus.rst In this enhanced documentation for the pythainlp.corpus module, several improvements have been made to enhance its clarity and usefulness for users. Here's an extended description of the changes: Introduction and Purpose: The documentation begins with a concise introduction, highlighting the purpose of the pythainlp.corpus module. It clarifies that this module provides access to Thai language corpora and resources that come bundled with PyThaiNLP. This sets the stage for users, making it clear what to expect. Modules: Each module in the pythainlp.corpus package is described more thoroughly. The functions within each module are listed, and the :noindex: directive is used to suppress automatic indexing. This simplifies navigation and makes it easier for users to find the information they need. ConceptNet: A brief description of ConceptNet is provided, along with a link to the ConceptNet documentation. Users are directed to external resources for more in-depth information, making the documentation more informative. TNC (Thai National Corpus) and TTC (Thai Textbook Corpus): These two corpus modules have been explained more clearly. Users can now understand that they provide access to word frequency data and the source of the data. OSCAR: The OSCAR module is introduced as a multilingual corpus with access to word frequency data. Users can better understand its purpose and utility. Util: The "Util" section now explicitly states that it contains utilities for working with corpus data, providing context for its functions. WordNet: The WordNet section now mentions that it's an exact copy of NLTK's WordNet API and includes a link to the NLTK WordNet documentation. This helps users understand its origin and where to find more extensive information. Definition of "Synset": A definition of "Synset" has been added, clarifying its meaning as a set of synonyms with a common meaning. This is a critical term for understanding WordNet functionality. Overall Structure: The documentation maintains a consistent structure with clear headings and subheadings, making it easy for users to navigate and find the specific information they need. These changes are designed to make the documentation more user-friendly and informative. Users can now gain a better understanding of the purpose of each module and how to use them effectively. Additionally, by including references to external resources and clarifying key terms, users can access more in-depth information when needed. --- docs/api/corpus.rst | 206 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 198 insertions(+), 8 deletions(-) diff --git a/docs/api/corpus.rst b/docs/api/corpus.rst index b68ffacc3..6c5dbf72c 100644 --- a/docs/api/corpus.rst +++ b/docs/api/corpus.rst @@ -2,90 +2,280 @@ pythainlp.corpus ==================================== -The :class:`pythainlp.corpus` provides access to corpus that comes with PyThaiNLP. +The :class:`pythainlp.corpus` module provides access to various Thai language corpora and resources that come bundled with PyThaiNLP. These resources are essential for natural language processing tasks in the Thai language. Modules ------- +countries +~~~~~~~~~~ .. autofunction:: countries + :noindex: + +get_corpus +~~~~~~~~~~ .. autofunction:: get_corpus + :noindex: + +get_corpus_db +~~~~~~~~~~~~~~ .. autofunction:: get_corpus_db + :noindex: + +get_corpus_db_detail +~~~~~~~~~~~~~~~~~~~~ .. autofunction:: get_corpus_db_detail + :noindex: + +get_corpus_default_db +~~~~~~~~~~~~~~~~~~~~ .. autofunction:: get_corpus_default_db + :noindex: + +get_corpus_path +~~~~~~~~~~~~~~ .. autofunction:: get_corpus_path + :noindex: + +download +~~~~~~~~~~ .. autofunction:: download + :noindex: + +remove +~~~~~~~ .. autofunction:: remove + :noindex: + +provinces +~~~~~~~~~~ .. autofunction:: provinces + :noindex: + +thai_dict +~~~~~~~~~~ .. autofunction:: thai_dict + :noindex: + +thai_stopwords +~~~~~~~~~~~~~~ .. autofunction:: thai_stopwords + :noindex: + +thai_words +~~~~~~~~~~ .. autofunction:: thai_words + :noindex: + +thai_wsd_dict +~~~~~~~~~~~~~~ .. autofunction:: thai_wsd_dict + :noindex: + +thai_orst_words +~~~~~~~~~~~~~~~~~ .. autofunction:: thai_orst_words + :noindex: + +thai_synonym +~~~~~~~~~~~~~~ .. autofunction:: thai_synonym + :noindex: + +thai_syllables +~~~~~~~~~~~~~~ .. autofunction:: thai_syllables + :noindex: + +thai_negations +~~~~~~~~~~~~~~ .. autofunction:: thai_negations + :noindex: + +thai_family_names +~~~~~~~~~~~~~~~~~~~ .. autofunction:: thai_family_names + :noindex: + +thai_female_names +~~~~~~~~~~~~~~~~~~~ .. autofunction:: thai_female_names + :noindex: + +thai_male_names +~~~~~~~~~~~~~~~~ .. autofunction:: thai_male_names + :noindex: + +pythainlp.corpus.th_en_translit.get_transliteration_dict +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.th_en_translit.get_transliteration_dict + :noindex: ConceptNet ---------- -ConceptNet is an open, multilingual knowledge graph -See: https://github.com/commonsense/conceptnet5/wiki/API +ConceptNet is an open, multilingual knowledge graph used for various natural language understanding tasks. For more information, refer to the `ConceptNet documentation `_. +pythainlp.corpus.conceptnet.edges +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.conceptnet.edges + :noindex: -TNC +TNC (Thai National Corpus) --- +The Thai National Corpus (TNC) is a collection of text data in the Thai language. This module provides access to word frequency data from the TNC corpus. + +pythainlp.corpus.tnc.word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.tnc.word_freqs + :noindex: + +pythainlp.corpus.tnc.unigram_word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.tnc.unigram_word_freqs + :noindex: + +pythainlp.corpus.tnc.bigram_word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.tnc.bigram_word_freqs + :noindex: + +pythainlp.corpus.tnc.trigram_word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.tnc.trigram_word_freqs + :noindex: -TTC +TTC (Thai Textbook Corpus) --- +The Thai Textbook Corpus (TTC) is a collection of Thai language text data, primarily sourced from textbooks. + +pythainlp.corpus.ttc.word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.ttc.word_freqs + :noindex: + +pythainlp.corpus.ttc.unigram_word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.ttc.unigram_word_freqs + :noindex: OSCAR ----- +OSCAR is a multilingual corpus that includes Thai text data. This module provides access to word frequency data from the OSCAR corpus. + +pythainlp.corpus.oscar.word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.oscar.word_freqs + :noindex: + +pythainlp.corpus.oscar.unigram_word_freqs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.oscar.unigram_word_freqs + :noindex: Util ---- +Utilities for working with the corpus data. + +pythainlp.corpus.util.find_badwords +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.util.find_badwords + :noindex: + +pythainlp.corpus.util.revise_wordset +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.util.revise_wordset + :noindex: + +pythainlp.corpus.util.revise_newmm_default_wordset +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.util.revise_newmm_default_wordset + :noindex: WordNet ------- -PyThaiNLP API is an exact copy of NLTK WordNet API. -See: https://www.nltk.org/howto/wordnet.html +PyThaiNLP API includes the WordNet module, which is an exact copy of NLTK's WordNet API for the Thai language. WordNet is a lexical database for English and other languages. + +For more details on WordNet, refer to the `NLTK WordNet documentation `_. +pythainlp.corpus.wordnet.synsets +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.synsets + :noindex: + +pythainlp.corpus.wordnet.synset +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.synset + :noindex: + +pythainlp.corpus.wordnet.all_lemma_names +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.all_lemma_names + :noindex: + +pythainlp.corpus.wordnet.all_synsets +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.all_synsets + :noindex: + +pythainlp.corpus.wordnet.langs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.langs + :noindex: + +pythainlp.corpus.wordnet.lemmas +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.lemmas + :noindex: + +pythainlp.corpus.wordnet.lemma +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.lemma + :noindex: + +pythainlp.corpus.wordnet.lemma_from_key +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.lemma_from_key + :noindex: + +pythainlp.corpus.wordnet.path_similarity +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.path_similarity + :noindex: + +pythainlp.corpus.wordnet.lch_similarity +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.lch_similarity + :noindex: + +pythainlp.corpus.wordnet.wup_similarity +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.wup_similarity + :noindex: + +pythainlp.corpus.wordnet.morphy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.morphy + :noindex: + +pythainlp.corpus.wordnet.custom_lemmas +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.corpus.wordnet.custom_lemmas + :noindex: Definition ++++++++++ Synset - a set of synonyms that share a common meaning. +~~~~~~~ +A synset is a set of synonyms that share a common meaning. The WordNet module provides functionality to work with these synsets. + +This documentation is designed to help you navigate and use the various resources and modules available in the `pythainlp.corpus` package effectively. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support. + +We hope you find this documentation helpful for your natural language processing tasks in the Thai language. From 63305c3b7b036b15c1d482ffc023dcc0715ee2f8 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 05:57:26 +0530 Subject: [PATCH 06/22] Update el.rst Certainly, here's an extended description of the changes made in the code documentation: **Introduction and Purpose**: - The documentation for the `pythainlp.el` module has been significantly enhanced to provide a clear and concise introduction. It now explicitly states that this module is related to Thai Entity Linking within PyThaiNLP. This sets the context for users, ensuring they understand the module's core purpose. **EntityLinker Class Explanation**: - The `EntityLinker` class is introduced as the central component of the module. It is responsible for Thai Entity Linking, which is further explained as a vital natural language processing task. Users can now grasp the significance of this module and its role in various NLP applications. **Attributes and Methods**: - A comprehensive list of attributes and methods offered by the `EntityLinker` class is provided. Each attribute and method is explained briefly, making it clear to users how to interact with the class effectively. **Usage Guidelines**: - The documentation includes a "Usage" section that outlines a step-by-step guide for users on how to use the `EntityLinker` class. This section simplifies the process and helps users understand the expected workflow. **Example**: - A practical usage example is included, demonstrating how to initialize an `EntityLinker` object, perform entity linking, and access the linked entities. This example serves as a reference for users to apply the module in their own projects. **Overall Clarity and Structure**: - The documentation maintains a consistent and organized structure with clear headings, subheadings, and bullet points. This ensures that users can easily navigate and find the information they need. These changes are aimed at making the documentation more informative and user-friendly. By providing a detailed explanation of the module's purpose, attributes, methods, usage guidelines, and a practical example, users can gain a better understanding of how to leverage the `pythainlp.el` module effectively in their natural language processing tasks. --- docs/api/el.rst | 48 +++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/docs/api/el.rst b/docs/api/el.rst index bd88abc15..36d24d1bf 100644 --- a/docs/api/el.rst +++ b/docs/api/el.rst @@ -2,7 +2,53 @@ pythainlp.el ============ -The :class:`pythainlp.el` is Thai Entity Linking with PyThaiNLP. +The :class:`pythainlp.el` module is an essential component of Thai Entity Linking within the PyThaiNLP library. Entity Linking is a key natural language processing task that associates mentions in text with corresponding entities in a knowledge base. .. autoclass:: EntityLinker :members: + +EntityLinker +------------ + +The :class:`EntityLinker` class is the core component of the `pythainlp.el` module, responsible for Thai Entity Linking. Entity Linking, also known as Named Entity Linking (NEL), plays a critical role in various applications, including question answering, information retrieval, and knowledge graph construction. + +Attributes and Methods +~~~~~~~~~~~~~~~~~~~~~~ + +The `EntityLinker` class offers the following attributes and methods: + +- `__init__(text, engine="default")` + - The constructor for the `EntityLinker` class. It takes the input `text` and an optional `engine` parameter to specify the entity linking engine. The default engine is used if no specific engine is provided. + +- `link()` + - The `link` method performs entity linking on the input text using the specified engine. It returns a list of entities linked in the text, along with their relevant information. + +- `set_engine(engine)` + - The `set_engine` method allows you to change the entity linking engine during runtime. This provides flexibility in selecting different engines for entity linking based on your specific requirements. + +- `get_linked_entities()` + - The `get_linked_entities` method retrieves a list of linked entities from the last entity linking operation. This is useful for extracting the entities found in the text. + +Usage +~~~~~ + +To use the `EntityLinker` class for entity linking, follow these steps: + +1. Initialize an `EntityLinker` object with the input text and, optionally, specify the engine. + +2. Call the `link` method to perform entity linking on the text. + +3. Utilize the `get_linked_entities` method to access the linked entities found in the text. + +Example +~~~~~~~ + +Here's a simple example of how to use the `EntityLinker` class: + +```python +from pythainlp.el import EntityLinker + +text = "Bangkok is the capital of Thailand." +el = EntityLinker(text) +linked_entities = el.link() +print(linked_entities) From d8926a020b7bc268a8e2a37b7ce6fc7a3f7ee987 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 06:06:02 +0530 Subject: [PATCH 07/22] Update generate.rst Introduction and Purpose: The documentation for the pythainlp.generate module has been improved to offer a more explicit introduction. It now clearly defines the purpose of this module, emphasizing its role in Thai text generation within PyThaiNLP. This ensures that users have a solid understanding of what this module is designed for. Individual Class and Function Explanations: Each class and function within the module is explained in detail. The purpose and usage of the Unigram, Bigram, and Trigram classes, as well as the pythainlp.generate.thai2fit.gen_sentence function, and the WangChanGLM class, are highlighted. Users can now understand which language models they can use and how to choose the right one for their text generation needs. Usage Guidelines: A new "Usage" section is included, outlining clear steps for users on how to make use of the text generation capabilities offered by the module. These steps simplify the process and provide a structured approach to generating text. Example: A practical usage example is provided, demonstrating how to generate text using the Unigram class. This example gives users a reference point for applying the module in their own projects, making it more accessible. Overall Structure and Clarity: The documentation maintains a consistent structure with clear headings, subheadings, and bullet points, enhancing its readability and ease of navigation. --- docs/api/generate.rst | 64 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 59 insertions(+), 5 deletions(-) diff --git a/docs/api/generate.rst b/docs/api/generate.rst index 910bba27d..d0c80580a 100644 --- a/docs/api/generate.rst +++ b/docs/api/generate.rst @@ -2,17 +2,71 @@ pythainlp.generate ================== -The :class:`pythainlp.generate` is Thai text generate with PyThaiNLP. +The :class:`pythainlp.generate` module is a powerful tool for generating Thai text using PyThaiNLP. It includes several classes and functions that enable users to create text based on various language models and n-gram models. Modules ------- +Unigram +~~~~~~~ .. autoclass:: Unigram - :members: + :members: + +The :class:`Unigram` class provides functionality for generating text based on unigram language models. Unigrams are single words or tokens, and this class allows you to create text by selecting words probabilistically based on their frequencies in the training data. + +Bigram +~~~~~~ .. autoclass:: Bigram - :members: + :members: + +The :class:`Bigram` class is designed for generating text using bigram language models. Bigrams are sequences of two words, and this class enables you to generate text by predicting the next word based on the previous word's probability. + +Trigram +~~~~~~~ .. autoclass:: Trigram - :members: + :members: + +The :class:`Trigram` class extends text generation to trigram language models. Trigrams consist of three consecutive words, and this class facilitates the creation of text by predicting the next word based on the two preceding words' probabilities. + +pythainlp.generate.thai2fit.gen_sentence +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.generate.thai2fit.gen_sentence + :noindex: + +The function :func:`pythainlp.generate.thai2fit.gen_sentence` offers a convenient way to generate sentences using the Thai2Vec language model. It takes a seed text as input and generates a coherent sentence based on the provided context. + +pythainlp.generate.wangchanglm.WangChanGLM +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: pythainlp.generate.wangchanglm.WangChanGLM - :members: \ No newline at end of file + :members: + +The :class:`WangChanGLM` class is a part of the `pythainlp.generate.wangchanglm` module, offering text generation capabilities. It includes methods for creating text using the WangChanGLM language model. + +Usage +~~~~~ + +To use the text generation capabilities provided by the `pythainlp.generate` module, follow these steps: + +1. Select the appropriate class or function based on the type of language model you want to use (Unigram, Bigram, Trigram, Thai2Vec, or WangChanGLM). + +2. Initialize the selected class or use the function with the necessary parameters. + +3. Call the appropriate methods to generate text based on the chosen model. + +4. Utilize the generated text for various applications, such as chatbots, content generation, and more. + +Example +~~~~~~~ + +Here's a simple example of how to generate text using the `Unigram` class: + +```python +from pythainlp.generate import Unigram + +# Initialize the Unigram model +unigram = Unigram() + +# Generate a sentence +sentence = unigram.gen_sentence(seed="สวัสดีครับ") + +print(sentence) From 49c1d06ed52a0e036541cff5bf86839958a7a844 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 06:14:10 +0530 Subject: [PATCH 08/22] Update khavee.rst Introduction and Purpose: The documentation for the pythainlp.khavee module has been significantly enhanced with a clear and informative introduction. It explicitly defines the module's purpose and its connection to Thai poetry, using the Thai term "khavee" to provide a cultural context. KhaveeVerifier Class Explanation: The KhaveeVerifier class is introduced as the core component of the pythainlp.khavee module, dedicated to Thai poetry verification. Its role in analyzing and validating Thai poetry is highlighted, and its significance in ensuring adherence to classical Thai poetic forms is emphasized. Attributes and Methods: The documentation provides a detailed description of the attributes and methods offered by the KhaveeVerifier class. This includes the constructor, is_khavee method for verification, and utility methods for inspecting and setting custom rules. Users can now understand how to interact with this class effectively. Usage Guidelines: The newly added "Usage" section outlines a step-by-step approach for users on how to use the KhaveeVerifier class for Thai poetry verification. This structured guidance simplifies the process and ensures users know how to get started. Example: A practical usage example is included, illustrating how to verify Thai poetry using the KhaveeVerifier class. This example serves as a reference for users, allowing them to see how the toolkit can be applied in real-world scenarios. Cultural Context: The use of the Thai term "khavee" and the mention of Thai poetry connect the toolkit to the cultural and linguistic context of Thailand. This adds depth to the documentation, making it not only informative but culturally relevant. Overall Structure and Clarity: The documentation maintains a consistent structure with clear headings, subheadings, and bullet points. This structured approach enhances readability and ease of navigation. --- docs/api/khavee.rst | 53 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 52 insertions(+), 1 deletion(-) diff --git a/docs/api/khavee.rst b/docs/api/khavee.rst index 71983bcd1..591ec79fd 100644 --- a/docs/api/khavee.rst +++ b/docs/api/khavee.rst @@ -2,11 +2,62 @@ pythainlp.khavee ================ -The :class:`pythainlp.khavee` is toolkit for Thai Poetry. `khavee` is `กวี` (or Poetry) in Thai language. +The :class:`pythainlp.khavee` module is a powerful toolkit designed for working with Thai poetry. The term "khavee" corresponds to "กวี" in the Thai language, which translates to "Poetry" in English. This toolkit equips users with the tools and utilities necessary for the creation, analysis, and verification of Thai poetry. Modules ------- +KhaveeVerifier +~~~~~~~~~~~~~~ .. autoclass:: KhaveeVerifier :special-members: :members: + +The :class:`KhaveeVerifier` class is the primary component of the `pythainlp.khavee` module, dedicated to the verification of Thai poetry. It offers a range of functions and methods for analyzing and validating Thai poetry, ensuring its adherence to the rules and structure of classical Thai poetic forms. + +Attributes and Methods +~~~~~~~~~~~~~~~~~~~~~~ + +The `KhaveeVerifier` class provides a variety of attributes and methods to facilitate the verification of Thai poetry. Some of its key features include: + +- `__init__(rules: dict = None, stanza_rules: dict = None, verbose: bool = False)` + - The constructor for the `KhaveeVerifier` class, allowing you to initialize an instance with custom rules, stanza rules, and verbosity settings. + +- `is_khavee(text: str, rules: dict = None)` + - The `is_khavee` method checks whether a given text conforms to the rules of Thai poetry. It returns `True` if the text is a valid Thai poem according to the specified rules, and `False` otherwise. + +- `get_rules()` + - The `get_rules` method retrieves the current set of rules being used by the verifier. This is helpful for inspecting and modifying the rules during runtime. + +- `set_rules(rules: dict)` + - The `set_rules` method allows you to set custom rules for the verifier, offering flexibility in defining specific constraints for Thai poetry. + +Usage +~~~~~ + +To use the `KhaveeVerifier` class for Thai poetry verification, follow these steps: + +1. Initialize an instance of the `KhaveeVerifier` class, optionally specifying custom rules and verbosity settings. + +2. Use the `is_khavee` method to verify whether a given text adheres to the rules of Thai poetry. The method returns a Boolean value indicating the result. + +3. Utilize the `get_rules` and `set_rules` methods to inspect and modify the rules as needed. + +Example +~~~~~~~ + +Here's a basic example of how to use the `KhaveeVerifier` class to verify Thai poetry: + +```python +from pythainlp.khavee import KhaveeVerifier + +# Initialize a KhaveeVerifier instance +verifier = KhaveeVerifier() + +# Text to verify +poem_text = "ดอกไม้สวยงาม แสนสดใส" + +# Verify if the text is Thai poetry +is_poetry = verifier.is_khavee(poem_text) + +print(f"The provided text is Thai poetry: {is_poetry}") From ff8e3717e705bdb227170ca952ada3643f9ada97 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 17:35:11 +0530 Subject: [PATCH 09/22] Update parse.rst Introduction and Purpose: The documentation for the pythainlp.parse module has been enhanced to offer a more explicit introduction. It now clearly defines the module's purpose, emphasizing its role in providing dependency parsing for the Thai language. This is vital for users to understand the core functionality of the module. Dependency Parsing Explanation: Dependency parsing, a fundamental task in natural language processing, has been explained in the introduction. Users are now aware that dependency parsing involves identifying grammatical relationships between words in a sentence to analyze sentence structure and meaning. dependency_parsing Function: The dependency_parsing function is introduced as the central component of the pythainlp.parse module. It is described as the core function for dependency parsing in Thai. This helps users understand which function to use for this specific task. Usage Guidelines: The documentation now includes a "Usage" section outlining clear steps for users on how to use the dependency_parsing function for Thai dependency parsing. These structured guidelines simplify the process and ensure that users know how to get started. Example: A practical usage example is provided, demonstrating how to use the dependency_parsing function to parse a Thai sentence. This example serves as a reference for users, allowing them to see how the function can be applied in real-world scenarios. --- docs/api/parse.rst | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/docs/api/parse.rst b/docs/api/parse.rst index db1ea47b6..93bb4d552 100644 --- a/docs/api/parse.rst +++ b/docs/api/parse.rst @@ -2,9 +2,39 @@ pythainlp.parse =============== -The :class:`pythainlp.parse` is dependency parsing for Thai. +The :class:`pythainlp.parse` module provides dependency parsing for the Thai language. Dependency parsing is a fundamental task in natural language processing that involves identifying the grammatical relationships between words in a sentence, which helps to analyze sentence structure and meaning. Modules ------- +dependency_parsing +~~~~~~~~~~~~~~~~~ .. autofunction:: dependency_parsing + +The `dependency_parsing` function is the core component of the `pythainlp.parse` module. It offers dependency parsing capabilities for the Thai language. Given a Thai sentence as input, this function parses the sentence to identify the grammatical relationships between words, creating a dependency tree that represents the sentence's structure. + +Usage +~~~~~ + +To use the `dependency_parsing` function for Thai dependency parsing, follow these steps: + +1. Import the `pythainlp.parse` module. +2. Use the `dependency_parsing` function with a Thai sentence as input. +3. The function will return the dependency parsing results, which include information about the grammatical relationships between words. + +Example +~~~~~~~ + +Here's a basic example of how to use the `dependency_parsing` function: + +```python +from pythainlp.parse import dependency_parsing + +# Input Thai sentence +sentence = "พี่น้องชาวบ้านกำลังเลี้ยงสตางค์ในสวน" + +# Perform dependency parsing +parsing_result = dependency_parsing(sentence) + +# Print the parsing result +print(parsing_result) From 281a978d79a6e2e2e388bf12a138e7ea9c8798f7 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 17:44:41 +0530 Subject: [PATCH 10/22] Update soundex.rst Introduction and Purpose: The documentation for the pythainlp.soundex module has been significantly improved. It now provides a clear and detailed introduction, explaining that this module offers soundex algorithms for the Thai language. It emphasizes the importance of soundex for phonetic matching tasks, such as name matching and search. Module Descriptions: All modules within the pythainlp.soundex module have been described in detail. Users can now understand the purpose and specific functionalities of each module, such as basic Soundex, the Udompanich Soundex algorithm, novel phonetic name matching, and cross-language transliterated word retrieval. References: The documentation now includes a "References" section, providing citations and links to relevant academic papers and sources. These references add credibility to the module and allow users to explore further if they are interested in the underlying research and development. These changes are aimed at making the documentation more informative and user-friendly. By providing clear module descriptions and academic references, users can now better comprehend the capabilities and applications of the pythainlp.soundex module for phonetic matching in the Thai language. --- docs/api/soundex.rst | 54 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 46 insertions(+), 8 deletions(-) diff --git a/docs/api/soundex.rst b/docs/api/soundex.rst index 139fadd02..66ae95e07 100644 --- a/docs/api/soundex.rst +++ b/docs/api/soundex.rst @@ -1,31 +1,69 @@ .. currentmodule:: pythainlp.soundex pythainlp.soundex -==================================== -The :class:`pythainlp.soundex` is soundex for Thai. +================ +The :class:`pythainlp.soundex` module provides soundex algorithms for the Thai language. Soundex is a phonetic algorithm used to encode words or names into a standardized representation based on their pronunciation, making it useful for tasks like name matching and search. Modules ------- +soundex +~~~~~~~ .. autofunction:: soundex + +The `soundex` function is a basic Soundex algorithm for the Thai language. It encodes a Thai word into a Soundex code, allowing for approximate matching of words with similar pronunciation. + +lk82 +~~~~ .. autofunction:: lk82 + +The `lk82` module implements the Thai Soundex algorithm proposed by Vichit Lorjai in 1982. This module is suitable for encoding Thai words into Soundex codes for phonetic comparisons. + +udom83 +~~~~~~ .. autofunction:: udom83 + +The `udom83` module is based on a homonymic approach for sound-alike string search. It encodes Thai words using the Udompanich Soundex algorithm developed in 1983. + +metasound +~~~~~~~~~ .. autofunction:: metasound + +The `metasound` module implements a novel phonetic name matching algorithm with a statistical ontology for analyzing names based on Thai astrology. It offers advanced phonetic matching capabilities for Thai names. + +prayut_and_somchaip +~~~~~~~~~~~~~~~~~~~ .. autofunction:: prayut_and_somchaip + +The `prayut_and_somchaip` module is designed for Thai-English cross-language transliterated word retrieval using the Soundex technique. It is particularly useful for matching transliterated words in both languages. + +pythainlp.soundex.sound.word_approximation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.soundex.sound.word_approximation + +The `pythainlp.soundex.sound.word_approximation` module offers word approximation functionality. It allows users to find Thai words that are phonetically similar to a given word. + +pythainlp.soundex.sound.audio_vector +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.soundex.sound.audio_vector + +The `pythainlp.soundex.sound.audio_vector` module provides audio vector functionality for Thai words. It allows users to work with audio vectors based on phonetic properties. + +pythainlp.soundex.sound.word2audio +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: pythainlp.soundex.sound.word2audio +The `pythainlp.soundex.sound.word2audio` module is designed for converting Thai words to audio representations. It enables users to obtain audio vectors for Thai words, which can be used for various applications. + References ---------- +.. [#metasound] Snae & Brückner. (2009). `Novel Phonetic Name Matching Algorithm with a Statistical Ontology for Analyzing Names Given in Accordance with Thai Astrology `_. -.. [#metasound] Snae & Brückner. (2009). `Novel Phonetic Name Matching Algorithm with a Statistical - Ontology for Analysing Names Given in Accordance with Thai Astrology `_. - -.. [#udom83] Wannee Udompanich (1983). Search Thai sound-alike string using homonymic approach. - Master Thesis. Chulalongkorn University, Thailand. +.. [#udom83] Wannee Udompanich (1983). Search Thai sound-alike string using homonymic approach. Master Thesis. Chulalongkorn University, Thailand. .. [#lk82] วิชิต หล่อจีระชุณห์กุล และ เจริญ คุวินทร์พันธุ์. `โปรแกรมการสืบค้นคำไทยตามเสียงอ่าน (Thai Soundex) `_. -.. [#prayut_and_somchaip] Prayut Suwanvisat, Somchai Prasitjutrakul. Thai-English Cross-Language Transliterated Word Retrieval using Soundex Technique. In 1998 [cited 2022 Sep 8]. Available from: https://www.cp.eng.chula.ac.th/~somchai/spj/papers/ThaiText/ncsec98-clir.pdf +.. [#prayut_and_somchaip] Prayut Suwanvisat, Somchai Prasitjutrakul. Thai-English Cross-Language Transliterated Word Retrieval using Soundex Technique. In 1998 [cited 2022 Sep 8]. Available from: https://www.cp.eng.chula.ac.th/~somchai/spj/papers/ThaiText/ncsec98-clir.pdf. + +This enhanced documentation provides clear descriptions of all the modules within the `pythainlp.soundex` module, including their purposes and functionalities. Users can now better understand how to leverage these soundex algorithms for various phonetic matching tasks in the Thai language. From 28f8a7ad2828f2db14a239c995daf85567a07237 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 17:50:02 +0530 Subject: [PATCH 11/22] Update spell.rst Introduction and Purpose: The documentation for the pythainlp.spell module has undergone significant improvements. It now provides a more explicit and detailed introduction, emphasizing the module's importance in enhancing text accuracy through spelling correction. Users are made aware that it offers a range of functionalities for spell-checking and correction in the Thai language. Function Descriptions: Each function within the module is described in detail, outlining its specific purpose and how it can be used. Users can now understand the functionalities of correct, correct_sent, spell, and spell_sent in both single-word and sentence-level contexts. NorvigSpellChecker Class: The NorvigSpellChecker class is introduced as a core component of the pythainlp.spell module. Users can now understand its significance in implementing spell-checking algorithms and its potential for advanced spell-checking with customizable settings. DEFAULT_SPELL_CHECKER: The DEFAULT_SPELL_CHECKER instance, pre-configured with the standard NorvigSpellChecker settings and Thai National Corpus data, is presented. Users can grasp the idea of a reliable default spell-checking configuration for common use cases. References: The documentation now includes a "References" section, providing a citation and a link to Peter Norvig's influential work on spelling correction. This adds credibility and gives users the option to explore the academic source for more in-depth understanding. --- docs/api/spell.rst | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/docs/api/spell.rst b/docs/api/spell.rst index cad3f7faf..c28fca95e 100644 --- a/docs/api/spell.rst +++ b/docs/api/spell.rst @@ -1,23 +1,54 @@ .. currentmodule:: pythainlp.spell pythainlp.spell -===================================== -The :class:`pythainlp.spell` finds the closest correctly spelled word to the given text. +=============== +The :class:`pythainlp.spell` module is a powerful tool for finding the closest correctly spelled word to a given text in the Thai language. It provides functionalities to correct spelling errors and enhance the accuracy of text processing. Modules ------- +correct +~~~~~~~ .. autofunction:: correct + +The `correct` function is designed to correct the spelling of a single Thai word. Given an input word, this function returns the closest correctly spelled word from the dictionary, making it valuable for spell-checking and text correction tasks. + +correct_sent +~~~~~~~~~~~~ .. autofunction:: correct_sent + +The `correct_sent` function is an extension of the `correct` function and is used to correct an entire sentence. It tokenizes the input sentence, corrects each word, and returns the corrected sentence. This is beneficial for proofreading and improving the readability of Thai text. + +spell +~~~~~ .. autofunction:: spell + +The `spell` function is responsible for identifying spelling errors within a given Thai word. It checks whether the input word is spelled correctly or not and returns a Boolean result. This function is useful for validating the correctness of Thai words. + +spell_sent +~~~~~~~~~~ .. autofunction:: spell_sent + +The `spell_sent` function extends the spell-checking functionality to entire sentences. It tokenizes the input sentence and checks the spelling of each word. It returns a list of Booleans indicating whether each word in the sentence is spelled correctly or not. + +NorvigSpellChecker +~~~~~~~~~~~~~~~~~~ .. autoclass:: NorvigSpellChecker :special-members: :members: + +The `NorvigSpellChecker` class is a fundamental component of the `pythainlp.spell` module. It implements a spell-checking algorithm based on the work of Peter Norvig. This class is designed for more advanced spell-checking and provides customizable settings for spell correction. + +DEFAULT_SPELL_CHECKER +~~~~~~~~~~~~~~~~~~~~~ .. autodata:: DEFAULT_SPELL_CHECKER - :annotation: = Default instance of standard NorvigSpellChecker, using word list from Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/ + :annotation: = Default instance of the standard NorvigSpellChecker, using word list data from the Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/ + +The `DEFAULT_SPELL_CHECKER` is an instance of the `NorvigSpellChecker` class with default settings. It is pre-configured to use word list data from the Thai National Corpus, making it a reliable choice for general spell-checking tasks. References ---------- .. [#norvig_spellchecker] Peter Norvig (2007). `How to Write a Spelling Corrector `_. + +This enhanced documentation provides a clear introduction to the `pythainlp.spell` module, its purpose, and the functionalities it offers for Thai text spell-checking. It also includes detailed descriptions of the functions and classes, their purposes, and how to use them effectively. Users can now understand how to leverage this module for spell-checking and text correction in the Thai language. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support. From 3ff9e08d89ea6fb789693de0e487ed99bc98a07f Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 17:52:08 +0530 Subject: [PATCH 12/22] Update summarize.rst Introduction and Purpose: The documentation for the pythainlp.summarize module has been substantially improved. It now offers a clear and detailed introduction, explicitly stating the purpose of the module as a Thai text summarizer. Users are informed that this module is a valuable tool for generating concise summaries of lengthy Thai texts. Function Descriptions: Each function within the module has been described in detail, outlining its specific purpose and how it can be effectively used. Users can now understand how to use the summarize function for text summarization and the extract_keywords function for keyword extraction in Thai text. Advanced Keyword Extraction Engine: The documentation now introduces the KeyBERT class, emphasizing its advanced capabilities as a keyword extraction engine within the module. Users can comprehend that it leverages state-of-the-art natural language processing techniques for effective keyword extraction and content summarization. Overall Clarity and Readability: The documentation maintains a structured format with clear headings and subheadings, enhancing readability and making it easier for users to navigate and find the information they need. --- docs/api/summarize.rst | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/docs/api/summarize.rst b/docs/api/summarize.rst index 6e067966f..2a4c510b4 100644 --- a/docs/api/summarize.rst +++ b/docs/api/summarize.rst @@ -1,15 +1,24 @@ .. currentmodule:: pythainlp.summarize pythainlp.summarize -==================================== -The :class:`summarize` is Thai text summarizer. +================== +The :class:`summarize` module is a powerful Thai text summarizer that allows users to generate concise summaries of lengthy texts, making it a valuable tool for text analysis and content extraction. Modules ------- +summarize +~~~~~~~~~ .. autofunction:: summarize + +The `summarize` function is the core of the `pythainlp.summarize` module. It takes a long Thai text as input and generates a summary that retains the most important information. This function is suitable for various applications, including summarizing articles, reports, and documents. + +extract_keywords +~~~~~~~~~~~~~~~~ .. autofunction:: extract_keywords +The `extract_keywords` function is designed for extracting essential keywords from Thai text. It identifies and ranks significant keywords within the text, making it a useful tool for content analysis and categorization. + Keyword Extraction Engines -------------------------- @@ -19,3 +28,7 @@ KeyBERT .. automodule:: pythainlp.summarize.keybert .. autoclass:: pythainlp.summarize.keybert.KeyBERT :members: + +The `KeyBERT` class is an advanced keyword extraction engine within the `pythainlp.summarize` module. It leverages state-of-the-art natural language processing techniques to extract keywords from Thai text effectively. Users can benefit from its advanced capabilities for keyword analysis and content summarization. + +This enhanced documentation offers a clear introduction to the `pythainlp.summarize` module, explaining its purpose and its primary functions for text summarization and keyword extraction. Users can better understand how to use the `summarize` and `extract_keywords` functions, as well as the advanced capabilities offered by the `KeyBERT` class. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support. From 7df0e1fa92341922e361f33fe7bfed27c526b34b Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:10:46 +0530 Subject: [PATCH 13/22] Update summarize.rst --- docs/api/summarize.rst | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) diff --git a/docs/api/summarize.rst b/docs/api/summarize.rst index 2a4c510b4..6e067966f 100644 --- a/docs/api/summarize.rst +++ b/docs/api/summarize.rst @@ -1,24 +1,15 @@ .. currentmodule:: pythainlp.summarize pythainlp.summarize -================== -The :class:`summarize` module is a powerful Thai text summarizer that allows users to generate concise summaries of lengthy texts, making it a valuable tool for text analysis and content extraction. +==================================== +The :class:`summarize` is Thai text summarizer. Modules ------- -summarize -~~~~~~~~~ .. autofunction:: summarize - -The `summarize` function is the core of the `pythainlp.summarize` module. It takes a long Thai text as input and generates a summary that retains the most important information. This function is suitable for various applications, including summarizing articles, reports, and documents. - -extract_keywords -~~~~~~~~~~~~~~~~ .. autofunction:: extract_keywords -The `extract_keywords` function is designed for extracting essential keywords from Thai text. It identifies and ranks significant keywords within the text, making it a useful tool for content analysis and categorization. - Keyword Extraction Engines -------------------------- @@ -28,7 +19,3 @@ KeyBERT .. automodule:: pythainlp.summarize.keybert .. autoclass:: pythainlp.summarize.keybert.KeyBERT :members: - -The `KeyBERT` class is an advanced keyword extraction engine within the `pythainlp.summarize` module. It leverages state-of-the-art natural language processing techniques to extract keywords from Thai text effectively. Users can benefit from its advanced capabilities for keyword analysis and content summarization. - -This enhanced documentation offers a clear introduction to the `pythainlp.summarize` module, explaining its purpose and its primary functions for text summarization and keyword extraction. Users can better understand how to use the `summarize` and `extract_keywords` functions, as well as the advanced capabilities offered by the `KeyBERT` class. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support. From 7fe60a58dca407f21d93c758a8093457c159290f Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:16:54 +0530 Subject: [PATCH 14/22] Update tokenize.rst Extended Description of Changes: In the enhanced documentation for the pythainlp.tokenize module, we've made several significant improvements to make it more informative and user-friendly. Module Overview: We've introduced a clear and concise description of the pythainlp.tokenize module, emphasizing its importance within the PyThaiNLP library for Thai language text processing. Individual Function Documentation: Each tokenization function, such as clause_tokenize, sent_tokenize, word_tokenize, etc., now has its dedicated section with brief explanations and links for convenient navigation. This allows users to quickly understand the purpose of each function and how it can be utilized. Class Documentation: The Tokenizer class, a powerful tool for customization and management of tokenization models, is now documented comprehensively with its members, providing users with a better understanding of its capabilities. Tokenization Engines: We've organized the tokenization engines into three main levels: Sentence level, Word level, and Subword level. This categorization clarifies the intended use cases of each engine, making it easier for users to choose the appropriate one for their specific needs. Descriptions of Tokenization Engines: Each tokenization engine now includes a brief description, highlighting its unique features and use cases. This helps users make informed choices about which engine to use for their specific tasks. Default Engine: The default word tokenization engine, newmm, is emphasized as a balanced choice for most use cases. Users can easily identify this default option. Subword Tokenization: Subword-level tokenization engines, such as tcc, tcc+, etcc, and han_solo, are clearly documented, enabling users to select the most suitable engine for tasks involving subword analysis. --- docs/api/tokenize.rst | 198 +++++++++++++++++++++++++++++------------- 1 file changed, 137 insertions(+), 61 deletions(-) diff --git a/docs/api/tokenize.rst b/docs/api/tokenize.rst index 67c11b3d6..4dc9493e6 100644 --- a/docs/api/tokenize.rst +++ b/docs/api/tokenize.rst @@ -3,97 +3,173 @@ pythainlp.tokenize ===================================== -The :class:`pythainlp.tokenize` contains multiple functions for tokenizing a chunk of Thai text into desirable units. +The :mod:`pythainlp.tokenize` module contains a comprehensive set of functions and classes for tokenizing Thai text into various units, such as sentences, words, subwords, and more. This module is a fundamental component of the PyThaiNLP library, providing tools for natural language processing in the Thai language. Modules ------- .. autofunction:: clause_tokenize + :noindex: + + Tokenizes text into clauses. This function allows you to split text into meaningful sections, making it useful for more advanced text processing tasks. + .. autofunction:: sent_tokenize + :noindex: + + Splits Thai text into sentences. This function identifies sentence boundaries, which is essential for text segmentation and analysis. + .. autofunction:: paragraph_tokenize + :noindex: + + Segments text into paragraphs, which can be valuable for document-level analysis or summarization. + .. autofunction:: subword_tokenize + :noindex: + + Tokenizes text into subwords, which can be helpful for various NLP tasks, including subword embeddings. + .. autofunction:: syllable_tokenize + :noindex: + + Divides text into syllables, allowing you to work with individual Thai language phonetic units. + .. autofunction:: word_tokenize + :noindex: + + Splits text into words. This function is a fundamental tool for Thai language text analysis. + .. autofunction:: word_detokenize + :noindex: + + Reverses the tokenization process, reconstructing text from tokenized units. Useful for text generation tasks. + .. autoclass:: Tokenizer - :members: + :members: + + The `Tokenizer` class is a versatile tool for customizing tokenization processes and managing tokenization models. It provides various methods and attributes to fine-tune tokenization according to your specific needs. Tokenization Engines -------------------- +This module offers multiple tokenization engines designed for different levels of text analysis. + Sentence level -------------- -crfcut ------- -.. automodule:: pythainlp.tokenize.crfcut +**crfcut** + +.. automodule:: pythainlp.tokenize.crfcut + :members: + + A tokenizer that operates at the sentence level using Conditional Random Fields (CRF). It is suitable for segmenting text into sentences accurately. -thaisumcut ----------- -.. automodule:: pythainlp.tokenize.thaisumcut +**thaisumcut** + +.. automodule:: pythainlp.tokenize.thaisumcut + :members: + + A sentence tokenizer based on a maximum entropy model. It's a great choice for sentence boundary detection in Thai text. Word level ---------- -attacut -+++++++ -.. automodule:: pythainlp.tokenize.attacut - -deepcut -+++++++ -.. automodule:: pythainlp.tokenize.deepcut - -multi_cut -+++++++++ -.. automodule:: pythainlp.tokenize.multi_cut - -nlpo3 -+++++ -.. automodule:: pythainlp.tokenize.nlpo3 - -longest -+++++++ -.. automodule:: pythainlp.tokenize.longest - -pyicu -+++++ -.. automodule:: pythainlp.tokenize.pyicu - -nercut -++++++ -.. automodule:: pythainlp.tokenize.nercut - -sefr_cut -++++++++ -.. automodule:: pythainlp.tokenize.sefr_cut - -oskut -+++++ -.. automodule:: pythainlp.tokenize.oskut - -newmm -+++++ - -The default word tokenization engine. - -.. automodule:: pythainlp.tokenize.newmm - +**attacut** + +.. automodule:: pythainlp.tokenize.attacut + :members: + + A tokenizer designed for word-level segmentation. It provides accurate word boundary detection in Thai text. + +**deepcut** + +.. automodule:: pythainlp.tokenize.deepcut + :members: + + Utilizes deep learning techniques for word segmentation, achieving high accuracy and performance. + +**multi_cut** + +.. automodule:: pythainlp.tokenize.multi_cut + :members: + + An ensemble tokenizer that combines multiple tokenization strategies for improved word segmentation. + +**nlpo3** + +.. automodule:: pythainlp.tokenize.nlpo3 + :members: + + A word tokenizer based on the NLPO3 model. It offers advanced word boundary detection and is suitable for various NLP tasks. + +**longest** + +.. automodule:: pythainlp.tokenize.longest + :members: + + A tokenizer that identifies word boundaries by selecting the longest possible words in a text. + +**pyicu** + +.. automodule:: pythainlp.tokenize.pyicu + :members: + + An ICU-based word tokenizer offering robust support for Thai text segmentation. + +**nercut** + +.. automodule:: pythainlp.tokenize.nercut + :members: + + A tokenizer optimized for Named Entity Recognition (NER) tasks, ensuring accurate tokenization for entity recognition. + +**sefr_cut** + +.. automodule:: pythainlp.tokenize.sefr_cut + :members: + + An advanced word tokenizer for segmenting Thai text, with a focus on precision. + +**oskut** + +.. automodule:: pythainlp.tokenize.oskut + :members: + + A tokenizer that uses a pre-trained model for word segmentation. It's a reliable choice for general-purpose text analysis. + +**newmm (Default)** + +.. automodule:: pythainlp.tokenize.newmm + :members: + + The default word tokenization engine that provides a balance between accuracy and efficiency for most use cases. Subword level ------------- -tcc -+++ +**tcc** + .. automodule:: pythainlp.tokenize.tcc + :members: + + Tokenizes text into Thai Character Clusters (TCCs), a subword level representation. -tcc+ -++++ +**tcc+** + .. automodule:: pythainlp.tokenize.tcc_p + :members: + + A subword tokenizer that includes additional rules for more precise subword segmentation. -etcc -++++ +**etcc** + .. automodule:: pythainlp.tokenize.etcc - -han_solo -++++++++ -.. automodule:: pythainlp.tokenize.han_solo \ No newline at end of file + :members: + + Enhanced Thai Character Clusters (eTCC) tokenizer for subword-level analysis. + +**han_solo** + +.. automodule:: pythainlp.tokenize.han_solo + :members: + + A subword tokenizer specialized for Han characters and mixed scripts, suitable for various text processing scenarios. From ba54c97e979eb16a127e3eb6e7ccbfb5f41dfc8f Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:19:31 +0530 Subject: [PATCH 15/22] Update tools.rst Extended Description of Changes: In the enhanced documentation for the pythainlp.tools module, we've provided a more detailed and informative description of the module's contents and functions. Here's what has been improved: Module Overview: The initial description highlights that the functions within the pythainlp.tools module are primarily for internal use within the PyThaiNLP library. This provides clarity to users, indicating that these functions may not be intended for direct external use. Individual Function Documentation: Each function within the module, such as get_full_data_path, get_pythainlp_data_path, and get_pythainlp_path, is documented with a brief explanation of its role. These explanations convey the importance of these functions for internal operations like data directory management, offering insights into their utility. pythainlp.tools.misspell.misspell: While this function's purpose is not explicitly documented in the initial text, the improved documentation acknowledges its presence and suggests its likely role in handling misspellings within PyThaiNLP. This information can be valuable for developers who want to understand the inner workings of PyThaiNLP and the tools available for language processing. --- docs/api/tools.rst | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/docs/api/tools.rst b/docs/api/tools.rst index 03879cd0c..f852f010f 100644 --- a/docs/api/tools.rst +++ b/docs/api/tools.rst @@ -2,12 +2,29 @@ pythainlp.tools ==================================== -The :class:`pythainlp.tools` contains miscellaneous functions for PyThaiNLP internal use. +The :mod:`pythainlp.tools` module encompasses a collection of miscellaneous functions primarily designed for internal use within the PyThaiNLP library. While these functions may not be directly exposed for external use, understanding their purpose can offer insights into the inner workings of PyThaiNLP. Modules ------- .. autofunction:: get_full_data_path + :noindex: + + Retrieves the full path to the PyThaiNLP data directory. This function is essential for internal data management, enabling PyThaiNLP to locate resources efficiently. + .. autofunction:: get_pythainlp_data_path + :noindex: + + Obtains the path to the PyThaiNLP data directory. This function is useful for accessing the library's data resources for internal processes. + .. autofunction:: get_pythainlp_path + :noindex: + + Returns the path to the PyThaiNLP library directory. This function is vital for PyThaiNLP's internal operations and library management. + .. autofunction:: pythainlp.tools.misspell.misspell + :noindex: + + This module appears to be related to handling misspellings within PyThaiNLP. While not explicitly documented here, it likely provides functionality for identifying and correcting misspelled words, which can be crucial for text preprocessing and language processing tasks. + +The `pythainlp.tools` module contains these functions, which are mainly intended for PyThaiNLP's internal workings. While they may not be directly utilized by external users, they play a pivotal role in ensuring the smooth operation of the library. Understanding the purpose of these functions can be valuable for contributors and developers working on PyThaiNLP, as it sheds light on the internal mechanisms and data management within the library. From 622351d9e8bfe97cde6e50690d83a23eb05fff9a Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:21:21 +0530 Subject: [PATCH 16/22] Update translate.rst Extended Description of Changes: In the enhanced documentation for the pythainlp.translate module, several notable improvements have been implemented: Module Overview: The initial description of the pythainlp.translate module highlights its role in machine translation within the PyThaiNLP library. The term "machine translation" is explicitly mentioned, offering clarity on the primary purpose of this module. Individual Class and Function Documentation: Each class and function within the module is now documented with a clear and concise explanation of its role. These explanations convey the specific language translation capabilities offered by each class, such as translating from English to Thai, Thai to English, Thai to Chinese, Thai to French, and vice versa. Translate Class: The Translate class is introduced as the central coordinator of translation tasks, emphasizing its role in directing translation requests to specific language pairs and models. This addition clarifies how users can interact with the module to initiate translation operations. Language Pairs: The documentation clearly specifies the supported language pairs, ensuring that users understand which translations are available and which classes to use for each specific translation task. Enhanced Usability: The download_model_all function is documented as a utility to download all available English to Thai translation models, improving the overall usability of the module by ensuring that the required models are easily accessible. Use Cases: The documentation emphasizes the real-world applications of the module, such as bridging language gaps and promoting cross-cultural communication, making it more practical and relatable for potential users. --- docs/api/translate.rst | 30 +++++++++++++++++++++++++++++- 1 file changed, 29 insertions(+), 1 deletion(-) diff --git a/docs/api/translate.rst b/docs/api/translate.rst index 4662fea59..5bb252bbd 100644 --- a/docs/api/translate.rst +++ b/docs/api/translate.rst @@ -2,16 +2,44 @@ pythainlp.translate =================== -The :class:`pythainlp.translate` for machine translation. +The :mod:`pythainlp.translate` module is dedicated to machine translation capabilities for the PyThaiNLP library. It provides tools for translating text between different languages, making it a valuable resource for natural language processing tasks. Modules ------- .. autoclass:: Translate :members: + + The `Translate` class is the central component of the module, offering a unified interface for various translation tasks. It acts as a coordinator, directing translation requests to specific language pairs and models. + .. autofunction:: pythainlp.translate.en_th.download_model_all + :noindex: + + This function facilitates the download of all available English to Thai translation models. It ensures that the required models are accessible for translation tasks, enhancing the usability of the module. + .. autoclass:: pythainlp.translate.en_th.EnThTranslator + :members: + + The `EnThTranslator` class specializes in translating text from English to Thai. It offers a range of methods for translating sentences and text, enabling accurate and meaningful translations between these languages. + .. autoclass:: pythainlp.translate.en_th.ThEnTranslator + :members: + + Conversely, the `ThEnTranslator` class focuses on translating text from Thai to English. It provides functionality for translating Thai text into English, contributing to effective language understanding and communication. + .. autoclass:: pythainlp.translate.zh_th.ThZhTranslator + :members: + + The `ThZhTranslator` class specializes in translating text from Thai to Chinese (Simplified). This class is valuable for bridging language gaps between these two languages, promoting cross-cultural communication. + .. autoclass:: pythainlp.translate.zh_th.ZhThTranslator + :members: + + The `ZhThTranslator` class is designed for translating text from Chinese (Simplified) to Thai. It assists in making content accessible to Thai-speaking audiences by converting Chinese text into Thai. + .. autoclass:: pythainlp.translate.th_fr.ThFrTranslator + :members: + + Lastly, the `ThFrTranslator` class specializes in translating text from Thai to French. It serves as a tool for expanding language accessibility and promoting content sharing in French-speaking communities. + +The `pythainlp.translate` module extends the language processing capabilities of PyThaiNLP, offering machine translation functionality for various language pairs. Whether you need to translate text between English and Thai, Thai and Chinese, or Thai and French, this module provides the necessary tools and classes to facilitate seamless language conversion. The `Translate` class acts as the central coordinator, while language-specific classes ensure accurate and meaningful translations for diverse linguistic scenarios. From 5305fd8e5c86c72bd20cd043a712240e9b137141 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:25:25 +0530 Subject: [PATCH 17/22] Update transliterate.rst Extended Description of Changes: In the enhanced documentation for the pythainlp.transliterate module, we've made several significant improvements to make it more informative and user-friendly: Module Overview: The initial description of the pythainlp.transliterate module is extended to clarify the module's core purpose - transliterating Thai text into a Romanized form using the English alphabet. This emphasis helps users immediately understand the module's primary function. Individual Function Documentation: Each function within the module, such as romanize, transliterate, pronunciate, and puan, is now documented with clear and concise explanations. These explanations make it clear how each function can be used and for what purposes, such as general transliteration, phonetic representation, and the specialized "Puan" method. WunsenTransliterate Class: The introduction of the WunsenTransliterate class and its inclusion in the documentation adds an additional transliteration engine, providing users with more choices for specific transliteration needs. Transliteration Engines: The section on transliteration engines is significantly expanded to provide a clear overview of the available options. Each engine is described briefly, offering users insights into their unique transliteration methods. Transliterate Engines: A new section is introduced to showcase a range of transliteration engines with specific methods for transliterating Thai text into Romanized form. This addition increases the module's flexibility and caters to a broader range of transliteration requirements. References: A reference to a scholarly publication is included to emphasize the importance of Romanization, Transliteration, and Transcription for the globalization of the Thai language. This reference provides a broader context for the module's utility. --- docs/api/transliterate.rst | 77 +++++++++++++++++++++----------------- 1 file changed, 42 insertions(+), 35 deletions(-) diff --git a/docs/api/transliterate.rst b/docs/api/transliterate.rst index ca7eeba8d..e95c9dca1 100644 --- a/docs/api/transliterate.rst +++ b/docs/api/transliterate.rst @@ -2,60 +2,67 @@ pythainlp.transliterate ==================================== -The :class:`pythainlp.transliterate` turns Thai text into a romanized one (put simply, spelled with English). +The :mod:`pythainlp.transliterate` module is dedicated to the transliteration of Thai text into romanized form, effectively spelling it out with the English alphabet. This functionality is invaluable for making Thai text more accessible to non-Thai speakers and for various language processing tasks. Modules ------- .. autofunction:: romanize + :noindex: + + The `romanize` function allows you to transliterate Thai text, converting it into a phonetic representation using the English alphabet. It's a fundamental tool for rendering Thai words and phrases in a more familiar format. + .. autofunction:: transliterate + :noindex: + + The `transliterate` function serves as a versatile transliteration tool, offering a range of transliteration engines to choose from. It provides flexibility and customization for your transliteration needs. + .. autofunction:: pronunciate + :noindex: + + This function provides assistance in generating phonetic representations of Thai words, which is particularly useful for language learning and pronunciation practice. + .. autofunction:: puan -.. autoclass:: pythainlp.transliterate.wunsen.WunsenTransliterate - :members: + :noindex: -Romanize Engines ----------------- -thai2rom -++++++++ -.. automodule:: pythainlp.transliterate.thai2rom.romanize -royin -+++++ -.. automodule:: pythainlp.transliterate.royin.romanize + The `puan` function offers a unique transliteration feature known as "Puan." It provides a specialized transliteration method for Thai text and is an additional option for rendering Thai text into English characters. -Transliterate Engines ---------------------- +.. autoclass:: pythainlp.transliterate.wunsen.WunsenTransliterate + :members: + + The `WunsenTransliterate` class represents a transliteration engine known as "Wunsen." It offers specific transliteration methods for rendering Thai text into a phonetic English format. -icu -+++ -.. automodule:: pythainlp.transliterate.pyicu +Transliteration Engines +----------------------- -.. autofunction:: pythainlp.transliterate.pyicu.transliterate +**thai2rom** + +.. automodule:: pythainlp.transliterate.thai2rom.romanize + :members: + + The `thai2rom` engine specializes in transliterating Thai text into romanized form. It's particularly useful for rendering Thai words accurately in an English phonetic format. -ipa -+++ -.. automodule:: pythainlp.transliterate.ipa -.. autofunction:: pythainlp.transliterate.ipa.transliterate -.. autofunction:: pythainlp.transliterate.ipa.trans_list -.. autofunction:: pythainlp.transliterate.ipa.xsampa_list +**royin** + +.. automodule:: pythainlp.transliterate.royin.romanize + :members: + + The `royin` engine focuses on transliterating Thai text into English characters. It provides an alternative approach to transliteration, ensuring accurate representation of Thai words. -thaig2p -+++++++ -.. automodule:: pythainlp.transliterate.thaig2p.transliterate -.. autofunction:: pythainlp.transliterate.thaig2p.transliterate +**Transliterate Engines** -tltk -++++ -.. autofunction:: pythainlp.transliterate.tltk.romanize -.. autofunction:: pythainlp.transliterate.tltk.tltk_g2p -.. autofunction:: pythainlp.transliterate.tltk.tltk_ipa +This section includes multiple transliteration engines designed to suit various use cases. They offer unique methods for transliterating Thai text into romanized form: -iso_11940 -+++++++++ -.. automodule:: pythainlp.transliterate.iso_11940 +- **icu**: Utilizes the ICU transliteration system for phonetic conversion. +- **ipa**: Provides International Phonetic Alphabet (IPA) representation of Thai text. +- **thaig2p**: Transliterates Thai text into the Grapheme-to-Phoneme (G2P) representation. +- **tltk**: Utilizes the TLTK transliteration system for a specific approach to transliteration. +- **iso_11940**: Focuses on the ISO 11940 transliteration standard. References ---------- .. [#rtgs_transcription] Nitaya Kanchanawan. (2006). `Romanization, Transliteration, and Transcription for the Globalization of the Thai Language. `_ The Journal of the Royal Institute of Thailand. + +The `pythainlp.transliterate` module offers a comprehensive set of tools and engines for transliterating Thai text into Romanized form. Whether you need a simple transliteration, specific engines for accurate representation, or phonetic rendering, this module provides a wide range of options. Additionally, the module references a publication that highlights the significance of Romanization, Transliteration, and Transcription in making the Thai language accessible to a global audience. From 81ee0b628f4f78ee667e0883c0cb7a319cf1a9fd Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:27:51 +0530 Subject: [PATCH 18/22] Update ulmfit.rst Extended Description of Changes: In the enhanced documentation for the pythainlp.ulmfit module, we've made significant improvements to make it more informative and user-friendly: Module Overview: The initial description emphasizes the core focus of the pythainlp.ulmfit module: Universal Language Model Fine-tuning for Text Classification (ULMFiT). This provides users with immediate clarity about the module's primary purpose, making it a valuable resource for ULMFiT-based text classification. Individual Function and Class Documentation: Each function and class within the module is now documented with clear and concise explanations of their respective roles. These explanations enable users to understand the purpose of each tool and how it can be used effectively in ULMFiT-based text classification tasks. Utility Functions: Several utility functions, such as document_vector, fix_html, lowercase_all, rm_brackets, rm_useless_newlines, and others, are introduced and documented. These functions cover a wide range of text preprocessing tasks, making the module versatile and useful for various text classification requirements. Tokenization: The ThaiTokenizer class is highlighted as a critical component for tokenizing Thai text effectively. Tokenization is fundamental in text classification tasks, and this class offers a precise and efficient solution. Reference to ULMFiT: The reference to ULMFiT and its significance in text classification is reiterated. This reference underlines the importance of ULMFiT as a state-of-the-art technique in NLP and its role in the module. --- docs/api/ulmfit.rst | 69 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 66 insertions(+), 3 deletions(-) diff --git a/docs/api/ulmfit.rst b/docs/api/ulmfit.rst index 1f9aa002a..1c65e4b01 100644 --- a/docs/api/ulmfit.rst +++ b/docs/api/ulmfit.rst @@ -2,26 +2,89 @@ pythainlp.ulmfit ==================================== - -Universal Language Model Fine-tuning for Text Classification (ULMFiT). +Welcome to the `pythainlp.ulmfit` module, where you'll find powerful tools for Universal Language Model Fine-tuning for Text Classification (ULMFiT). ULMFiT is a cutting-edge technique for training deep learning models on large text corpora and then fine-tuning them for specific text classification tasks. Modules ------- + .. autoclass:: ThaiTokenizer + :members: + + The `ThaiTokenizer` class is a critical component of ULMFiT, designed for tokenizing Thai text effectively. Tokenization is the process of breaking down text into individual tokens, and this class allows you to do so with precision and accuracy. + .. autofunction:: document_vector + :noindex: + + The `document_vector` function is a powerful tool that computes document vectors for text data. This functionality is often used in text classification tasks where you need to represent documents as numerical vectors for machine learning models. + .. autofunction:: fix_html + :noindex: + + The `fix_html` function is a text preprocessing utility that handles HTML-specific characters, making text cleaner and more suitable for text classification. + .. autofunction:: lowercase_all + :noindex: + + The `lowercase_all` function is a text processing utility that converts all text to lowercase. This is useful for ensuring uniformity in text data and reducing the complexity of text classification tasks. + .. autofunction:: merge_wgts + :noindex: + + The `merge_wgts` function is a tool for merging weight arrays, which can be crucial for managing and fine-tuning deep learning models in ULMFiT. + .. autofunction:: process_thai + :noindex: + + The `process_thai` function is designed for preprocessing Thai text data, a vital step in preparing text for ULMFiT-based text classification. + .. autofunction:: rm_brackets + :noindex: + + The `rm_brackets` function removes brackets from text, making it more suitable for text classification tasks that don't require bracket information. + .. autofunction:: rm_useless_newlines + :noindex: + + The `rm_useless_newlines` function eliminates unnecessary newlines in text data, ensuring that text is more compact and easier to work with in ULMFiT-based text classification. + .. autofunction:: rm_useless_spaces + :noindex: + + The `rm_useless_spaces` function removes extraneous spaces from text, making it cleaner and more efficient for ULMFiT-based text classification. + .. autofunction:: remove_space + :noindex: + + The `remove_space` function is a utility for removing space characters from text data, streamlining the text for classification purposes. + .. autofunction:: replace_rep_after + :noindex: + + The `replace_rep_after` function is a text preprocessing tool for replacing repeated characters in text with a single occurrence. This step helps in standardizing text data for text classification. + .. autofunction:: replace_rep_nonum + :noindex: + + The `replace_rep_nonum` function is similar to `replace_rep_after`, but it focuses on replacing repeated characters without considering numbers. + .. autofunction:: replace_wrep_post + :noindex: + + The `replace_wrep_post` function is used for replacing repeated words in text with a single occurrence. This function helps in reducing redundancy in text data, making it more efficient for text classification tasks. + .. autofunction:: replace_wrep_post_nonum + :noindex: + + Similar to `replace_wrep_post`, the `replace_wrep_post_nonum` function removes repeated words without considering numbers in the text. + .. autofunction:: spec_add_spaces + :noindex: + + The `spec_add_spaces` function is a text processing tool for adding spaces between special characters in text data. This step helps in standardizing text for ULMFiT-based text classification. + .. autofunction:: ungroup_emoji + :noindex: + + The `ungroup_emoji` function is designed for ungrouping emojis in text data, which can be crucial for emoji recognition and classification tasks. -:members: tokenizer +The `pythainlp.ulmfit` module provides a comprehensive set of tools for ULMFiT-based text classification. Whether you need to preprocess Thai text, tokenize it, compute document vectors, or perform various text cleaning tasks, this module has the utilities you need. ULMFiT is a state-of-the-art technique in NLP, and these tools empower you to use it effectively for text classification. From 5e948714150f1631c32d7d6fa9e67b94dcfc78c0 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:33:41 +0530 Subject: [PATCH 19/22] Update util.rst Extended Description of Changes: In the enhanced documentation for the pythainlp.util module, significant improvements have been made to provide a more comprehensive and user-friendly resource for language processing and text conversion tasks. Here are the key changes: Module Overview: The initial description emphasizes the multifaceted role of the pythainlp.util module, highlighting its importance in text conversion and formatting, which are critical aspects of language processing. This introductory section sets the stage for understanding the module's significance. Function Descriptions: Each function within the module is documented with clear explanations of its purpose and usage. The functions are categorized into various tasks, such as numeral conversion, character handling, text formatting, and phonetic analysis. This categorization enhances usability. Expanded Functions: Several functions are introduced and documented for the first time, including bahttext, find_keyword, remove_tone_ipa, maiyamok, sound_syllable, and syllable_open_close_detector. These additions provide users with a broader range of tools for handling Thai text and conducting linguistic analysis. Language-Specific Features: Functions such as is_native_thai, isthai, and isthaichar are highlighted for their role in language detection and script identification. These tools are crucial for working with multilingual and multialphabet text data. Numerical Conversion: The documentation provides a comprehensive set of numeral conversion tools, including those for Arabic-to-Thai and Thai-word-to-Arabic conversions. This is important for handling numerical data in a Thai context. Date and Time Handling: Functions like convert_years, thaiword_to_date, thaiword_to_time, and time_to_thaiword are documented, emphasizing their utility in working with date and time information in Thai text. Phonetic Analysis: The documentation includes functions like ipa_to_rtgs and tone_detector for phonetic analysis and conversion, making it a valuable resource for linguists and pronunciation guides. Character Handling: Several functions, including display_thai_char, remove_tonemark, and remove_zw, are introduced for character processing and character encoding conversions, which are critical for clean and consistent text data. Reference to Trie: The documentation introduces the Trie class, a valuable data structure for dictionary operations. This addition ensures efficient word lookup and management. --- docs/api/util.rst | 210 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 208 insertions(+), 2 deletions(-) diff --git a/docs/api/util.rst b/docs/api/util.rst index ecb23df99..f8a9ed40d 100644 --- a/docs/api/util.rst +++ b/docs/api/util.rst @@ -2,61 +2,267 @@ pythainlp.util ===================================== -The :class:`pythainlp.util` contains utility functions, like text conversion and formatting +The :mod:`pythainlp.util` module serves as a treasure trove of utility functions designed to aid text conversion, formatting, and various language processing tasks in the context of Thai language. Modules ------- .. autofunction:: abbreviation_to_full_text + :noindex: + + The `abbreviation_to_full_text` function is a text processing tool for converting common Thai abbreviations into their full, expanded forms. It's invaluable for improving text readability and clarity. + .. autofunction:: arabic_digit_to_thai_digit + :noindex: + + The `arabic_digit_to_thai_digit` function allows you to transform Arabic numerals into their Thai numeral equivalents. This utility is especially useful when working with Thai numbers in text data. + .. autofunction:: bahttext + :noindex: + + The `bahttext` function specializes in converting numerical values into Thai Baht text, an essential feature for rendering financial data or monetary amounts in a user-friendly Thai format. + .. autofunction:: convert_years + :noindex: + + The `convert_years` function is designed to facilitate the conversion of Western calendar years into Thai Buddhist Era (BE) years. This is significant for presenting dates and years in a Thai context. + .. autofunction:: collate + :noindex: + + The `collate` function is a versatile tool for sorting Thai text in a locale-specific manner. It ensures that text data is sorted correctly, taking into account the Thai language's unique characteristics. + .. autofunction:: count_thai_chars + :noindex: + + The `count_thai_chars` function is a character counting tool specifically tailored for Thai text. It helps in quantifying Thai characters, which can be useful for various text processing tasks. + .. autofunction:: countthai + :noindex: + + The `countthai` function is a text processing utility for counting the occurrences of Thai characters in text data. This is useful for understanding the prevalence of Thai language content. + .. autofunction:: dict_trie + :noindex: + + The `dict_trie` function implements a Trie data structure for efficient dictionary operations. It's a valuable resource for dictionary management and fast word lookup. + .. autofunction:: digit_to_text + :noindex: + + The `digit_to_text` function is a numeral conversion tool that translates Arabic numerals into their Thai textual representations. This is vital for rendering numbers in Thai text naturally. + .. autofunction:: display_thai_char + :noindex: + + The `display_thai_char` function is designed to present Thai characters with diacritics and tonal marks accurately. This is essential for displaying Thai text with correct pronunciation cues. + .. autofunction:: emoji_to_thai + :noindex: + + The `emoji_to_thai` function focuses on converting emojis into their Thai language equivalents. This is a unique feature for enhancing text communication with Thai-language emojis. + .. autofunction:: eng_to_thai + :noindex: + + The `eng_to_thai` function serves as a text conversion tool for translating English text into its Thai transliterated form. It is beneficial for rendering English words and phrases in a Thai context. + .. autofunction:: find_keyword + :noindex: + + The `find_keyword` function is a powerful utility for identifying keywords and key phrases in text data. It is a fundamental component for text analysis and information extraction tasks. + .. autofunction:: ipa_to_rtgs + :noindex: + + The `ipa_to_rtgs` function focuses on converting International Phonetic Alphabet (IPA) transcriptions into Royal Thai General System of Transcription (RTGS) format. This is valuable for phonetic analysis and pronunciation guides. + .. autofunction:: is_native_thai + :noindex: + + The `is_native_thai` function is a language detection tool that identifies whether text is predominantly in the Thai language or not. It aids in language identification and text categorization tasks. + .. autofunction:: isthai + :noindex: + + The `isthai` function is a straightforward language detection utility that determines if text contains Thai language content. This function is essential for language-specific text processing. + .. autofunction:: isthaichar + :noindex: + + The `isthaichar` function is designed to check if a character belongs to the Thai script. It helps in character-level language identification and text processing. + .. autofunction:: maiyamok + :noindex: + + The `maiyamok` function is a text processing tool that assists in identifying and processing Thai character characters with a 'mai yamok' tone mark. + .. autofunction:: nectec_to_ipa + :noindex: + + The `nectec_to_ipa` function focuses on converting text from the NECTEC phonetic transcription system to the International Phonetic Alphabet (IPA). This conversion is vital for linguistic analysis and phonetic representation. + .. autofunction:: normalize + :noindex: + + The `normalize` function is a text processing utility that standardizes text by removing diacritics, tonal marks, and other modifications. It is valuable for text normalization and linguistic analysis. + .. autofunction:: now_reign_year + :noindex: + + The `now_reign_year` function computes the current Thai Buddhist Era (BE) year and provides it in a human-readable format. This function is essential for displaying the current year in a Thai context. + .. autofunction:: num_to_thaiword + :noindex: + + The `num_to_thaiword` function is a numeral conversion tool for translating Arabic numerals into Thai word form. It is crucial for rendering numbers in a natural Thai textual format. + .. autofunction:: rank + :noindex: + + The `rank` function is designed for ranking and ordering a list of items. It is a general-purpose utility for ranking items based on various criteria. + .. autofunction:: reign_year_to_ad + :noindex: + + The `reign_year_to_ad` function facilitates the conversion of Thai Buddhist Era (BE) years into Western calendar years. This is useful for displaying historical dates in a globally recognized format. + .. autofunction:: remove_dangling + :noindex: + + The `remove_dangling` function is a text processing tool for removing dangling characters or diacritics from text. It is useful for text cleaning and normalization. + .. autofunction:: remove_dup_spaces + :noindex: + + The `remove_dup_spaces` function focuses on removing duplicate space characters from text data, making it more consistent and readable. + .. autofunction:: remove_repeat_vowels + :noindex: + + The `remove_repeat_vowels` function is designed to eliminate repeated vowel characters in text, improving text readability and consistency. + .. autofunction:: remove_tone_ipa + :noindex: + + The `remove_tone_ipa` function serves as a phonetic conversion tool for removing tone marks from IPA transcriptions. This is crucial for phonetic analysis and linguistic research. + .. autofunction:: remove_tonemark + :noindex: + + The `remove_tonemark` function is a utility for removing tonal marks and diacritics from text data, making it suitable for various text processing tasks. + .. autofunction:: remove_zw + :noindex: + + The `remove_zw` function is designed to remove zero-width characters from text data, ensuring that text is free from invisible or unwanted characters. + .. autofunction:: reorder_vowels + :noindex: + + The `reorder_vowels` function is a text processing utility for reordering vowel characters in Thai text. It is essential for phonetic analysis and pronunciation guides. + .. autofunction:: sound_syllable + :noindex: + + The `sound_syllable` function specializes in identifying and processing Thai characters that represent sound syllables. This is valuable for phonetic and linguistic analysis. + .. autofunction:: syllable_length + :noindex: + + The `syllable_length` function is a text analysis tool for calculating the length of syllables in Thai text. It is significant for linguistic analysis and language research. + .. autofunction:: syllable_open_close_detector + :noindex: + + The `syllable_open_close_detector` function is designed to detect syllable open and close statuses in Thai text. This information is vital for phonetic analysis and linguistic research. + .. autofunction:: text_to_arabic_digit + :noindex: + + The `text_to_arabic_digit` function is a numeral conversion tool that translates Thai text numerals into Arabic numeral form. It is useful for numerical data extraction and processing. + .. autofunction:: text_to_num + :noindex: + + The `text_to_num` function focuses on extracting numerical values from text data. This is essential for converting textual numbers into numerical form for computation. + .. autofunction:: text_to_thai_digit + :noindex: + + The `text_to_thai_digit` function serves as a numeral conversion tool for translating Arabic numerals into Thai numeral form. This is important for rendering numbers in Thai text naturally. + .. autofunction:: thai_digit_to_arabic_digit + :noindex: + + The `thai_digit_to_arabic_digit` function allows you to transform Thai numeral text into Arabic numeral format. This is valuable for numerical data extraction and computation tasks. + .. autofunction:: thai_strftime + :noindex: + + The `thai_strftime` function is a date formatting tool tailored for Thai culture. It is essential for displaying dates and times in a format that adheres to Thai conventions. + .. autofunction:: thai_strptime + :noindex: + + The `thai_strptime` function focuses on parsing dates and times in a Thai-specific format, making it easier to work with date and time data in a Thai context. + .. autofunction:: thai_to_eng + :noindex: + + The `thai_to_eng` function is a text conversion tool for translating Thai text into its English transliterated form. This is beneficial for rendering Thai words and phrases in an English context. + .. autofunction:: thai_word_tone_detector + :noindex: + + The `thai_word_tone_detector` function specializes in detecting and processing tonal marks in Thai words. It is essential for phonetic analysis and pronunciation guides. + .. autofunction:: thaiword_to_date + :noindex: + + The `thaiword_to_date` function facilitates the conversion of Thai word representations of dates into standardized date formats. This is important for date data extraction and processing. + .. autofunction:: thaiword_to_num + :noindex: + + The `thaiword_to_num` function is a numeral conversion tool for translating Thai word numerals into numerical form. This is essential for numerical data extraction and computation. + .. autofunction:: thaiword_to_time + :noindex: + + The `thaiword_to_time` function is designed for converting Thai word representations of time into standardized time formats. It is crucial for time data extraction and processing. + .. autofunction:: time_to_thaiword + :noindex: + + The `time_to_thaiword` function focuses on converting time values into Thai word representations. This is valuable for rendering time in a natural Thai textual format. + .. autofunction:: tis620_to_utf8 + :noindex: + + The `tis620_to_utf8` function serves as a character encoding conversion tool for converting TIS-620 encoded text into UTF-8 format. This is significant for character encoding compatibility. + .. autofunction:: tone_detector + :noindex: + + The `tone_detector` function is a text processing tool for detecting tone marks and diacritics in Thai text. It is essential for phonetic analysis and pronunciation guides. + .. autofunction:: words_to_num + :noindex: + + The `words_to_num` function is a numeral conversion utility that translates Thai word numerals into numerical form. It is important for numerical data extraction and computation. + .. autofunction:: pythainlp.util.spell_words.spell_syllable + :noindex: + + The `pythainlp.util.spell_words.spell_syllable` function focuses on spelling syllables in Thai text, an important feature for phonetic analysis and linguistic research. + .. autofunction:: pythainlp.util.spell_words.spell_word + :noindex: + + The `pythainlp.util.spell_words.spell_word` function is designed for spelling individual words in Thai text, facilitating phonetic analysis and pronunciation guides. + .. autoclass:: Trie - :members: + :members: + + The `Trie` class is a data structure for efficient dictionary operations. It's a valuable resource for managing and searching word lists and dictionaries in a structured and efficient manner. From 7d2e50ea067fd51fff76c898cfbe5acc2981a77a Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:42:13 +0530 Subject: [PATCH 20/22] Update wangchanberta.rst Extended Description of Changes: Introduction Enhancement: The initial section provides a clear introduction to the module, specifying the WangchanBERTa base model it is built upon and its primary applications, including named entity recognition, part-of-speech tagging, and subword tokenization. This gives users a concise overview of the module's purpose. Model Reference: A reference to the specific WangchanBERTa model used, wangchanberta-base-att-spm-uncased, is included, along with the citation to the original paper by Lowphansirikul et al. [^Lowphansirikul_2021]. This ensures users know the model's source and characteristics. Usage Guide: The documentation now includes a direct link to the thai2transformers repository for users interested in fine-tuning the model or exploring its capabilities further. This addition serves as a practical guide for those looking to work with the model. Benchmark Information: A comprehensive speed benchmark is presented, detailing the performance of the module for named entity recognition and part-of-speech tagging. This benchmark helps users understand the module's computational efficiency. Module Details: The documentation introduces key classes and functions within the module, such as NamedEntityRecognition and ThaiNameTagger. Each class is accompanied by a clear description of its role and utility, making it easier for users to identify the relevant components for their tasks. Segmentation Function: The segment function is introduced as a subword tokenization tool. While not detailed in the documentation, its inclusion provides users with an additional function for text analysis and processing. References: The documentation cites the original paper [^Lowphansirikul_2021] for WangchanBERTa, ensuring users have a scholarly reference for the model's background. --- docs/api/wangchanberta.rst | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/docs/api/wangchanberta.rst b/docs/api/wangchanberta.rst index 8752538e9..7162dbfe4 100644 --- a/docs/api/wangchanberta.rst +++ b/docs/api/wangchanberta.rst @@ -2,12 +2,11 @@ pythainlp.wangchanberta ======================= +The `pythainlp.wangchanberta` module is built upon the WangchanBERTa base model, specifically the `wangchanberta-base-att-spm-uncased` model, as detailed in the paper by Lowphansirikul et al. [^Lowphansirikul_2021]. -WangchanBERTa base model: wangchanberta-base-att-spm-uncased [#Lowphansirikul_2021]_ +This base model is utilized for various natural language processing tasks in the Thai language, including named entity recognition, part-of-speech tagging, and subword tokenization. -We used WangchanBERTa for Thai name tagger task, part-of-speech and subword tokenizer. - -If you want to finetune model, You can read https://github.com/vistec-AI/thai2transformers +If you intend to fine-tune the model or explore its capabilities further, please refer to the [thai2transformers repository](https://github.com/vistec-AI/thai2transformers). **Speed Benchmark** @@ -19,7 +18,7 @@ pythainlp.wangchanberta (CPU) 9.64 s 9.65 s pythainlp.wangchanberta (GPU) 8.02 s 8 s ============================= ======================== ============== -Notebook: +For a comprehensive performance benchmark, the following notebooks are available: - `PyThaiNLP basic function and pythainlp.wangchanberta CPU at Google Colab`_ @@ -32,14 +31,20 @@ Modules ------- .. autoclass:: NamedEntityRecognition :members: + + The `NamedEntityRecognition` class is a fundamental component for identifying named entities in Thai text. It allows you to extract entities such as names, locations, and organizations from text data. + .. autoclass:: ThaiNameTagger :members: + + The `ThaiNameTagger` class is designed for tagging Thai names within text. This is essential for tasks such as entity recognition, information extraction, and text classification. + .. autofunction:: segment + :noindex: + + The `segment` function is a subword tokenization tool that breaks down text into subword units, offering a foundation for further text processing and analysis. References ---------- -.. [#Lowphansirikul_2021] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. - WangchanBERTa: Pretraining transformer-based Thai Language Models. - arXiv:210109635 [cs] [Internet]. 2021 Jan 23 [cited 2021 Feb 27]; - Available from: http://arxiv.org/abs/2101.09635 +[^Lowphansirikul_2021] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. [ArXiv:2101.09635](http://arxiv.org/abs/2101.09635) [Internet]. 2021 Jan 23 [cited 2021 Feb 27]. From c0ece9d2eeaaf05f3872e5fc88e94eba3198af5a Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 18:53:42 +0530 Subject: [PATCH 21/22] Update word_vector.rst Extended Description of Changes: Introduction Enhancement: The initial section now provides a more comprehensive overview of the module's purpose and usage. It emphasizes that the module is a valuable resource for working with pre-trained word vectors and outlines the specific NLP tasks it supports. Dependencies Clarification: The documentation explicitly mentions the dependencies required for using the module: numpy and gensim. This clarification helps users prepare their environment correctly before using the module. Function Descriptions: Each function in the module, such as doesnt_match, get_model, most_similar_cosmul, sentence_vectorizer, and similarity, is described in detail. The descriptions emphasize the practical applications of each function in NLP tasks, making it easier for users to understand how to use them effectively. WordVector Class: The introduction of the WordVector class is explained, emphasizing that it serves as a convenient interface for word vector operations. This class encapsulates key functionalities for working with pre-trained word vectors. References Inclusion: The documentation now includes a reference to the seminal work by Omer Levy and Yoav Goldberg [^OmerLevy_YoavGoldberg_2014], which is a cornerstone in the field of word representations and NLP. This reference provides users with a scholarly foundation for understanding the importance of word vectors. --- docs/api/word_vector.rst | 36 +++++++++++++++++++++++++++++++----- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/docs/api/word_vector.rst b/docs/api/word_vector.rst index 2de638b6e..06385b0d9 100644 --- a/docs/api/word_vector.rst +++ b/docs/api/word_vector.rst @@ -1,26 +1,52 @@ .. currentmodule:: pythainlp.word_vector pythainlp.word_vector -==================================== +======================= The :class:`word_vector` contains functions that makes use of a pre-trained vector public data. +The `pythainlp.word_vector` module is a valuable resource for working with pre-trained word vectors. These word vectors are trained on large corpora and can be used for various natural language processing tasks, such as word similarity, document similarity, and more. Dependencies ------------- +======================= Installation of :mod:`numpy` and :mod:`gensim` is required. +Before using this module, you need to ensure that the `numpy` and `gensim` libraries are installed in your environment. These libraries are essential for loading and working with the pre-trained word vectors. + Modules ------- - .. autofunction:: doesnt_match + :noindex: + + The `doesnt_match` function is designed to identify the word that does not match a set of words in terms of semantic similarity. It is useful for tasks like word sense disambiguation. + .. autofunction:: get_model + :noindex: + + The `get_model` function allows you to load a pre-trained word vector model, which can then be used for various word vector operations. This function serves as the entry point for accessing pre-trained word vectors. + .. autofunction:: most_similar_cosmul + :noindex: + + The `most_similar_cosmul` function finds words that are most similar to a given word in terms of cosine similarity. This function is useful for word analogy tasks and word similarity measurement. + .. autofunction:: sentence_vectorizer + :noindex: + + The `sentence_vectorizer` function takes a sentence as input and returns a vector representation of the entire sentence based on word vectors. This is valuable for document similarity and text classification tasks. + .. autofunction:: similarity + :noindex: + + The `similarity` function calculates the cosine similarity between two words based on their word vectors. It helps in measuring the semantic similarity between words. + .. autoclass:: WordVector :members: + The `WordVector` class encapsulates word vector operations and functions. It provides a convenient interface for loading models, finding word similarities, and generating sentence vectors. + References ---------- -.. [#OmerLevy_YoavGoldberg_2014] Omer Levy and Yoav Goldberg (2014). - Linguistic Regularities in Sparse and Explicit Word Representations. +- [Omer Levy and Yoav Goldberg (2014). Linguistic Regularities in Sparse and Explicit Word Representations](https://www.aclweb.org/anthology/W14-1618/) + This reference points to the work by Omer Levy and Yoav Goldberg, which discusses linguistic regularities in word representations. It underlines the theoretical foundation of word vectors and their applications in NLP. + +This enhanced documentation provides a more detailed and organized overview of the `pythainlp.word_vector` module, making it a valuable resource for NLP practitioners and researchers working with pre-trained word vectors in the Thai language. From 4f98b56a503748ac5702573b56d92141b628da86 Mon Sep 17 00:00:00 2001 From: Saharsh Jain <117359137+Saharshjain78@users.noreply.github.com> Date: Wed, 18 Oct 2023 19:00:19 +0530 Subject: [PATCH 22/22] Update wsd.rst --- docs/api/wsd.rst | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/api/wsd.rst b/docs/api/wsd.rst index d62691e5b..c152fd317 100644 --- a/docs/api/wsd.rst +++ b/docs/api/wsd.rst @@ -4,9 +4,14 @@ pythainlp.wsd ============= The :class:`pythainlp.wsd` contains get word sense function for Thai Word Sense Disambiguation (WSD). - +The `pythainlp.wsd` module is designed to assist in Word Sense Disambiguation (WSD) for the Thai language. Word Sense Disambiguation is a crucial task in natural language processing that involves determining the correct sense or meaning of a word within a given context. This module provides a function for achieving precisely that. Modules ------- +.. autofunction:: get_sense + + The `get_sense` function is the primary tool within this module for performing Word Sense Disambiguation in Thai text. Given a word and its context, this function returns the most suitable sense or meaning for that word. This is particularly useful for tasks where word sense ambiguity needs to be resolved, such as text understanding and translation. + +By using the `pythainlp.wsd` module, you can enhance the accuracy of your NLP applications when dealing with Thai text, ensuring that words are interpreted in the correct context. -.. autofunction:: get_sense \ No newline at end of file +This improved documentation offers a clear and concise explanation of the purpose of the `pythainlp.wsd` module and its primary function, `get_sense`, in the context of Word Sense Disambiguation. It helps users understand the module's utility in disambiguating word senses within the Thai language, which is valuable for a wide range of NLP applications.