Spell-Correct: Probability of all corrected words are the same #90
Observation

Confirmed @MingPawat's observation. Below is a result from PyThaiNLP 1.7.0.1:

>>> from pythainlp.spell import pn
>>> pn.prob("กิน")
1.9348347651110595e-05
>>> pn.prob("ข้าว")
1.9348347651110595e-05
>>> pn.prob("กัน")
1.9348347651110595e-05
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0

Every word that is in the dictionary gets the same probability, 1.9348347651110595e-05. I tried using word frequencies from the Thai National Corpus (TNC) instead, replacing

WORDS = Counter(thaiword.get_data())

with

WORDS = Counter(dict(tnc.get_word_frequency_all()))

Here's the result:

>>> pn.prob("กิน")
0.0006138412282452856
>>> pn.prob("ข้าว")
0.00026716049573969757
>>> pn.prob("กัน")
0.003979265980548341
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0

Difference in spelling check end result

Original spell checker (using thaiword.txt):

>>> pythainlp.spell("เหลีนม")
['เหลิม', 'เหลียน', 'เหลือม', 'เหลน', 'เหลียน', 'เลียม', 'เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม', 'เหลียน', 'เหลน', 'เหลิม', 'เหลี่ยม', 'เหลียว', 'เหลือม', 'เหนียม', 'เหลียน', 'เหลียน', 'เหลี่ยม', 'เหลี่ยม', 'เหลิม', 'เหลิม']
>>> pythainlp.spell("เหลียม")
['เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม']

Modified spell checker (using TNC word frequency):

>>> pythainlp.spell("เหลีนม")
['เหลียม']
>>> pythainlp.spell("เหลียม")
['เหลียม']

This is mainly because thaiword.txt does not contain the word "เหลียม", but TNC does.

Other tests with TNC:

>>> pythainlp.spell("กกฎาคม")
['กรกฎาคม']
>>> pythainlp.spell("อนุญาติ")
['อนุญาต']
>>> pythainlp.spell("กิเลย")
['กิเลน', 'กิเลส']
>>> pythainlp.spell("สัตค์")
['สัตว์', 'สัตย์', 'สัตร์', 'สัตถ์']

From a quick human (my own) judgement, the order of the suggestions looks reasonable.

Possible problem with "real world" examples

The problem with using text from a corpus (like TNC) is that if the corpus contains a misspelled word, the spell checker may suggest that misspelled word. This needs to be investigated as well.
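To illustrate the concern about misspellings in a corpus, here is a minimal sketch, not the PyThaiNLP code. It assumes a Norvig-style rule where a word already in the vocabulary is returned as its own suggestion. The counts, the `suggest` helper, and the hand-supplied candidate set are all hypothetical.

```python
from collections import Counter

def suggest(word, vocab, candidates):
    """Norvig-style ordering: a known word wins over any edited candidate.

    `candidates` stands in for the edit-distance candidates a real checker
    would generate; it is passed in by hand to keep the sketch short.
    """
    if word in vocab:
        return [word]
    known = [c for c in candidates if c in vocab]
    # Most frequent known candidate first.
    return sorted(known, key=lambda c: vocab[c], reverse=True)

# Hypothetical counts: the corpus happens to contain the misspelling "อนุญาติ".
noisy_corpus = Counter({"อนุญาต": 40, "อนุญาติ": 3})
clean_vocab = Counter({"อนุญาต": 40})

print(suggest("อนุญาติ", noisy_corpus, {"อนุญาต"}))  # misspelling is treated as known
print(suggest("อนุญาติ", clean_vocab, {"อนุญาต"}))   # misspelling gets corrected
```

With the noisy counts the misspelled input is accepted unchanged, because it is "known" to the vocabulary, while the clean vocabulary corrects it.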
@MingPawat I have opened pull request #137 to fix this based on your suggestion. If you have time, please review whether it works correctly. Thank you.
Fixed with #137
In pythainlp/pythainlp/spell/pn.py, you said you forked the code from http://norvig.com/spell-correct.html. As far as I understand, you use the same implementation as in the link.
In your code, you import "WORDS" from a dictionary, whereas the link above uses a corpus (big.txt). This makes the probabilities of all corrected words the same, because every word appears only once. The idea behind this code is to choose the most frequent word in the corpus.
Just change "WORDS" to a big corpus.
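The effect described here can be reproduced with a small sketch in the style of the Norvig spell corrector, where P(word) = count(word) / total count. This is an illustration only, not the actual pn.py code, and the corpus counts below are made-up toy numbers. With a plain word list every entry has count 1, so each known word gets probability 1/N. The value 1.9348347651110595e-05 reported in this thread is about 1/51,684, which would be consistent with a list of roughly that many entries.

```python
from collections import Counter

def make_prob(words):
    """Return P(word) = count(word) / total count, as in Norvig's corrector."""
    total = sum(words.values())
    def prob(word):
        # Counter returns 0 for unseen words, so unknown words get 0.0.
        return words[word] / total
    return prob

# Dictionary-style input: every word is listed exactly once.
dict_counts = Counter(["กิน", "ข้าว", "กัน"])
prob_dict = make_prob(dict_counts)

# Corpus-style input: hypothetical toy frequencies, not real TNC numbers.
corpus_counts = Counter({"กิน": 6, "ข้าว": 3, "กัน": 11})
prob_corpus = make_prob(corpus_counts)

for w in ["กิน", "ข้าว", "กัน", "กงง"]:
    print(w, prob_dict(w), prob_corpus(w))
```

With the dictionary counts all three known words tie at 1/3, so ranking candidates by probability cannot separate them. With corpus counts the ranking follows word frequency.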