Apache OpenOffice (AOO) Bugzilla – Issue 43583
RFE: Spellchecker API isn't appropriate for languages without spaces between words, like Thai
Last modified: 2013-02-24 21:08:41 UTC
Thai text doesn't have any spaces between words. This usually breaks assumptions in some code. The API com.sun.star.linguistic2.XSpellChecker is one example.

XSpellChecker is the interface for spell checking. It has two methods:

    bool isValid(string aWord, Locale, aProperties);
    XSpellAlternatives spell(string aWord, Locale, aProperties);

See http://api.openoffice.org/docs/common/ref/com/sun/star/linguistic2/XSpellChecker.html

Normally isValid() is used to check whether a word is correctly spelled. If the word is incorrectly spelled, spell() is then used to get suggested spelling alternatives from the dictionaries. This works as long as you get the boundary of the spelling error right, which is the case for Western text.

However, for text in languages without spaces between words, like Thai, it's usually impossible to know the boundary of the spelling error correctly (before trying to find suggestions). I'll use English without spaces between words as an example:

    Theytrytomanufacturetables. (They try to manufacture tables)

Suppose the word 'manufacture' is misspelled as 'manifacture':

    Theytrytomanifacturetables.

The word breaker (which works with Thai) will break the words like this:

    They|try|to|man|ifacture|tables|.

Each word will pass isValid() until 'ifacture'. The string 'ifacture' will be flagged as misspelled. Calling spell(), you may get 'facture' as a suggestion. Which is incorrect! Applying it gives:

    Theytrytomanfacturetables. (They try to manfacture tables)

This is because XSpellChecker doesn't have the whole picture of the input string. It only sees one segmented word (sent from the ICU word breaker) at a time. It cannot do any better.

To make the situation worse, ICU DictionaryBasedBreakIterator uses a dictionary for Thai which will be different from the one used by the Thai spellchecker.
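The mis-segmentation above can be reproduced with a small sketch. This is not the ICU DictionaryBasedBreakIterator; it is a hypothetical greedy longest-match breaker over a toy English dictionary, and the names Segment and segment() are made up for illustration only.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One piece of the segmented text. `known` is false for a run of
// characters that matched no dictionary word.
struct Segment {
    std::string text;
    bool known;
};

// Hypothetical greedy longest-match word breaker (NOT the ICU code).
std::vector<Segment> segment(const std::string& text,
                             const std::set<std::string>& dict) {
    // Length of the longest dictionary word starting at `pos`, or 0 if none.
    auto longestMatch = [&](std::size_t pos) -> std::size_t {
        for (std::size_t len = text.size() - pos; len > 0; --len)
            if (dict.count(text.substr(pos, len))) return len;
        return 0;
    };

    std::vector<Segment> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = longestMatch(pos);
        if (len > 0) {
            out.push_back({text.substr(pos, len), true});
            pos += len;
        } else {
            // Collect unknown characters until a dictionary word starts again.
            std::size_t end = pos + 1;
            while (end < text.size() && longestMatch(end) == 0) ++end;
            out.push_back({text.substr(pos, end - pos), false});
            pos = end;
        }
    }
    assert(pos == text.size());  // every character ended up in some segment
    return out;
}
```

With the toy dictionary {they, try, to, man, manufacture, table, tables}, the input "theytrytomanifacturetables" comes out as they|try|to|man|ifacture|tables, with only 'ifacture' marked unknown. The word 'man' has already been consumed as a valid word, so a per-word spell() call never sees it.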
(Thai spellcheckers have been implemented in Thai versions of OOo like OpenOfficeTLE and Pladao, but the quality is not good enough to be usable because of the issues mentioned here.)

How to do it appropriately for languages without spaces between words:

1) send the whole string to the spellchecker
2) have a function to iterate over the misspelled words in the string

For example:

    class XSpellChecker2 {
        XSpellAlternatives* spell(const OUString& string, int start, const Locale&, aProperties, int& errorBegin, int& errorEnd);
    }

Then you can loop through the spelling errors with:

    int begin, end, i = 0;
    while( i < str.getLength() ) {
        // assumption: spell() also sets 'end' (to the end of the scanned text) when it returns NULL
        XSpellAlternatives* xAlt = spellchecker2.spell(str, i, locale, emptyProps, begin, end);
        if (xAlt != NULL) {
            OUString spellingError = str.copy(begin, end-begin);
            DisplayAlternatives(spellingError, xAlt);
        }
        i = end;
    }

How this solves the problem for Thai:

This is an algorithm for Thai.

    They|try|to|man<ifacture>tables|.

XSpellChecker2::spell() for Thai will iterate through the correct words until 'ifacture'. Then it will try to find suggestions for:

- 'ifacture': found 'facture'
- plus one word before, 'manifacture': found 'manufacture'
- plus one word after, 'ifacturetables': not found
- plus one word before and after, 'manifacturetables': not found

The algorithm will select the result with the longest misspelling, 'manifacture', and suggest 'manufacture', correctly. e.g.

    xAlt = spellchecker2.spell("Theytrytomanifacturetables", 0, locale, emptyProps, begin, end);
    // begin = 9, end = 20 (one past the last character of 'manifacture'), xAlt contains 'manufacture'

An example in Thai:

    ใช้คอมพวเตอร์ได้

The word คอมพิวเตอร์ is misspelled as คอมพวเตอร์. However, คอ is a word in Thai. The text is segmented as:

    ใช้|คอ<มพวเตอร์>ได้

Old algorithm: flag 'มพวเตอร์' as a misspelling, find no suggestion.
New algorithm:

- Word breaking finds that the segment "มพวเตอร์" is not a Thai word
- try "มพวเตอร์": fail
- try "คอมพวเตอร์": suggest "คอมพิวเตอร์"
- try "มพวเตอร์ได้": fail
- try "คอมพวเตอร์ได้": fail
- Found that the misspelling is "คอมพวเตอร์", suggest "คอมพิวเตอร์"
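The neighbour-expansion step of the proposed algorithm could look roughly like the sketch below, using the English example. Everything here is hypothetical: Segment, SpellResult, suggest() and spellFirstError() are illustrative names, not part of XSpellChecker or of any proposed XSpellChecker2, and the suggestion engine is a toy that accepts any dictionary word exactly one edit away.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Segment { std::string text; bool known; };

// Character range [begin, end) of the misspelling plus one suggestion.
struct SpellResult {
    std::size_t begin;
    std::size_t end;
    std::string suggestion;
    bool found;
};

// Plain Levenshtein distance between two byte strings.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Toy suggestion engine (an assumption): first word one edit away, else "".
std::string suggest(const std::string& word,
                    const std::vector<std::string>& dict) {
    for (const auto& entry : dict)
        if (editDistance(word, entry) == 1) return entry;
    return "";
}

// Finds the first unknown segment, tries it alone and combined with one
// neighbour before and/or after, and keeps the longest candidate that
// yields a suggestion.
SpellResult spellFirstError(const std::vector<Segment>& segs,
                            const std::vector<std::string>& dict) {
    std::size_t offset = 0;  // character offset of segs[i] in the full text
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (!segs[i].known) {
            SpellResult best{offset, offset + segs[i].text.size(), "", false};
            for (int before = 0; before <= 1; ++before) {
                for (int after = 0; after <= 1; ++after) {
                    if (before && i == 0) continue;
                    if (after && i + 1 == segs.size()) continue;
                    std::string candidate;
                    std::size_t begin = offset;
                    if (before) {
                        candidate += segs[i - 1].text;
                        begin -= segs[i - 1].text.size();
                    }
                    candidate += segs[i].text;
                    if (after) candidate += segs[i + 1].text;
                    std::string s = suggest(candidate, dict);
                    if (!s.empty() &&
                        (!best.found || candidate.size() > best.end - best.begin))
                        best = {begin, begin + candidate.size(), s, true};
                }
            }
            assert(best.begin <= best.end);
            return best;
        }
        offset += segs[i].text.size();
    }
    return {offset, offset, "", false};  // no unknown segment in the text
}
```

For the segments they|try|to|man|ifacture|tables and a toy dictionary containing both 'facture' and 'manufacture', this returns the range 9..20 with the suggestion 'manufacture'. A real implementation would need a policy for ties and for several unknown segments in a row, which this sketch does not attempt.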
confirmed.
Hi, I think this problem occurs not only in oriental languages (Thai, Chinese, ...) but probably in all languages that join words, e.g. German.
For languages that do use spaces as word boundaries but also use compound words, collecting the whole vocabulary is the right way. See http://tkltrans.sourceforge.net/tklspell/compound.htm for an explanation. This study might also show some pitfalls of the algorithm suggested here.

I admit I have no better solution for Thai than the one Samphan suggested. I think that for languages where writing is more word based, like Chinese or Japanese, Samphan's solution is quite good. For languages where writing is more character based, there are some serious pitfalls, and there Samphan's suggested solution is not very good.

Eleonora
Some thoughts. English examples:

    She arab bin ary
    she a rabbi nary
    shea rabby nary
    shear abby nary --- abby could be wrong

There are innumerable possibilities for wrong and good words in a simple sentence. Where should the loop continue? After she? shea? shear?
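To make the ambiguity concrete, here is a small sketch that enumerates every way a string can be split into dictionary words. The function names and the toy dictionary are assumptions for illustration; the dictionary uses the spelling 'rabbi' and leaves out 'rabby' and 'abby'.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Depth-first search over all dictionary words that start at `pos`.
void collect(const std::string& text, std::size_t pos,
             const std::set<std::string>& dict,
             std::vector<std::string>& current,
             std::vector<std::string>& results) {
    assert(pos <= text.size());
    if (pos == text.size()) {
        // Reached the end: record this segmentation, words joined by spaces.
        std::string joined;
        for (std::size_t i = 0; i < current.size(); ++i) {
            if (i) joined += ' ';
            joined += current[i];
        }
        results.push_back(joined);
        return;
    }
    for (std::size_t len = 1; pos + len <= text.size(); ++len) {
        std::string word = text.substr(pos, len);
        if (dict.count(word)) {
            current.push_back(word);
            collect(text, pos + len, dict, current, results);
            current.pop_back();
        }
    }
}

// Returns every segmentation of `text` into words from `dict`.
std::vector<std::string> allSegmentations(const std::string& text,
                                          const std::set<std::string>& dict) {
    std::vector<std::string> current, results;
    collect(text, 0, dict, current, results);
    return results;
}
```

With the dictionary {she, shea, shear, a, arab, rabbi, bin, ary, nary}, the string "shearabbinary" has three complete segmentations: "she a rabbi nary", "she arab bin ary" and "shea rabbi nary". Starting with 'shear' leads to a dead end in this dictionary, which is exactly the kind of case where a loop has to decide where to resume.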
Samphan, this problem is not a standard spell checking problem. Besides the standard dic/aff word files, you must also provide a sentence-to-word-breaking program or subroutine that breaks each sentence down into words. Then the spell checker can check it according to the standard rules, offering the replacements, if any.

Your word breaking algorithm must use a dictionary that enables you to break the sentence into words, and maybe a POS (part-of-speech) tagger that helps to find the optimal breaking.

The final functionality would look like this:

- myspell gets the sentence
- myspell calls your breaker program/subroutine
- myspell checks the sentence broken into words, and passes back the suggestions, if any
- myspell gets the next sentence, etc.

The only support you can expect from the spell checker is to provide the interface to your sentence breaking program/subroutine.
Samphan, if and when using the ICU word breaker, you had better make sure that it uses the same dictionary as myspell. This would reduce problems like the They|try|to|man|ifacture|tables| case you mention. Since ICU is an open source project, this should be possible.
I suggest extending or replacing the ICU DictionaryBasedBreakIterator with a more sophisticated (Thai)POSTaggerAndSpellCheckerSuggestionBasedBreakIterator. I think this is not a spell checking problem, but for better word breaking you need the help of the spell checker's suggestion mechanism too.

But I can imagine a simple (half) solution too. Modify, extend or replace the ICU DictionaryBasedBreakIterator so that, for Thai text, it joins an unknown word with its known neighbours:

    Theytrytomanifacturetables -> (They|try|to|man|ifacture|tables|) -> They|try|to|manifacturetables|

MySpell checks "manifacturetables" as a compound word, and suggests "manufacturetables". Not so pretty, but it works. You don't need new APIs. (Perhaps it's not the right solution, because it modifies the hyphenation or type-setting. I don't know.)

Laci
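A rough sketch of the "simple (half) solution" above: glue each unknown segment to its immediate neighbours before the result is handed to the spell checker. This is only an illustration on plain strings. It does not touch or model the real ICU break iterator, and Segment and mergeUnknownWithNeighbours() are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Segment { std::string text; bool known; };

// Joins every unknown segment with the segment directly before and after it.
// Note: two unknown segments that are close together end up in one long run.
std::vector<std::string> mergeUnknownWithNeighbours(
        const std::vector<Segment>& segs) {
    // absorb[i] is true when segment i is unknown or adjacent to an unknown one.
    std::vector<bool> absorb(segs.size(), false);
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (segs[i].known) continue;
        absorb[i] = true;
        if (i > 0) absorb[i - 1] = true;
        if (i + 1 < segs.size()) absorb[i + 1] = true;
    }

    std::vector<std::string> out;
    bool joining = false;  // true while we are inside a run of absorbed segments
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (absorb[i] && joining)
            out.back() += segs[i].text;   // extend the current merged word
        else
            out.push_back(segs[i].text);  // start a new word
        joining = absorb[i];
    }
    assert(out.empty() == segs.empty());
    return out;
}
```

For They|try|to|man|ifacture|tables with only 'ifacture' unknown, the result is They|try|to|manifacturetables, which matches the example in the comment. Whether the spell checker can then suggest "manufacturetables" depends on its compound-word handling, which is outside this sketch.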
> (Perhaps it's not a right solution, because it modifies the
> hyphenation or type-setting. I don't know.)

This is not a problem if you correct the word, or put the unknown word into the custom dictionary, because the break iterator then recomputes the word splitting. But at first this is not so trivial for users.

Laci
set target to OOo Later.
FT: I'm leaving so I will re-assign this issue to requirement default user
OpenOffice.org Issue Tracker - Feedback Request.

The Issue you raised is currently assigned to 'Requirements' pending review, but has not been updated within the last 2+ years.

Please consider re-testing with one of the latest versions of OOo, as the problem(s) may have already been addressed. Either use the recent stable version:
http://download.openoffice.org/index.html
or consider trying the new OOo 3 BETA (still in testing):
http://download.openoffice.org/3.0beta/

Please report back the outcome so this Issue may be Closed or Progressed as necessary - otherwise it may be Resolved as Invalid in the future.

You may also wish to search for (and note) any duplicates of this Issue that may have advanced further by checking the Issue Tracker:
http://www.openoffice.org/issues/query.cgi

Many thanks,
Andrew

Cleaning-up and Closing old Issues as part of:
~ The Grand Bug Squash, pre v3 ~
http://marketing.openoffice.org/3.0/announcementbeta.html
Confirming that the issue is still valid.
Just started working on a solution for Khmer word breaking with the friendly folks at ICU - hopefully there will be some good progress!