lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lucene User no 1981 <Mac...@holiday-rentals.co.uk>
Subject Re: Spell check of a large text
Date Fri, 12 Dec 2008 10:36:55 GMT

Grant,

It's definitely dictionary based spell checker. A bit fleshing out, 
currently the document gets indexed and then it's analysed (bad words, 
repetitions etc), spell check - no corrections - would be yet another 
step in the process. It's all read-only stuff, the document content is 
not modified, it's just tagged accordingly.
That said, I kind of like your idea, I mean token filter looks like the 
good candidate. As of Lazzy, is it any different than Lucene 
SpellChecker (ngram based)? what really matters here is not the 
accuracy (decent but not exceptional - there is a manual double- check 
of tagged docs anyway), what matters most is performance and ease of 
integration. Any grammar check is absolutely immaterial.
About that payload idea, I can only work with a token in a filter. I 
could attach something and spit it out, but what would be that 
something? It would have to be searchable I assume, otherwise I could 
perform the check without filter, out of index. If it's searchable 
then, apart from querying, I could perhaps make highlighter work with 
it nicely.

Thx,
Mac


Grant Ingersoll-6 wrote:
> 
> I think I'm missing something here...
> 
> Spell checked in what sense?  Sounds to me like you need dictionary  
> based spell checking during index, not index based spelling during  
> search, right?
> 
> How about hooking up something like the Jazzy spell checker into a  
> TokenFilter?  Then, as the tokens stream by, you lookup the spelling  
> and then add a 1 byte payload to all words that are misspelled.
> 
> As for Highlighter, hmmm...  Not sure if there is a way to make a  
> Fragmenter/Scorer that was payload aware, such that it would only  
> produce fragments (and scores) for sections of the file that have  
> these payloads.  Definitely pushing my area of expertise, but maybe  
> one of the Highlighter experts can chime in.
> 
> HTH,
> Grant
> 
> On Dec 11, 2008, at 6:18 AM, Lucene User no 1981 wrote:
> 
>>
>> Hi,
>>
>> the problem is as follows: there is a text, ca. 30kb, it has to be
>> spellchecked automatically, there is no manual intervention, no  
>> suggestions
>> needed. All I would like to achieve is a simple check if there are any
>> problems with the spelling or not. It has to be rather fast cause  
>> there are
>> tons of docs a minute going thru the system. Solutions like
>> SpellChecker.exists() don't really apply. Additionally, spelling  
>> errors
>> could be highlighted - haven't really found any reasonable way of  
>> leveraging
>> Highlighter for that task.
>>
>> Does anyone have any idea how this problem can be addressed with  
>> Lucene?
>>
>> Regards,
>> Mac
>> -- 
>> View this message in context:
>> http://www.nabble.com/Spell-check-of-a-large-text-tp20953625p20953625.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
> 
> --------------------------
> Grant Ingersoll
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Spell-check-of-a-large-text-tp20953625p20973238.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message