lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: Spell check of a large text
Date Fri, 12 Dec 2008 17:57:30 GMT

On Dec 12, 2008, at 5:36 AM, Lucene User no 1981 wrote:

> Grant,
> It's definitely a dictionary-based spell checker. To flesh it out a
> bit: currently the document gets indexed and then analysed (bad words,
> repetitions, etc.); the spell check, with no corrections, would be yet
> another step in the process. It's all read-only: the document content
> is not modified, it's just tagged accordingly.
> That said, I kind of like your idea; a token filter looks like a good
> candidate. As for Jazzy, is it any different from the Lucene
> SpellChecker (ngram based)?

Yes, Jazzy is actually a dictionary of correctly spelled words.
Lucene's approach (at least the index-based one) is merely a
dictionary of the words that occur in your corpus, misspellings and
all. So, if your goal is to tag words that are really, truly spelled
incorrectly, then I'd say Jazzy or some other dictionary tool is the
way to go.
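
To make that difference concrete, here is a minimal sketch in plain
Java. It uses neither Jazzy nor Lucene; the class name, the method
names, and the word lists are made up for illustration. The only point
is that a vocabulary built from the corpus will happily accept any
misspelling that already occurs in the corpus.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: not Jazzy's or Lucene's API.
final class DictionaryVsCorpus {

    // Builds a lookup set from a list of words.
    static Set<String> vocabulary(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    // A token is flagged when the given word list does not contain it.
    static boolean isFlagged(Set<String> knownWords, String token) {
        return !knownWords.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // A real dictionary holds only correctly spelled words.
        Set<String> dictionary = vocabulary("receive", "the", "payment");
        // A corpus-derived vocabulary holds whatever was indexed.
        Set<String> corpusTerms = vocabulary("recieve", "the", "payment");

        System.out.println(isFlagged(dictionary, "recieve"));
        System.out.println(isFlagged(corpusTerms, "recieve"));
    }
}
```

The first check flags "recieve"; the second does not, because the
misspelling is itself part of the corpus vocabulary.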

> What really matters here is not the
> accuracy (decent but not exceptional; there is a manual double-check
> of tagged docs anyway). What matters most is performance and ease of
> integration. Any grammar check is absolutely immaterial.
> About that payload idea, I can only work with a token in a filter. I
> could attach something and spit it out, but what would that
> something be? It would have to be searchable, I assume; otherwise I
> could perform the check without a filter, outside the index. If it's
> searchable then, apart from querying, I could perhaps make the
> highlighter work with it nicely.

Payloads live on Tokens; see the Token.setPayload() method. They would
then be searchable using the BoostingTermQuery (BTQ), but you may
need to write some other type of query.
For instance, the BTQ will, I believe, let you say "give me all
documents where a particular term is misspelled", and you can specify
that term. However, you may also want "give me all documents that
have misspellings", and that is not something the BTQ can do. You
probably could hack up the MatchAllDocsQuery to do it, though. Or you
could maybe write a QueryFilter that turns on all docs that have a
payload present. This is totally out there at this point, so take it
with a grain of salt. I think you can achieve what you want, but it
will take some lifting.
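
As a rough model of the two kinds of question, here is a plain-Java
sketch that deliberately avoids Lucene classes. TaggedToken, tag(), and
the two lookup methods are hypothetical names, and the boolean flag only
stands in for what would be a payload set via Token.setPayload() inside
a token filter. It shows the shape of the data, not how the queries
would actually be executed against an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model only: the boolean flag stands in for a payload.
final class MisspellingTagModel {

    // A token plus a marker; the term itself is never modified.
    static final class TaggedToken {
        final String term;
        final boolean misspelled;

        TaggedToken(String term, boolean misspelled) {
            this.term = term;
            this.misspelled = misspelled;
        }
    }

    // Plays the role of the token filter: tag each token, change nothing.
    static List<TaggedToken> tag(String text, Set<String> dictionary) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                tokens.add(new TaggedToken(term, !dictionary.contains(term)));
            }
        }
        return tokens;
    }

    // "Give me all documents where this particular term is misspelled."
    static List<Integer> docsWhereTermMisspelled(
            List<List<TaggedToken>> docs, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (TaggedToken token : docs.get(docId)) {
                if (token.misspelled && token.term.equals(term)) {
                    hits.add(docId);
                    break;
                }
            }
        }
        return hits;
    }

    // "Give me all documents that have misspellings": no term is given.
    static List<Integer> docsWithAnyMisspelling(List<List<TaggedToken>> docs) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (TaggedToken token : docs.get(docId)) {
                if (token.misspelled) {
                    hits.add(docId);
                    break;
                }
            }
        }
        return hits;
    }
}
```

The first lookup is keyed on a term, which is why a term-based query
fits it. The second has no term to start from, which is why it would
need a different kind of query or a filter.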

I have no clue about the performance, but I think the indexing approach
could be pretty fast, especially if you can check tokens against a
cache of commonly misspelled terms. I would test that first, though.
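
One way such a cache might look, as a sketch under assumptions: a small
LRU map in front of the dictionary lookup, built on
java.util.LinkedHashMap. The class name and the lookup counter are
hypothetical, and the HashSet merely stands in for a real (and
presumably slower) dictionary such as Jazzy. Note that it caches the
verdict for every recently seen term, not only for misspelled ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: an LRU cache of spell-check verdicts.
final class CachedSpellLookup {
    private final Set<String> dictionary;       // stand-in for a real dictionary
    private final Map<String, Boolean> cache;   // term -> "is misspelled"
    int dictionaryLookups = 0;                  // counts cache misses, for testing

    CachedSpellLookup(Set<String> dictionary, final int maxEntries) {
        this.dictionary = dictionary;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    boolean isMisspelled(String term) {
        Boolean cached = cache.get(term);
        if (cached != null) {
            return cached;                      // cache hit: skip the dictionary
        }
        dictionaryLookups++;
        boolean misspelled = !dictionary.contains(term);
        cache.put(term, misspelled);
        return misspelled;
    }
}
```

Whether this helps at all depends on how expensive the underlying
dictionary lookup is, so it is worth measuring before adopting it.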

