lucene-dev mailing list archives

From "Khaled Hammouda (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1630) StringIndexOutOfBoundsException in SpellCheckComponent
Date Wed, 02 Jun 2010 19:56:39 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874765#action_12874765 ]

Khaled Hammouda commented on SOLR-1630:
---------------------------------------

We just hit this bug as well. To reproduce, index a document containing a term with a hyphen
(or underscore) and then search for a misspelled version of that term, e.g.:

document contains: mid-term
query: mis-term
result: exception thrown

I looked at the code where this happens, and it seems to be related to the token offsets
(of the tokenized query) in conjunction with a feature of the spellcheck component called
collation. Basically, collation tries to replace the misspelled words in the original query
with the top suggested words. It relies on the offsets reported by the tokenizer to remove each
original misspelled word and insert the suggested one (using StringBuilder.replace).
Unfortunately, the token offsets look weird for words with hyphens (or underscores); for example:

query: abc_def
1st token: value = abc; startOffset = 0; endOffset = 7
2nd token: value = def; startOffset = 0; endOffset = 7

Because the two tokens occupy the same range (0-7), this messes up the replacement logic. I'm
not sure whether this tokenizer behavior is correct, but it's part of the problem.
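
To make the failure mode concrete, here is a minimal standalone sketch. It is not the actual SpellCheckComponent code: the class, the Tok type, the suggestion strings, and the way the running shift is computed are all assumptions chosen for illustration. It only shows how offset-based replacement with StringBuilder.replace can go wrong once two tokens report the same span:

```java
import java.util.Arrays;
import java.util.List;

public class CollationOffsetSketch {

    // Hypothetical stand-in for an analyzed query token and its top suggestion.
    public static final class Tok {
        final String suggestion;
        final int startOffset;
        final int endOffset;

        public Tok(String suggestion, int startOffset, int endOffset) {
            this.suggestion = suggestion;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Simplified collation (an assumption, not Solr's implementation): replace each
    // token's reported span with its suggestion, and keep a running shift so that
    // later spans account for length changes made by earlier replacements.
    public static String collate(String query, List<Tok> toks) {
        StringBuilder sb = new StringBuilder(query);
        int shift = 0;
        for (Tok t : toks) {
            sb.replace(t.startOffset + shift, t.endOffset + shift, t.suggestion);
            shift += t.suggestion.length() - (t.endOffset - t.startOffset);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Distinct, non-overlapping spans: replacement works as intended.
        List<Tok> distinct = Arrays.asList(new Tok("abx", 0, 3), new Tok("dex", 4, 7));
        System.out.println(collate("abc_def", distinct)); // prints "abx_dex"

        // Both tokens claim the whole query (0-7), as in the offsets reported above.
        // After the first replacement the shift is 3 - 7 = -4, so the second call
        // asks StringBuilder.replace for a negative start index.
        List<Tok> overlapping = Arrays.asList(new Tok("abx", 0, 7), new Tok("dex", 0, 7));
        try {
            collate("abc_def", overlapping);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

With distinct spans the bookkeeping stays consistent; with identical spans the first replacement already consumes the whole query, so any later span adjusted by the shift points outside the string.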

Having said that, I tried changing the spellcheck tokenizer from standard to whitespace, and
this actually solved the problem: no errors, and I get correct suggestions.

So, until this gets fixed, you can either:

1) Disable spellchecker collation, or
2) Use a whitespace tokenizer for the spellchecker component
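
As a rough illustration of why the second workaround helps, the sketch below imitates whitespace tokenization with a plain regex. It is not Lucene's WhitespaceTokenizer, and the class and method names are made up for this example. The point is only that splitting on whitespace keeps a hyphenated word as one token and gives every token its own non-overlapping span:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceOffsetsSketch {

    // Lists each whitespace-delimited token followed by its [start,end) character span.
    public static String describe(String query) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile("\\S+").matcher(query);
        while (m.find()) {
            out.append(m.group())
               .append('[').append(m.start()).append(',').append(m.end()).append("] ");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // The hyphenated word stays a single token with a span of its own.
        System.out.println(describe("mis-term paper")); // prints "mis-term[0,8] paper[9,14]"
    }
}
```

Since no two spans overlap, replacing one token by offset cannot invalidate the offsets of another beyond a simple length shift.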

> StringIndexOutOfBoundsException in SpellCheckComponent
> ------------------------------------------------------
>
>                 Key: SOLR-1630
>                 URL: https://issues.apache.org/jira/browse/SOLR-1630
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, spellchecker
>    Affects Versions: 1.4
>         Environment: Solr 1.4
> Lucene 2.9.1
> Win XP
> java version "1.6.0_14"
>            Reporter: Robin Wojciki
>            Assignee: Shalin Shekhar Mangar
>         Attachments: bug.xml, schema.xml, SOLR-1630.patch, solrconfig.xml, spellcheckconfig.xml
>
>
> For some documents/search strings, the SpellCheckComponent throws StringIndexOutOfBoundsException
> See: http://www.lucidimagination.com/search/document/3be6555227e031fc/
> h2. Replication
>  * Save attached schema.xml and solrconfig.xml in apache-solr-1.4.0/example/solr/conf
>  * Start Solr
>  * Index attached bug.xml
>  * Query [http://localhost:8983/solr/select/?q=awehjse-wjkekw]
> It throws a StringIndexOutOfBoundsException
> {noformat} String index out of range: -7
> java.lang.StringIndexOutOfBoundsException: String index out of range: -7
> 	at java.lang.AbstractStringBuilder.replace(Unknown Source)
> 	at java.lang.StringBuilder.replace(Unknown Source)
> 	at org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
> 	at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
> 	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
> 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> 	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> {noformat} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


