lucene-dev mailing list archives

From "James Dyer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
Date Thu, 30 Sep 2010 16:15:35 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916533#action_12916533 ]

James Dyer commented on SOLR-2010:
----------------------------------

Grant,

It wouldn't be difficult to create an uber-patch that lets users pick which approach to use.
If that's the route you want to take, I'd be happy to do that. However, I think it would
be best to stick with the "recombine" approach: although it produces throw-away collations,
all of the work is done internally within the shard, so the performance penalty in most cases
will be slight. The "Search Handler" approach, on the other hand, has to query
over the network for each *try*, which could be significant. I wouldn't say that you would
never benefit from the "Search Handler" option, but I wonder if it warrants the extra lines of
code, the changes to the SearchHandler class, etc.

Unfortunately I haven't done any performance testing with these. We are only in early development
here with Solr and I don't have access to multiple servers on which I can easily deploy
such a test. On a non-distributed setup this patch adds only a little overhead, and
I wouldn't expect the "recombine" option to be much worse than that.

Note that with either approach I'd imagine you'd frequently run into the case where some/many
shards simply do not have the documents the user is looking for, so they will have to query
up to "spellcheck.maxCollationTries" times only to come up empty. In that case the shard(s) that
do get results may need to wait for the shards that are busy querying away in vain...
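
To make the cost of those tries concrete, here is a minimal Python sketch of a verify-and-retry loop of the kind discussed above. It is not the patch's code: the function names, the `count_hits` callback (standing in for re-querying a shard's index), and the plain cartesian-product ordering of candidates are all assumptions made for illustration.

```python
from itertools import islice, product
from typing import Callable, Dict, Iterator, List, Tuple

Corrections = Dict[str, str]


def candidate_collations(suggestions: Dict[str, List[str]]) -> Iterator[Corrections]:
    """Yield every combination of one correction per misspelled word.

    Assumption: candidates are enumerated in plain cartesian-product order;
    the real collator may rank them differently.
    """
    words = list(suggestions)
    for combo in product(*(suggestions[w] for w in words)):
        yield dict(zip(words, combo))


def verified_collations(
    suggestions: Dict[str, List[str]],
    count_hits: Callable[[Corrections], int],  # hypothetical: re-queries the index
    max_tries: int,
    max_collations: int,
) -> Tuple[List[Tuple[Corrections, int]], int]:
    """Return (collations that produced hits, number of tries spent)."""
    found: List[Tuple[Corrections, int]] = []
    tries = 0
    # Stop after max_tries candidates, whether or not anything was found.
    for corrections in islice(candidate_collations(suggestions), max_tries):
        tries += 1
        hits = count_hits(corrections)
        if hits > 0:
            found.append((corrections, hits))
            if len(found) >= max_collations:
                break
    return found, tries
```

A shard whose `count_hits` always returns 0 spends every allowed try (or runs out of candidates) before it can answer, which is the waiting scenario described above.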

Let me know if you want an "uber-patch"; I might have a little time later today.

> Improvements to SpellCheckComponent Collate functionality
> ---------------------------------------------------------
>
>                 Key: SOLR-2010
>                 URL: https://issues.apache.org/jira/browse/SOLR-2010
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, spellchecker
>    Affects Versions: 1.4.1
>         Environment: Tested against trunk revision 966633
>            Reporter: James Dyer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch,
SOLR-2010.txt, SOLR-2010_141.patch, SOLR-2010_shardRecombineCollations_993538.patch, SOLR-2010_shardRecombineCollations_999521.patch,
SOLR-2010_shardSearchHandler_993538.patch, SOLR-2010_shardSearchHandler_999521.patch
>
>
> Improvements to SpellCheckComponent Collate functionality
> Our project requires a better Spell Check Collator.  I'm contributing this as a patch
to get suggestions for improvements and in case there is a broader need for these features.
> 1. Only return collations that are guaranteed to result in hits if re-queried (applying
original fq params also).  This is especially helpful when there is more than one correction
per query.  The 1.4 behavior does not verify that a particular combination will actually return
hits.
> 2. Provide the option to get multiple collation suggestions
> 3. Provide extended collation results including the # of hits re-querying will return
and a breakdown of each misspelled word and its correction.
> This patch is similar to what is described in SOLR-507 item #1.  Also, this patch provides
a viable workaround for the problem discussed in SOLR-1074.  A dictionary could be created
that combines the terms from the multiple fields.  The collator then would prune out any spurious
suggestions this would cause.
> This patch adds the following spellcheck parameters:
> 1. spellcheck.maxCollationTries - maximum # of collation possibilities to try before
giving up.  Lower values ensure better performance.  Higher values may be necessary to find
a collation that can return results.  Default is 0, which maintains backwards-compatible behavior
(do not check collations).
> 2. spellcheck.maxCollations - maximum # of collations to return.  Default is 1, which
maintains backwards-compatible behavior.
> 3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing
collations found.  default is false, which maintains backwards-compatible behavior.  When
true, output is like this (in context):
> <lst name="spellcheck">
> 	<lst name="suggestions">
> 		<lst name="hopq">
> 			<int name="numFound">94</int>
> 			<int name="startOffset">7</int>
> 			<int name="endOffset">11</int>
> 			<arr name="suggestion">
> 				<str>hope</str>
> 				<str>how</str>
> 				<str>hope</str>
> 				<str>chops</str>
> 				<str>hoped</str>
> 				etc
> 			</arr>
> 		</lst>
> 		<lst name="faill">
> 			<int name="numFound">100</int>
> 			<int name="startOffset">16</int>
> 			<int name="endOffset">21</int>
> 			<arr name="suggestion">
> 				<str>fall</str>
> 				<str>fails</str>
> 				<str>fail</str>
> 				<str>fill</str>
> 				<str>faith</str>
> 				<str>all</str>
> 				etc
> 			</arr>
> 		</lst>
> 		<lst name="collation">
> 			<str name="collationQuery">Title:(how AND fails)</str>
> 			<int name="hits">2</int>
> 			<lst name="misspellingsAndCorrections">
> 				<str name="hopq">how</str>
> 				<str name="faill">fails</str>
> 			</lst>
> 		</lst>
> 		<lst name="collation">
> 			<str name="collationQuery">Title:(hope AND faith)</str>
> 			<int name="hits">2</int>
> 			<lst name="misspellingsAndCorrections">
> 				<str name="hopq">hope</str>
> 				<str name="faill">faith</str>
> 			</lst>
> 		</lst>
> 		<lst name="collation">
> 			<str name="collationQuery">Title:(chops AND all)</str>
> 			<int name="hits">1</int>
> 			<lst name="misspellingsAndCorrections">
> 				<str name="hopq">chops</str>
> 				<str name="faill">all</str>
> 			</lst>
> 		</lst>
> 	</lst>
> </lst>
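
As an illustration of consuming the extended format, the following Python sketch pulls each collation out of a response shaped like the one above. The `parse_collations` helper and the trimmed `SAMPLE` document are assumptions made for this example; only the element and attribute names are taken from the sample output.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt modelled on the sample response above.
SAMPLE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="collation">
      <str name="collationQuery">Title:(how AND fails)</str>
      <int name="hits">2</int>
      <lst name="misspellingsAndCorrections">
        <str name="hopq">how</str>
        <str name="faill">fails</str>
      </lst>
    </lst>
    <lst name="collation">
      <str name="collationQuery">Title:(chops AND all)</str>
      <int name="hits">1</int>
      <lst name="misspellingsAndCorrections">
        <str name="hopq">chops</str>
        <str name="faill">all</str>
      </lst>
    </lst>
  </lst>
</lst>
"""


def parse_collations(xml_text):
    """Return one dict per <lst name="collation"> element, in document order."""
    root = ET.fromstring(xml_text.strip())
    collations = []
    for node in root.iter("lst"):
        if node.get("name") != "collation":
            continue
        query = node.find("str[@name='collationQuery']")
        hits = node.find("int[@name='hits']")
        fixes = node.find("lst[@name='misspellingsAndCorrections']")
        collations.append({
            "query": query.text,
            "hits": int(hits.text),
            # Map each misspelled word to the correction used in this collation.
            "corrections": {s.get("name"): s.text for s in fixes.findall("str")},
        })
    return collations


if __name__ == "__main__":
    for collation in parse_collations(SAMPLE):
        print(collation)
```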
> In addition, SolrJ is updated to include SpellCheckResponse.getCollatedResults(), which
will return the expanded Collation format.  getCollatedResult(), which returns a single String,
is retained for backwards-compatibility.  Other APIs were not changed but will still work
provided that spellcheck.collateExtendedResult is false.
> This patch likely will not return valid results when using shards.  That would require a more
robust interaction with the index than what exists in SpellCheckCollator.collate().
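
For anyone who wants to try the three new parameters, a request could be assembled roughly as in the Python sketch below. The base URL, the helper names, and the chosen values (5 tries, 3 collations) are assumptions for illustration; `spellcheck` and `spellcheck.collate` are the existing switches, and the other three names come from the description above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a local example Solr instance; adjust to your deployment.
BASE_URL = "http://localhost:8983/solr/select"


def collate_params(query, max_tries=5, max_collations=3, extended=True):
    """Build request parameters that enable verified, multiple collations."""
    return {
        "q": query,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        # 0 (the default) keeps the old behavior: collations are not checked.
        "spellcheck.maxCollationTries": str(max_tries),
        # 1 (the default) returns a single collation, as before.
        "spellcheck.maxCollations": str(max_collations),
        "spellcheck.collateExtendedResult": "true" if extended else "false",
    }


def build_url(params, base_url=BASE_URL):
    """Return the full request URL with properly escaped parameters."""
    return base_url + "?" + urlencode(params)


def query_params(url):
    """Decode a request URL back into its parameters (handy for checking)."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    print(build_url(collate_params("Title:(hopq AND faill)")))
```

Leaving `max_tries` at 0 and `max_collations` at 1 should reproduce the backwards-compatible behavior described in the parameter list.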

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

