lucene-solr-dev mailing list archives

From "Michael Gundlach (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-236) Field collapsing
Date Mon, 09 Nov 2009 23:47:32 GMT

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775192#action_12775192 ]

Michael Gundlach edited comment on SOLR-236 at 11/9/09 11:45 PM:
-----------------------------------------------------------------

I've found an NPE that occurs when performing quasi-distributed field collapsing.

My company only has one use case for field collapsing: collapsing on email address.  Our index
is spread across multiple cores.  We found that if we shard by email address, so that all
documents with a given email address are guaranteed to appear on the same core, then we can
do distributed field collapsing.

We add &collapse.field=email and &shards=core1,core2,... to a regular query.  Each
core collapses on email and sends the results back to the requestor.  Since no emails appear
on more than one core, we've accomplished distributed search.  We do lose the <collapse_count>
section, but that's not needed for our purpose -- we just need an accurate total document
count, and to have no more than one document for a given email address in the results.
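
A minimal sketch of the setup described above, assuming documents are routed to a core by a
hash of the email address.  The class name, the CRC32 hash, and the lower-casing of the email
are illustrative assumptions, not something taken from our deployment or from the patch; only
the collapse.field and shards parameters and the core1/core2 placeholders come from the text
above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import java.util.zip.CRC32;

/** Hypothetical helper: routes documents to cores by email and builds the query string. */
final class EmailShardRouter {
    private final List<String> cores;

    EmailShardRouter(List<String> cores) {
        if (cores.isEmpty()) {
            throw new IllegalArgumentException("need at least one core");
        }
        this.cores = List.copyOf(cores);
    }

    /** Returns the core for a document; equal emails always map to the same core. */
    String coreFor(String email) {
        // Assumption: emails are compared case-insensitively, ignoring surrounding whitespace.
        String key = email.trim().toLowerCase(Locale.ROOT);
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is non-negative, so the modulo is a valid list index.
        return cores.get((int) (crc.getValue() % cores.size()));
    }

    /** Appends the two parameters named in this comment to a regular query. */
    String queryString(String userQuery) {
        return "q=" + URLEncoder.encode(userQuery, StandardCharsets.UTF_8)
                + "&collapse.field=email"
                + "&shards=" + String.join(",", cores);
    }

    public static void main(String[] args) {
        EmailShardRouter router = new EmailShardRouter(List.of("core1", "core2"));
        System.out.println(router.coreFor("someone@example.com"));
        System.out.println(router.queryString("name:smith"));
    }
}
```

Any stable hash would do here; what matters is that the routing key is the same field that is
later passed as collapse.field.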

Unfortunately, this throws an NPE when searching on a tokenized field.  Searching string fields
is fine.  I don't understand exactly why the NPE appears, but I did bandaid over it by checking
explicitly for nulls at the appropriate line in the code.  No more NPE.

There's a downside, which is that if we attempt to collapse on a field other than email --
one which has documents appearing in multiple cores -- the results are buggy: the first search
returns few documents, and the number of documents actually displayed doesn't always match the
"numFound" value.  Then upon refresh we get what we think is the correct numFound, and the
correct list of documents.  This doesn't bother me too much, as you're guaranteed to get incorrect
answers from the collapse code anyway when collapsing on a field that you didn't use as your
key for sharding.
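
To illustrate why the sharding key matters, here is a small simulation.  It is a simplified
stand-in, not the SOLR-236 code: "collapsing" is reduced to keeping the first document per
field value, and the merge is a plain concatenation of the per-core results.  All class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ShardCollapseDemo {
    /** A document reduced to its id and the value of the collapse field. */
    static final class Doc {
        final String id;
        final String key;

        Doc(String id, String key) {
            this.id = id;
            this.key = key;
        }
    }

    /** Keeps the first document seen for each collapse-field value. */
    static List<Doc> collapse(List<Doc> docs) {
        Map<String, Doc> firstByKey = new LinkedHashMap<>();
        for (Doc doc : docs) {
            firstByKey.putIfAbsent(doc.key, doc);
        }
        return new ArrayList<>(firstByKey.values());
    }

    /** Collapses each shard on its own, then concatenates the per-shard results. */
    static List<Doc> collapsePerShardThenMerge(List<List<Doc>> shards) {
        List<Doc> merged = new ArrayList<>();
        for (List<Doc> shard : shards) {
            merged.addAll(collapse(shard));
        }
        return merged;
    }

    /** True if the merged list still contains two documents with the same field value. */
    static boolean hasDuplicateKeys(List<Doc> docs) {
        return collapse(docs).size() != docs.size();
    }

    public static void main(String[] args) {
        // Sharded by the collapse field: each value lives on exactly one shard.
        List<List<Doc>> byEmail = List.of(
                List.of(new Doc("1", "a@example.com"), new Doc("2", "a@example.com")),
                List.of(new Doc("3", "b@example.com")));
        // Not sharded by the collapse field: the same value appears on both shards.
        List<List<Doc>> byOther = List.of(
                List.of(new Doc("1", "a@example.com")),
                List.of(new Doc("2", "a@example.com")));
        System.out.println(hasDuplicateKeys(collapsePerShardThenMerge(byEmail)));
        System.out.println(hasDuplicateKeys(collapsePerShardThenMerge(byOther)));
    }
}
```

When every field value is confined to one shard, the merged list has no repeated values; when
a value spans shards, each shard contributes its own representative and the merged list is no
longer collapsed.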

In the spirit of Yonik's law of patches, I have made two imperfect patches attempting to contribute
the fix, or at least point out the error:

1. I pulled trunk, applied the latest SOLR-236 patch, made my two-line change, and created a
patch file.  The resulting patch file looks very different from the latest SOLR-236 patch file,
so I assume I did something wrong.

2. I pulled trunk, made my two-line change, and created another patch file.  This file is tiny
but of course is missing all of the field collapsing changes.

Would you like me to post either of these patch files to this issue?  Or is it sufficient to
just tell you that the NPE occurred in QueryComponent.java on line 556?
("rb._responseDocs.set(sdoc.positionInResponse, doc);" where sdoc was null.)  Perhaps my use
case is extraordinary enough that you're happy leaving the NPE in place and telling other
users not to do what I'm doing?
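
For reference, the shape of the null check is roughly as follows.  This is a self-contained
sketch rather than the actual QueryComponent code: only sdoc and positionInResponse are names
from the line quoted above, while the surrounding class, the resultIds map, and the array
standing in for rb._responseDocs are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class NullGuardSketch {
    /** Hypothetical stand-in for the per-document merge bookkeeping. */
    static final class ShardDoc {
        final int positionInResponse;

        ShardDoc(int positionInResponse) {
            this.positionInResponse = positionInResponse;
        }
    }

    /**
     * Places each fetched document id at its recorded response position.
     * Ids with no recorded position are skipped instead of triggering an NPE.
     * Returns the number of skipped ids.
     */
    static int place(Map<String, ShardDoc> resultIds, List<String> fetchedIds, String[] response) {
        int skipped = 0;
        for (String id : fetchedIds) {
            ShardDoc sdoc = resultIds.get(id);
            if (sdoc == null) {
                // Band-aid: the lookup found nothing, so there is no position to write to.
                skipped++;
                continue;
            }
            response[sdoc.positionInResponse] = id;
        }
        return skipped;
    }

    public static void main(String[] args) {
        String[] response = new String[2];
        int skipped = place(Map.of("a", new ShardDoc(0)), List.of("a", "b"), response);
        System.out.println(skipped + " skipped, first slot: " + response[0]);
    }
}
```

A guard like this hides the symptom only; it does not explain why the lookup comes back null
for tokenized fields.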

Thanks!
Michael

  
> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch,
> collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to
> a single entry in the result set. Site collapsing is a special case of this, where all results
> for a given web site is collapsed into one or two entries in the result set, typically with
> an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

