lucene-solr-dev mailing list archives

From "Martijn van Groningen (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-236) Field collapsing
Date Fri, 29 May 2009 13:02:46 GMT

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714442#action_12714442 ]

Martijn van Groningen edited comment on SOLR-236 at 5/29/09 6:02 AM:
---------------------------------------------------------------------

Hi,

I have modified Thomas's latest patch and made two performance improvements:
1) Improved normal field collapsing. I tested it with an index of 1.1 million documents. When
collapsing on all documents with no sorting specified (so sorting on score), the query
time is around 130 ms, compared with around 1.5 s for the previous patch. When I then add
sorting on a string field, the query time is around 220 ms, compared with around 5.2 s for
the previous patch.

The reason it is faster is that the latest patch queries for a docSet instead of a
docList. The normal collapse method keeps track of the most relevant documents itself, so the
end result is the same, whereas creating a docList of 1.1 million documents (and ordering it)
is very expensive.

Note: I did not improve adjacent collapsing, because the adjacent method needs (as far as
I understand it) a completely sorted list of documents (docList).

2) Slightly improved faceting in combination with field collapsing by reusing the uncollapsed
docSet that is created during the collapsing process (the previous patch invoked a second
search).
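
To illustrate the difference between the two collapse types described in point 1, here is
a minimal Python sketch. It is not the code from the patch: the function names, the tuple
layout of the hits and the sample data are assumptions made for illustration only. The
"normal" variant reads the hits in any order and keeps only the best documents per field
value, so it never sorts the full match set; the "adjacent" variant only makes sense on a
list that is already in final result order.

```python
import heapq
from collections import defaultdict


def collapse_normal(hits, collapse_max=1):
    """Hypothetical sketch of 'normal' collapsing.

    hits: iterable of (doc_id, score, field_value) in ANY order.
    Returns (doc ids ranked by score, number of hidden docs per field value).
    """
    best = defaultdict(list)    # field_value -> min-heap of (score, doc_id)
    hidden = defaultdict(int)   # field_value -> count of collapsed documents
    for doc_id, score, value in hits:
        heap = best[value]
        if len(heap) < collapse_max:
            heapq.heappush(heap, (score, doc_id))
        else:
            # Push the newcomer and drop whichever entry is now the weakest.
            heapq.heappushpop(heap, (score, doc_id))
            hidden[value] += 1
    # Only the surviving documents are sorted, not every matching document.
    survivors = [pair for heap in best.values() for pair in heap]
    survivors.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in survivors], dict(hidden)


def collapse_adjacent(sorted_hits, collapse_max=1):
    """Hypothetical sketch of 'adjacent' collapsing.

    sorted_hits: list of (doc_id, field_value) ALREADY in final result order.
    Only consecutive documents with the same field value are collapsed.
    """
    kept = []
    run_value, run_length = object(), 0  # sentinel that equals no real value
    for doc_id, value in sorted_hits:
        if value == run_value:
            run_length += 1
        else:
            run_value, run_length = value, 1
        if run_length <= collapse_max:
            kept.append(doc_id)
    return kept


# Sample data (invented): (doc_id, score, site)
hits = [(1, 0.9, "a.com"), (2, 0.4, "b.com"), (3, 0.7, "a.com"), (4, 0.8, "b.com")]
ranked, hidden = collapse_normal(hits, collapse_max=1)

# Adjacent collapsing needs the hits in result order first.
in_order = sorted(hits, key=lambda h: -h[1])
adjacent = collapse_adjacent([(doc_id, site) for doc_id, _, site in in_order])
```

In this sketch the normal variant needs memory and sorting work proportional to the number
of distinct field values, while the adjacent variant depends on the complete sorted list,
which matches the reason given above for leaving it unchanged.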

I have also added documentation, added a few unit tests for the collapsing process itself,
and made the debug information more readable.

I'm very interested in other people's experiences with this patch and feedback on the patch
itself. 

Cheers,

Martijn 


> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch,
collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-solr-236.patch,
field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch,
field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to
a single entry in the result set. Site collapsing is a special case of this, where all results
for a given web site is collapsed into one or two entries in the result set, typically with
an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)
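
As an illustration of the three query parameters listed in the issue description, the
following Python sketch builds a request URL. Only the parameter names "collapse.field",
"collapse.type" and "collapse.max" come from the description; the host, port, handler path,
query text and field name are assumptions, and the patch must be applied for the parameters
to have any effect.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed location of a local Solr instance; adjust to your own setup.
BASE_URL = "http://localhost:8983/solr/select"

params = {
    "q": "solr",                # example query text (invented)
    "collapse.field": "site",   # hypothetical field used to group results
    "collapse.type": "normal",  # "normal" (default) or "adjacent"
    "collapse.max": 1,          # results allowed per group before collapsing
}

url = BASE_URL + "?" + urlencode(params)

# Parse the URL again to confirm the parameters survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
```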

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

