lucene-solr-dev mailing list archives

From "Abdul Chaudhry (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-236) Field collapsing
Date Fri, 04 Sep 2009 00:57:58 GMT

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751243#action_12751243 ]

Abdul Chaudhry edited comment on SOLR-236 at 9/3/09 5:56 PM:
-------------------------------------------------------------

I have some ideas for performance improvements.

I noticed that the code fetches the field cache twice, once for the collapse and then for
the response object, assuming you asked for the info count in the response.

That seems expensive, especially for real-time content.

I think it's better to use FieldCache.StringIndex instead of returning a large string array,
and to keep it around for both the collapse and the response object.

I changed the code so that I keep the cache around like so:

  /**
   * Keep the field cached for the collapsed fields for the response object as well
   */
  private FieldCache.StringIndex collapseIndex;


To get the index, use something like this instead of getting the string array for all docs:

collapseIndex = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), collapseField);

When collapsing, you can get the current value using something like this and remove the code
that passes the array around:

      int currentId = i.nextDoc();
      String currentValue = collapseIndex.lookup[collapseIndex.order[currentId]];
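
For anyone unfamiliar with the structure, here is a minimal sketch of how the order/lookup pair
resolves a doc id to its field value. StringIndexSketch is a hypothetical stand-in written only
for illustration, not the Lucene class. It assumes lookup holds the sorted distinct values, with
slot 0 reserved for docs that have no value in the field.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical stand-in for FieldCache.StringIndex; NOT the Lucene class. */
final class StringIndexSketch {
    /** Sorted distinct values; slot 0 is null for docs without a value (assumption). */
    final String[] lookup;
    /** order[docId] is the position of that doc's value in lookup. */
    final int[] order;

    StringIndexSketch(String[] valuePerDoc) {
        // Collect the distinct non-null values in sorted order.
        TreeSet<String> distinct = new TreeSet<>();
        for (String v : valuePerDoc) {
            if (v != null) {
                distinct.add(v);
            }
        }
        lookup = new String[distinct.size() + 1];
        int next = 1; // slot 0 stays null
        for (String v : distinct) {
            lookup[next++] = v;
        }
        // Store one int per doc instead of one String per doc.
        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            String v = valuePerDoc[doc];
            order[doc] = (v == null) ? 0 : Arrays.binarySearch(lookup, 1, lookup.length, v);
        }
    }

    /** Same access pattern as collapseIndex.lookup[collapseIndex.order[docId]]. */
    String valueOf(int docId) {
        return lookup[order[docId]];
    }

    public static void main(String[] args) {
        StringIndexSketch idx =
            new StringIndexSketch(new String[] {"b.com", "a.com", "b.com", null});
        for (int doc = 0; doc < idx.order.length; doc++) {
            System.out.println(doc + " -> " + idx.valueOf(doc));
        }
    }
}
```

Docs that share a value share an ordinal, so two docs can be compared by int instead of by
string, and each distinct value is stored once.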

When building the response for the info count, you can reference the same cache like so:

          if (collapseInfoCount) {
            resCount.add(collapseFieldType.indexedToReadable(
              collapseIndex.lookup[collapseIndex.order[id]]), count);
          }
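
As a rough illustration of why the shared index helps here, the sketch below tallies docs per
field value by ordinal and only touches the strings once at the end. It is a simplification: it
counts every doc id it is given, which is not necessarily how the patch computes its collapse
count. The class and method names are hypothetical, and slot 0 of lookup is assumed to mean
"no value".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; illustrates counting by ordinal, not the patch's actual logic. */
final class CollapseCountSketch {

    /**
     * Counts the given docs per field value using the order/lookup arrays.
     * Docs with ordinal 0 (assumed to mean "no value") are left out of the result.
     */
    static Map<String, Integer> countByValue(int[] order, String[] lookup, int[] docIds) {
        int[] counts = new int[lookup.length];
        for (int docId : docIds) {
            counts[order[docId]]++; // int array update, no string handling per doc
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int ord = 1; ord < lookup.length; ord++) {
            if (counts[ord] > 0) {
                result.put(lookup[ord], counts[ord]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] order = {2, 1, 2, 0};
        String[] lookup = {null, "a.com", "b.com"};
        System.out.println(countByValue(order, lookup, new int[] {0, 2, 3}));
    }
}
```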

I also added timing for the cache access, as it could be slow if you are doing a lot of updates.
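
The timing itself can be as simple as the sketch below. This is not the code from my change,
just one way to do it: the Timed holder and the time() helper are hypothetical names, and the
loader stands in for whatever call fetches the field cache.

```java
import java.util.function.Supplier;

/** Hypothetical timing wrapper; the loader stands in for the field cache lookup. */
final class TimedLoadSketch {

    /** Result of a load together with how long it took. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the loader once and records the elapsed wall-clock time. */
    static <T> Timed<T> time(Supplier<T> loader) {
        long start = System.nanoTime();
        T value = loader.get();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder loader; a real caller would fetch the cached index here.
        Timed<String[]> timed = time(() -> new String[] {null, "a.com", "b.com"});
        System.out.println("loaded " + timed.value.length + " entries in "
            + (timed.elapsedNanos / 1_000_000.0) + " ms");
    }
}
```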

I have added code for displaying selected fields for the duplicates, but it's difficult to
submit. I hope this gets committed, as it's hard to submit a patch when the code is not in svn
and I cannot submit a patch to a patch to a patch... you get the idea.



> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch,
>                      collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
>                      field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch,
>                      field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch,
>                      field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
>                      field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
>                      SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to
> a single entry in the result set. Site collapsing is a special case of this, where all results
> for a given web site is collapsed into one or two entries in the result set, typically with
> an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)
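
The three parameters named in the quoted description could be combined into a request roughly as
in the sketch below. The parameter names come from the description above; the field name "site",
the query "solr", and the helper class are hypothetical, and whether the latest patch still
accepts exactly these parameters is not confirmed here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that assembles a query string with the collapse parameters. */
final class CollapseParamsSketch {

    /** Joins the parameters as URL-encoded key=value pairs, in insertion order. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "solr");                 // hypothetical query
        params.put("collapse.field", "site");    // hypothetical field name
        params.put("collapse.type", "adjacent"); // "normal" is the stated default
        params.put("collapse.max", "1");
        System.out.println(toQueryString(params));
    }
}
```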

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

