lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sandro Zbinden <zbin...@imagic.ch>
Subject Avoid Solr Pivot Faceting Out of Memory / Shorter result for pivot faceting requests with facet.pivot.ngroup=true and facet.pivot.showLastList=false
Date Wed, 17 Jul 2013 11:44:32 GMT
Dear Usergroup


I am getting an out of memory exception in the following scenario.
I have 4 sql tables: patient, visit, study and image that will be denormalized for the solr
index
The solr index looks like the following


--------------------------------------------
|p_id |p_lastname|v_id  |v_name  |...
--------------------------------------------
| 1      | Miller            | 10     | Study 1   |...
| 2      | Miller            | 11     | Study 2   |...
| 2      | Miller            | 12     | Study 3   |...  <-- Duplication because of denormalization
| 3      | Smith            | 13     | Study 4  |...
----------------------------------

Now I am executing a facet query

q=*:*&facet=true &facet.pivot=p_lastname,p_id &facet.limit=-1

And I get the following result

<lst>
<str name="field">p_lastname</str>
<str name="value">Miller</str>
<int name="count">3</int>
<arr name="pivot">
  <lst>
   <str name="field">p_id</str>
   <int name="value">1</int>
   <int name="count">1</int>
  </lst>
  <lst>
   <str name="field">p_id</str>
   <int name="value">2</int>
   <int name="count">2</int>
  </lst>
</arr>
</lst>
<lst>
<str name="field">p_lastname</str>
<str name="value">Smith</str>
<int name="count">1</int>
<arr name="pivot">
   <str name="field">p_id</str>
   <int name="value">3</int>
   <int name="count">1</int>
  </lst>
</arr>
</lst>


The goal is to show our clients a list of the group value and in parentheses how many patients
the group contains.
 - Miller (2)
- Smith (1)

This is why we need to use the facet.pivot method with facet.limit-1. It is as far as I know
the only way to get a grouping for 2 criterias.
And we need the pivot list to count how many patients are in a group.


Currently this works good on smaller indexes but if we have arround 1'000'000 patients and
we execute a query like the one above we run in an out of memory.
I figured out that the problem is not the calculation of the pivot but is the presentation
of the result.
Because we load all fields (we can not us facet.offset because we need to order the results
ascending and descending) the result can get really big.

To avoid this overload I created a change in the solr-core PivotFacetHandler.java class.
In the method doPivots i added the following code

   NamedList<Integer> nl = this.getTermCounts(subField);
   pivot.add( "ngroups", nl.size());

This will give me the group size of the list.
Then I removed the recursion call pivot.add( "pivot", doPivots( nl, subField, nextField, fnames,
subset) );
Like this my result looks like the following

<lst>
<str name="field">p_lastname</str>
<str name="value">Miller</str>
<int name="count">3</int>
<int name="ngroup">2</int>
</lst>
<lst>
<str name="field">p_lastname</str>
<str name="value">Smith</str>
<int name="count">1</int>>
<int name="ngroup">1</int>
</lst>


My questions is now if there is already something planned like facet.pivot.ngroup=true and
facet.pivot.showLastList=false to improve the performance
of pivot faceting.

Is there a chance we could get this into the solr code. I think it's a really small change
of the code but could improve the product enormous.

Best Regards

Sandro Zbinden


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message