[ https://issues.apache.org/jira/browse/SOLR-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403331#comment-13403331
]
Chris Russell edited comment on SOLR-3583 at 6/28/12 6:29 PM:

This patch builds upon the distributed pivot facets introduced in SOLR-2894 and adds the ability
to request rudimentary percentiles when faceting. The percentiles are calculated by using
range facets to create "buckets" that divide up the value range of the field in question. A range
facet is done on each bucket to determine the number of documents whose value falls within it.
An average value for each bucket is determined by averaging the upper and lower bounds of
that bucket, i.e. it is the bucket midpoint. The document count and the average of each bucket
are then used to determine the percentiles, with the bucket average being returned as the
percentile value. The accuracy of the value is therefore limited by the bucket size: smaller
buckets yield more accurate values but are more computationally intensive.
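The bucket approach described above can be sketched in a few lines of Python. This is only
an illustration, not code from the patch: the function name is made up, and the rule for
picking a bucket (nearest rank, i.e. the first bucket whose cumulative count reaches the
requested share of documents) is an assumption, since the comment does not spell it out.

```python
import math
from bisect import bisect_left
from itertools import accumulate


def bucket_percentiles(counts, lower_fence, gap, requested):
    """Estimate percentiles from per-bucket document counts.

    counts[i] is the number of documents in bucket i, which covers
    [lower_fence + i * gap, lower_fence + (i + 1) * gap).
    Assumption (not taken from the patch): nearest-rank selection, so the
    p-th percentile is the midpoint of the first bucket whose cumulative
    count reaches ceil(p / 100 * total).
    """
    total = sum(counts)
    if total == 0:
        return {p: None for p in requested}
    cumulative = list(accumulate(counts))
    result = {}
    for p in requested:
        # Rank of the document that marks the p-th percentile (at least 1).
        rank = max(1, math.ceil(p * total / 100))
        i = min(bisect_left(cumulative, rank), len(counts) - 1)
        # The bucket midpoint stands in for every value in the bucket.
        result[p] = lower_fence + (i + 0.5) * gap
    return result
```

With counts of 2, 3 and 5 documents in three buckets of width 10 starting at 0, the 25th and
50th percentiles both come out as 15.0 and the 75th as 25.0, which shows how coarse buckets
blur nearby percentiles together.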
The choice to use buckets and accept "fuzzy" values was made because 1) we were already using
query facets to do this and wanted a solution that involved less querying, and 2) our use
case involves document counts on the order of tens of millions, and coalescing distinct
values across shards during distributed search seemed problematic from a performance standpoint.
Usage:
Querying:
Faceting must be enabled (facet=true). Then you may use the following parameters to define
your percentiles request:
percentiles=true : enables facet statistics
percentiles.field=fieldname : field to calculate percentiles for; can be specified more than
once
percentiles.requested.percentiles=25,50,75 : requested percentiles, e.g. 25th, 50th, 75th
percentiles.lower.fence=0 : lower bound for percentiles calculation, i.e. lower edge of first
bucket
percentiles.upper.fence=5000 : upper bound for percentiles calculation, i.e. upper edge of
last bucket
percentiles.gap=10 : bucket size, e.g. bucket 1 is 0-10, bucket 2 is 10-20, etc. (double
counting on edges is avoided, similar to range facets)
percentiles.averages=true : set this if you would like average and doc count reported for
each field (average is the weighted average of bucket midpoints)
facet.pivot=field1,field2 : if you ask for pivots, percentiles will show up on a per-pivot
basis!
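As a rough illustration of what "weighted average of bucket midpoints" means for
percentiles.averages, here is a small Python sketch. The function name and the handling of
the empty case are assumptions for illustration and are not taken from the patch.

```python
def bucket_average(counts, lower_fence, gap):
    """Return (average, doc_count) from per-bucket document counts.

    Each bucket contributes its midpoint, weighted by how many documents
    fall into it. Returning None for an empty result is an assumption.
    """
    total = sum(counts)
    if total == 0:
        return None, 0
    weighted_sum = sum(
        count * (lower_fence + (i + 0.5) * gap)
        for i, count in enumerate(counts)
    )
    return weighted_sum / total, total
```

For counts of 2, 3 and 5 in buckets 0-10, 10-20 and 20-30, the midpoints are 5, 15 and 25, so
the reported average would be (2*5 + 3*15 + 5*25) / 10 = 18.0 over 10 documents.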
Here is an example URL using the example documents included with Solr:
http://localhost:8983/solr/select?q=*%3A*&start=0&rows=3&wt=xml&facet=true&percentiles=true&percentiles.field=popularity&percentiles.requested.percentiles=25,50,75&percentiles.averages=true&facet.field=price&facet.field=popularity&facet.pivot=manufacturedate_dt&f.popularity.percentiles.lower.fence=0&f.popularity.percentiles.upper.fence=11&f.popularity.percentiles.gap=1&facet.sort=index&percentiles.field=price&percentiles.lower.fence=0&percentiles.upper.fence=5000&percentiles.gap=10
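A request like the one above can also be assembled programmatically. The following Python
sketch only uses the parameter names listed in this comment; the helper name and its
arguments are hypothetical. Note that the example URL additionally uses per-field overrides of
the form f.popularity.percentiles.gap=1, which this sketch does not cover.

```python
from urllib.parse import urlencode


def percentiles_url(base, field, requested, lower, upper, gap,
                    averages=False, pivot=None):
    """Build a select URL that asks for percentiles on a single field."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),  # faceting must be enabled
        ("percentiles", "true"),
        ("percentiles.field", field),
        ("percentiles.requested.percentiles",
         ",".join(str(p) for p in requested)),
        ("percentiles.lower.fence", lower),
        ("percentiles.upper.fence", upper),
        ("percentiles.gap", gap),
    ]
    if averages:
        params.append(("percentiles.averages", "true"))
    if pivot:
        params.append(("facet.pivot", ",".join(pivot)))
    # urlencode percent-encodes reserved characters such as ':' and ','.
    return base + "?" + urlencode(params)


url = percentiles_url("http://localhost:8983/solr/select", "price",
                      [25, 50, 75], 0, 5000, 10, averages=True)
```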
Results format:
If percentiles are requested, the "facet_statistics" node will show under "facet_counts".
Each field requested will have its own subsection. Each subsection will contain percentiles
and, optionally, average and count.
If pivot facets are also requested, each level of pivot will have a "statistics" section
that will contain per-field info similar to that found in "facet_statistics" above.
Notes:
All field types that range facets support are supported, but the average on a date field
will always be returned as 0. Apologies.
Works in distributed mode!
Includes a unit test.
If you're curious about what settings are used internally for the range faceting, they are:
rangeHardEnd = false;
includeLower = true;
includeUpper = false;
includeEdge = false;
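To make the effect of these settings concrete, here is a Python sketch of how buckets would
be laid out and how a value would be assigned to one. It assumes the internal flags behave
like the public range facet options of the same meaning (hardend and include), which is an
interpretation and not something stated in the patch; the function names are made up.

```python
def bucket_edges(lower_fence, upper_fence, gap):
    """List (low, high) bounds for each bucket.

    With rangeHardEnd = false every bucket is exactly `gap` wide, so the
    last bucket may extend past upper_fence when gap does not divide the
    range evenly.
    """
    edges = []
    low = lower_fence
    while low < upper_fence:
        edges.append((low, low + gap))
        low += gap
    return edges


def bucket_index(value, lower_fence, upper_fence, gap):
    """Return the index of the bucket holding value, or None if outside.

    includeLower = true and includeUpper = false make each bucket the
    half-open interval [low, high), so a value on a shared edge is counted
    once, in the higher bucket.
    """
    edges = bucket_edges(lower_fence, upper_fence, gap)
    if not edges or value < lower_fence or value >= edges[-1][1]:
        return None
    return int((value - lower_fence) // gap)
```

For example, with a lower fence of 0, an upper fence of 25 and a gap of 10, the buckets are
0-10, 10-20 and 20-30, and a value of exactly 10 lands in the second bucket only.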
> Percentiles for facets, pivot facets, and distributed pivot facets
> 
>
> Key: SOLR-3583
> URL: https://issues.apache.org/jira/browse/SOLR-3583
> Project: Solr
> Issue Type: Improvement
> Reporter: Chris Russell
> Priority: Minor
> Labels: newbie, patch
> Fix For: 4.0
>
> Attachments: SOLR-3583.patch
>
>
> Built on top of SOLR-2894 (includes Apr 25th version) this patch adds percentiles and
> averages to facets, pivot facets, and distributed pivot facets by making use of range facet
> internals.

