lucene-solr-user mailing list archives

From Chris Hostetter
Subject Re: solr 1.4 facet.limit behaviour in merging from several shards
Date Wed, 21 Sep 2011 05:46:06 GMT

: document in a shard has a field, which contains date in milliseconds which
: is a result of subtraction of the original document's date from a very big
: date in the future. In this way, if you issue a facet query against a shard
: and use facet.method=index you get hits from the shard ordered
: lexicographically in reverse order.
: Here is an example of two values:
: 9223370739060532807_docid1
: 9223370741484545807_docid2

you've explained what the 9223370739060532807 part is, but i don't 
understand the "docid1" part ... is that the primary key for this 
document? ... that doesn't really make sense since in your example the 
constraint count is sometimes "2" ... so i don't really understand what 
this value represents ... more on this below...

: Here is a typical facet query:
: wt=xml&start=0&hl.alternateField=Contents&version=1&df=Contents&q=aerospace+engineer&hl.alternateFieldLength=100000&facet=true&f.OppositeDateLongNumber_docid.facet.limit=1000&facet.field=OppositeDateLongNumber_docid&rows=1&facet.sort=index&facet.zeros=false&isShard=true


: Back to the problem: It has been reproducible, that if query ran from the
: solr - router reaches two or more shards, each of which generates around
: 1000 hits, upon merging, some portion of hits (on the time border between
: two shards) gets dropped. So the result hit list is uniform otherwise,
: except for the missing portion of hits in the middle.
: So the question is: if the facet search reaches two or more shards and each
: shard generates 1000 results, which entries will go into the final list of
: resulting entries, given the facet.limit=1000 set on the original
: distributed query? What is the algorithm in this case?

Unless there is a bug, the algorithm is (supposed to be) that the list you 
get back is exactly what you've asked for:  you've asked for at most 1000 
constraints, with non-zero values, in "index" order.  

In your case, assuming none of the "long" values you get from your 
subtraction will result in a variable number of characters, "index order" 
is numerically increasing.  so you will get the "first" 1000 constraint 
values from that field back -- if your "first" shard (in your date/number 
based partitioning schema) contains enough matching documents that there 
are at least 1000 constraint values, then they will be the full 1000 
constraints returned. if it doesn't return a full 1000, then some will be 
taken from the next shard until you get 1000.

(this isn't the general algorithm, but this is how it would work for your 
index given that (according to your description) there is no overlap 
between these values in the shards)
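
To make the expected behaviour concrete, here is a minimal sketch of the merge described above. It is an illustration only, not Solr's actual code: the function name `merge_index_order` and the sample values are hypothetical, and it assumes every shard returns its constraints as (term, count) pairs already sorted in index order, with fixed-width terms so that string order matches numeric order.

```python
import heapq


def merge_index_order(shard_results, limit):
    """Return the first `limit` constraints in index (lexicographic) order.

    shard_results: one list per shard of (term, count) pairs, each list
    already sorted by term. Counts for a term seen on several shards are
    summed. Zero counts are dropped, mirroring facet.zeros=false.
    """
    nonzero = [[(t, c) for t, c in shard if c > 0] for shard in shard_results]
    merged = []
    # heapq.merge lazily interleaves the sorted inputs in ascending order.
    for term, count in heapq.merge(*nonzero):
        if merged and merged[-1][0] == term:
            # Same term from another shard: add the counts together.
            merged[-1] = (term, merged[-1][1] + count)
        elif len(merged) == limit:
            # A new term would exceed the limit, so stop here.
            break
        else:
            merged.append((term, count))
    return merged


# Hypothetical, non-overlapping shards partitioned by value range.
lower_shard = [("100_a", 1), ("101_b", 2), ("102_c", 1)]
higher_shard = [("200_d", 1), ("201_e", 1)]

result = merge_index_order([lower_shard, higher_shard], limit=4)
```

With non-overlapping shards like these, the result is every constraint from the lower shard followed by as many as are needed from the higher one. Nothing near the shard border should go missing.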

with the setup you describe, there's no way i can imagine executing a 
search that results in constraints being returned that come from multiple 
shards with some constraints being "missing" from the middle of the list, 
near the border of values for that field that signify a change in shard.

are the missing values from the "lower" shard or the "higher" shard?

can you provide a reproducible example?  some test data that when loaded 
into the example schemas and run on two shards produces the types of 
results you are seeing?

Furthermore: what you are doing is a *really* wacky use of faceting ... i 
have honestly never seen anything like it, hence my question about the 
significance of the "docid1" and "docid2" in your example field values -- 
can you elaborate on what these values mean, and how you are ultimately 
using the facet results you get back?  because i am seriously curious as 
to your use case, and more than a little suspicious that there might be a 
simpler and more efficient way to solve whatever use case you have...
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?