lucene-solr-user mailing list archives

From Peter Sturge <peter.stu...@gmail.com>
Subject Re: Handling intersection facets of many values
Date Wed, 19 Nov 2014 20:26:01 GMT
Hi Toke,
Thanks for your input.

I guess you mean take the 1k or so values and build a boolean query from
them?
If that's not what you mean, my apologies..
I'd thought of doing that - the trouble I had was that the number of unique
values could be 20k, or 15,167, or any arbitrary and potentially high-ish
number - it's not really known and can/will change over time. I believe a
boolean query with more than 1024 clauses (the default maxBooleanClauses
limit) fails with a 'too many clauses' error, so scalability is a concern.
The other issue is how this would yield the unique facet values -
e.g. dest=8.8.8.8 (17) [i.e. 8.8.8.8 is in the 'addr' list and occurs 17
times in entries with a 'dest' field] - in fact, I need the unique
value(s) ('8.8.8.8') more than I need the count ('17').
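
To make the clause-limit concern concrete, here is a minimal Python sketch of
splitting a large value list into several OR queries, each under the limit.
The function name batched_or_queries is hypothetical, the 1024 default is an
assumption about the configured limit, and the quoting only handles
backslashes and double quotes.

```python
def batched_or_queries(field, values, max_clauses=1024):
    """Yield query strings, each OR-ing at most max_clauses quoted values.

    Hypothetical helper: max_clauses=1024 is an assumed limit, not read
    from any server configuration.
    """
    values = list(values)
    for start in range(0, len(values), max_clauses):
        chunk = values[start:start + max_clauses]
        # Escape backslashes first, then double quotes, and quote each value.
        quoted = " OR ".join(
            '"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"'
            for v in chunk
        )
        yield f"{field}:({quoted})"


if __name__ == "__main__":
    for q in batched_or_queries("addr", ["8.8.8.8", "10.0.0.1"]):
        print(q)
```

Each batch would still have to be sent as a separate query and the results
merged on the client, and this alone does not return the matching unique
values, so it only addresses the first of the two issues.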

I could get the facet list of 'dest' values, then trawl through each one,
but this will be a complicated and time-consuming client-side operation.
I'm also looking at creating a custom QueryParser that would build the
relevant DocLists, then intersect them and return the values, but I'd
rather not reinvent the wheel if possible. Given that facets already
build unique term lists, the existing machinery seems so close - I guess
it's like taking two facet lists (one for addr, one for dest), intersecting
them and returning the result:

List 1:
a
b
c
d
e
f

List 2:
a
a
g
z
c
c
c
e

Resultant intersection:
a (2)
c (3)
e (1)
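
The same intersection, written as a minimal client-side Python sketch. The
function name intersect_facets is hypothetical; it assumes both value lists
have already been fetched and fit in memory, which is exactly the part that
gets expensive at 7M values.

```python
from collections import Counter


def intersect_facets(addr_values, dest_values):
    """Map each dest value that also appears in addr_values to its dest count."""
    addr_set = set(addr_values)          # unique 'addr' values (List 1)
    dest_counts = Counter(dest_values)   # occurrences per 'dest' value (List 2)
    return {v: n for v, n in dest_counts.items() if v in addr_set}


list1 = ["a", "b", "c", "d", "e", "f"]
list2 = ["a", "a", "g", "z", "c", "c", "c", "e"]
print(intersect_facets(list1, list2))  # {'a': 2, 'c': 3, 'e': 1}
```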


Thanks,
Peter



On Wed, Nov 19, 2014 at 7:16 PM, Toke Eskildsen <te@statsbiblioteket.dk>
wrote:

> Peter Sturge [peter.sturge@gmail.com] wrote:
>
> [addr 7M unique, dest 1K unique]
>
> > What is the best/only/most efficient way to construct a search whereby I
> > get back an (ideally faceted) list of values for 'dest' that occur in
> > 'addr'?
>
> I assume the actual values are defined by a query? As the number of
> possible values in dest is not that large, extracting those first and then
> using them as a filter when searching for addr seems like a fairly
> efficient way of solving the problem.
>
> - Toke Eskildsen
>
