lucene-solr-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: range queries on string field with millions of values
Date Sun, 30 Nov 2008 14:52:54 GMT
On Sun, Nov 30, 2008 at 2:04 AM, Naomi Dushay <ndushay@stanford.edu> wrote:
> The terms component approach, if i understand it correctly, will be
> problematic.  I need to present not only the next X call numbers in
> sequence, but other fields in those documents (e.g. title, author).

You can still use the method Hoss suggested of doing 2 requests to
satisfy this type of search:

>> But as Yonik said: the new TermsComponent may actually be a better option
>> for you -- doing two requests for every page (the first to get the N Terms
>> in your id field starting with your input, the second to do a query for
>> docs matching any of those N ids) might actually be faster even though
>> there won't likely even be any cache hits.

So TermsComponent gets the next 10 IDs, then you do a standard query
with those 10 IDs.
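
A rough sketch of those two requests in Python, for illustration only. The base URL, the "terms" and "select" handler paths, and the field name "callnum" are assumptions about a local setup, not anything confirmed in this thread. The terms.* parameters are the TermsComponent request parameters. The code only builds the request URLs; it does not contact Solr.

```python
from urllib.parse import urlencode

# Assumed base URL and handler paths; adjust to your own Solr setup.
SOLR_BASE = "http://localhost:8983/solr"


def terms_params(field, start_term, n=10):
    """Request 1: the next n indexed terms at or after start_term."""
    return {
        "terms": "true",
        "terms.fl": field,
        "terms.lower": start_term,
        "terms.lower.incl": "true",
        "terms.sort": "index",  # index order rather than frequency order
        "terms.limit": n,
    }


def quote_term(term):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def docs_params(field, terms, fields="*"):
    """Request 2: a standard query for documents matching any of the terms."""
    clause = " OR ".join(quote_term(t) for t in terms)
    return {
        "q": f"{field}:({clause})",
        "sort": f"{field} asc",
        "rows": len(terms),  # assumes one document per term
        "fl": fields,
    }


def build_url(handler, params):
    """Join the assumed base URL, a handler path, and encoded parameters."""
    return f"{SOLR_BASE}/{handler}?{urlencode(params)}"


if __name__ == "__main__":
    # "callnum" and the starting value are hypothetical examples.
    print(build_url("terms", terms_params("callnum", "ML1500")))
    print(build_url("select", docs_params("callnum", ["ML1500", "ML1501"])))
```

The terms returned by the first request feed docs_params for the second. If several documents can share one call number, rows would need to be larger than the number of terms.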

-Yonik


> I assume the Terms Component approach will only give me the next X call number
> values, not the documents.
>
> It sounds like Glen Newton's suggestion of mapping the call numbers to a
> float number is the most likely solution.
>
> I know it sounds ridiculous to do all this for a "call number browse" but
> our faculty have explicitly asked for this.  For humanities scholars
> especially, they know the call numbers that are of interest to them, and
> they browse the stacks that way (ML 1500s are opera, V35 is verdi ...).
> They are using the research methods that have been successful for their
> entire careers.  Plus, library materials are going to off-site, high-density
> storage, so the only way for them to browse all materials, regardless of
> location, via call number is online.   I doubt they'll find this feature as
> useful as they expect, but it behooves us to give the users what they ask
> for.
>
> So yeah, our user needs are perhaps a little outside of your expectations.
>  :-)
>
> - Naomi
>
>
> On Nov 29, 2008, at 2:58 PM, Chris Hostetter wrote:
>
>>
>> : The results are correct.  But the response time sucks.
>> :
>> : Reading the docs about caches, I thought I could populate the query result
>> : cache with an autowarming query and the response time would be okay.  But that
>> : hasn't worked.  (See excerpts from my solrConfig file below.)
>> :
>> : A repeated query is very fast, implying caching happens for a particular
>> : starting point ("42" above).
>> :
>> : Is there a way to populate the cache with the ENTIRE sorted list of values for
>> : the field, so any arbitrary starting point will get results from the cache,
>> : rather than grabbing all results from (x) to the end, then sorting all these
>> : results, then returning the first 10?
>>
>> there are two "caches" that come into play for something like this...
>>
>> the first cache is a low level Lucene cache called the "FieldCache" that
>> is completely hidden from you (and for the most part: from Solr).
>> anytime you sort on a field, it gets built, and is reused for all sorts on
>> that field.  My original concern was that it wasn't getting warmed on
>> "newSearcher" (because you have to be explicit about that).
>>
>> the second cache is the queryResultCache which caches a "window" of an
>> ordered list of documents based on a query, and a sort.  you can see this
>> cache in your Solr stats, and yes: these two requests result in different
>> cache keys for the queryResultCache...
>>
>>       q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
>>       q=yourField:[52+TO+*]&sort=yourField+asc&rows=10
>>
>> ...BUT! ... the two queries below will result in the same cache key, and
>> the second will be a cache hit, provided a sufficient value for
>> the "queryResultWindowSize" ...
>>
>>       q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
>>       q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10
>>
>> so perhaps the key to your problem is to just make sure that once the user
>> gives you an id to start with, you "scroll" by increasing the start param
>> (not altering the id) ... the first query might be "slow" but every query
>> after that should be a cache hit (depending on your page size, and how far
>> you expect people to scroll, you should consider increasing
>> queryResultWindowSize)
>>
>> But as Yonik said: the new TermsComponent may actually be a better option
>> for you -- doing two requests for every page (the first to get the N Terms
>> in your id field starting with your input, the second to do a query for
>> docs matching any of those N ids) might actually be faster even though
>> there won't likely even be any cache hits.
>>
>>
>> My opinion:  Your use case sounds like a waste of effort.  I can't imagine
>> anyone using a library catalog system ever wanting to look up a call number,
>> and then scroll through all possible books with similar call numbers -- it
>> seems much more likely that i'd want to look at other books with similar
>> authors, or keywords, or tags ... all things that are actually *easier* to
>> do with Solr.  (but then again: i don't work in a library.  i trust that
>> you know something i don't about what your users want.)
>>
>>
>> -Hoss
>>
>
> Naomi Dushay
> ndushay@stanford.edu
>
>
>
>
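
For reference, the paging approach Hoss describes in the quoted message (keep the starting id fixed and only advance "start") can be sketched as below. This is a minimal illustration, not a tested client: the helper name page_params is made up here, "yourField" is the placeholder from the quoted examples, and the starting value is assumed to need no query escaping.

```python
def page_params(field, start_value, page, rows=10):
    """Parameters for one page of a sorted range query.

    q and sort depend only on field and start_value, so every page of the
    same browse session uses the same query and sort; only start changes.
    """
    return {
        "q": f"{field}:[{start_value} TO *]",
        "sort": f"{field} asc",
        "rows": rows,
        "start": page * rows,
    }


if __name__ == "__main__":
    first = page_params("yourField", "42", page=0)
    second = page_params("yourField", "42", page=1)
    # Same q and sort on both pages; only "start" differs.
    print(first)
    print(second)
```

Whether the second page is actually served from the cache depends on queryResultWindowSize in solrconfig.xml covering start + rows, as noted in the quoted message.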
