lucene-java-user mailing list archives

From Tarun Kumar <ta...@sumologic.com>
Subject Re: lucene index reader performance
Date Mon, 04 Jul 2016 10:07:34 GMT
Hey Michael,

docIds from multiple indices (on multiple machines) need to be
aggregated and sorted, and then only the first few thousand need to be
fetched. Those few thousand docs can be spread across multiple machines,
and each machine fetches the docs that live in its own indices. So pushing
the sort down to the server side doesn't cover the use case by itself. Is
there a faster alternative for fetching documents for a given set of docIds?
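
(Building on the doc-values idea from Mike's reply quoted below: if what is
actually needed per docId is just one or two fields, such as the time value,
rather than the whole stored document, reading them through doc values skips
the stored-fields path entirely. A rough sketch, assuming Lucene 4.x and that
"time" was indexed as a NumericDocValuesField; names are illustrative:)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.NumericDocValues;

public class TimeLookup {
  // Reads the per-document time value through doc values instead of calling
  // reader.document(docId), so no stored fields are read or decompressed.
  static long[] timesFor(IndexReader reader, int[] docIds) throws IOException {
    NumericDocValues times = MultiDocValues.getNumericValues(reader, "time");
    long[] result = new long[docIds.length];
    for (int i = 0; i < docIds.length; i++) {
      result[i] = (times == null) ? 0L : times.get(docIds[i]);
    }
    return result;
  }
}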

Thanks
Tarun

On Mon, Jul 4, 2016 at 3:17 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Why not ask Lucene to do the sort on your time field, instead of pulling
> millions of docids to the client and having it sort.  You could even do
> index-time sorting by time field if you want, which makes early termination
> possible (faster sorted searches).
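>
> (A minimal sketch of that first suggestion, assuming the Lucene 4.x API and
> a numeric "time" field; the field name and page size are illustrative:)
>
> import java.io.IOException;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Sort;
> import org.apache.lucene.search.SortField;
> import org.apache.lucene.search.TopDocs;
>
> public class SortedByTime {
>   // Lucene orders the hits by the time field (ascending) and returns only
>   // the requested number, instead of the client sorting millions of docIds.
>   static TopDocs topByTime(IndexSearcher searcher, Query query, int n) throws IOException {
>     Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
>     return searcher.search(query, n, byTime);
>   }
> }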
>
> But if, even with Lucene doing the sort, you still need to load millions
> of documents per search request, you are in trouble: you need to
> re-formulate your use case somehow to take advantage of what Lucene is good
> at (getting the top results for a search).
>
> Maybe you can use faceting to do whatever aggregation you are currently
> doing after retrieving those millions of documents.
>
> Maybe you could make a custom collector, and use doc values, to do your
> own custom aggregation.
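>
> (For the custom-collector route, a sketch against the 4.x Collector API;
> the "time" field and the min aggregation are just placeholders:)
>
> import java.io.IOException;
> import org.apache.lucene.index.AtomicReaderContext;
> import org.apache.lucene.index.NumericDocValues;
> import org.apache.lucene.search.Collector;
> import org.apache.lucene.search.Scorer;
>
> // Aggregates the per-hit "time" doc value without touching stored fields.
> public class MinTimeCollector extends Collector {
>   private NumericDocValues times;
>   private long minTime = Long.MAX_VALUE;
>
>   @Override
>   public void setScorer(Scorer scorer) {
>     // scores are not needed for this aggregation
>   }
>
>   @Override
>   public void setNextReader(AtomicReaderContext context) throws IOException {
>     times = context.reader().getNumericDocValues("time");
>   }
>
>   @Override
>   public void collect(int doc) {
>     if (times != null) {
>       minTime = Math.min(minTime, times.get(doc));
>     }
>   }
>
>   @Override
>   public boolean acceptsDocsOutOfOrder() {
>     return true;
>   }
>
>   public long getMinTime() {
>     return minTime;
>   }
> }
> // usage: searcher.search(query, new MinTimeCollector());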
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar <tarun@sumologic.com> wrote:
>
>> Thanks for the reply, Michael! In my application, I need to get millions
>> of documents per search.
>>
>> The use case is the following: return documents in increasing order of the
>> time field. The client (caller) can't hold more than a few thousand docs at
>> a time, so it gets all docIds and the corresponding time field for each
>> doc, sorts them on time, and then fetches n docs at a time. To support this
>> use case, I am:
>>
>> - getting all docIds first,
>> - sorting the docIds on the time field,
>> - querying n docIds at a time from the client; the server calls
>> indexReader.document(docId) for each of those n docs, combines them, and
>> returns them (roughly as sketched below).
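>>
>> (Roughly what that last step looks like on the server side today; the
>> names here are illustrative:)
>>
>> import java.io.IOException;
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.index.IndexReader;
>>
>> public class FetchByDocId {
>>   // Current per-request server step: one stored-fields read per docId.
>>   static List<Document> fetch(IndexReader indexReader, int[] requestedDocIds) throws IOException {
>>     List<Document> docs = new ArrayList<Document>();
>>     for (int docId : requestedDocIds) {
>>       docs.add(indexReader.document(docId));  // this call is the hot spot
>>     }
>>     return docs;
>>   }
>> }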
>>
>> indexReader.document(docId) is the bottleneck. What alternatives do you
>> suggest?
>>
>> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>>> Are you maybe trying to load too many documents for each search request?
>>>
>>> The IR.document API is designed to be used to load just a few hits, like
>>> a page worth or ~ 10 documents, per search.
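>>>
>>> (A sketch of that page-at-a-time pattern, assuming Lucene 4.x, a numeric
>>> "time" sort field, and an illustrative PAGE_SIZE; stored fields are loaded
>>> only for the current page of hits:)
>>>
>>> import java.io.IOException;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.Query;
>>> import org.apache.lucene.search.ScoreDoc;
>>> import org.apache.lucene.search.Sort;
>>> import org.apache.lucene.search.SortField;
>>> import org.apache.lucene.search.TopDocs;
>>>
>>> public class PagedFetch {
>>>   static final int PAGE_SIZE = 1000;  // illustrative page size
>>>
>>>   // Walks the result set page by page, loading stored fields only for the
>>>   // ~PAGE_SIZE hits of the current page instead of for millions of docIds.
>>>   static void forEachPage(IndexSearcher searcher, Query query) throws IOException {
>>>     Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
>>>     ScoreDoc after = null;
>>>     while (true) {
>>>       TopDocs page = (after == null)
>>>           ? searcher.search(query, PAGE_SIZE, byTime)
>>>           : searcher.searchAfter(after, query, PAGE_SIZE, byTime);
>>>       if (page.scoreDocs.length == 0) {
>>>         break;
>>>       }
>>>       for (ScoreDoc sd : page.scoreDocs) {
>>>         Document doc = searcher.doc(sd.doc);  // a few docs per call, as intended
>>>         // ... stream this page back to the caller
>>>       }
>>>       after = page.scoreDocs[page.scoreDocs.length - 1];
>>>     }
>>>   }
>>> }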
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar <tarun@sumologic.com>
>>> wrote:
>>>
>>>> I am running Lucene 4.6.1 and trying to get the documents corresponding
>>>> to a set of docIds. All threads get stuck (well, not stuck exactly, but
>>>> they spend a LOT of time) at:
>>>>
>>>> java.lang.Thread.State: RUNNABLE
>>>>         at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>>         at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>         at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>>         at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>>         at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>>         at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>>         at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>>         at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>>         at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>>         at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>>
>>>>
>>>> There is no disk throttling. What can cause this?
>>>>
>>>> Thanks
>>>> Tarun
>>>>
>>>
>>>
>>
>
