lucene-java-user mailing list archives

From: Tarun Kumar <ta...@sumologic.com>
Subject: Re: lucene index reader performance
Date: Thu, 07 Jul 2016 09:48:13 GMT
Any suggestions, please?

On Mon, Jul 4, 2016 at 3:37 PM, Tarun Kumar <tarun@sumologic.com> wrote:

> Hey Michael,
>
> docIds from multiple indices (on multiple machines) need to be
> aggregated and sorted, and the first few thousand need to be queried.
> These few thousand docs can be distributed among multiple machines;
> each machine will search the docs that are in its own indices. So
> pushing the sorting to the server side won't satisfy the use case. Is
> there an alternative way to get documents for given docIds faster?
>
> Thanks
> Tarun
>
> On Mon, Jul 4, 2016 at 3:17 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Why not ask Lucene to do the sort on your time field, instead of pulling
>> millions of docIds to the client and having it sort them?  You could even
>> do index-time sorting by the time field if you want, which makes early
>> termination possible (faster sorted searches).
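>>
>> A minimal sketch of a sorted search (assuming the time field is indexed
>> as a numeric field named "time", e.g. a LongField or
>> NumericDocValuesField; searcher and query are your own IndexSearcher
>> and Query):
>>
>>   Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
>>   TopDocs hits = searcher.search(query, 1000, byTime);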
>>
>> But if, even with Lucene doing the sort, you still need to load millions
>> of documents per search request, you are in trouble: you need to
>> re-formulate your use case somehow to take advantage of what Lucene is
>> good for (getting top results for a search).
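>>
>> Since your client can only hold a few thousand docs at a time, you
>> could also page through the sorted results with searchAfter instead of
>> materializing every docId up front (a rough sketch, same "time" field
>> assumption as above):
>>
>>   Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
>>   TopDocs page = searcher.search(query, 1000, byTime);
>>   while (page.scoreDocs.length > 0) {
>>     // hand this page to the client, then pull the next one
>>     ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
>>     page = searcher.searchAfter(last, query, 1000, byTime);
>>   }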
>>
>> Maybe you can use faceting to do whatever aggregation you are currently
>> doing after retrieving those millions of documents.
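>>
>> The facet APIs were reworked during 4.x, so this is only a rough sketch
>> in the style of 4.7+ (the 4.6 API differs); "category" is a
>> hypothetical dimension indexed with FacetField, and taxoReader is your
>> TaxonomyReader:
>>
>>   FacetsCollector fc = new FacetsCollector();
>>   FacetsCollector.search(searcher, query, 10, fc);
>>   Facets facets = new FastTaxonomyFacetCounts(taxoReader,
>>       new FacetsConfig(), fc);
>>   FacetResult counts = facets.getTopChildren(10, "category");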
>>
>> Maybe you could make a custom collector, and use doc values, to do your
>> own custom aggregation.
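>>
>> In 4.6 that means extending Collector and reading a NumericDocValues
>> field per segment. A minimal sketch (again assuming a numeric "time"
>> field, with a sum standing in for whatever aggregation you need):
>>
>>   import java.io.IOException;
>>   import org.apache.lucene.index.AtomicReaderContext;
>>   import org.apache.lucene.index.NumericDocValues;
>>   import org.apache.lucene.search.Collector;
>>   import org.apache.lucene.search.Scorer;
>>
>>   public class TimeAggregator extends Collector {
>>     private NumericDocValues times;
>>     private long sum; // stand-in aggregate
>>
>>     public void setScorer(Scorer scorer) {} // scores not needed
>>
>>     public void setNextReader(AtomicReaderContext ctx) throws IOException {
>>       times = ctx.reader().getNumericDocValues("time"); // per-segment view
>>     }
>>
>>     public void collect(int doc) {
>>       if (times != null) sum += times.get(doc); // no stored-fields read
>>     }
>>
>>     public boolean acceptsDocsOutOfOrder() { return true; }
>>
>>     public long getSum() { return sum; }
>>   }
>>
>> and run it with searcher.search(query, new TimeAggregator()).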
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar <tarun@sumologic.com> wrote:
>>
>>> Thanks for the reply, Michael! In my application, I need to get millions
>>> of documents per search.
>>>
>>> The use case is the following: return documents in increasing order of
>>> the time field. The client (caller) can't hold more than a few thousand
>>> docs at a time, so it gets all docIds and the corresponding time field
>>> for each doc, sorts them on time, and fetches n docs at a time. To
>>> support this use case, I am:
>>>
>>> - getting all docIds first,
>>> - sorting the docIds on the time field,
>>> - querying n docIds at a time from the client, which makes an
>>> indexReader.document(docId) call for each of the n docs on the server,
>>> combines the docs, and returns them (sketched below).
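>>>
>>> Roughly, the server-side loop for each batch looks like this (a sketch
>>> of my current approach; batch and results are placeholders):
>>>
>>>   for (int docId : batch) {
>>>     // one stored-fields read per docId; this is where the time goes
>>>     results.add(indexReader.document(docId));
>>>   }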
>>>
>>> indexReader.document(docId) is the bottleneck. What alternatives do you
>>> suggest?
>>>
>>> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <
>>> lucene@mikemccandless.com> wrote:
>>>
>>>> Are you maybe trying to load too many documents for each search request?
>>>>
>>>> The IR.document API is designed to load just a few hits, like a page's
>>>> worth (~10 documents), per search.
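>>>>
>>>> That intended pattern is, in sketch form:
>>>>
>>>>   TopDocs top = searcher.search(query, 10);
>>>>   for (ScoreDoc sd : top.scoreDocs) {
>>>>     Document doc = searcher.doc(sd.doc); // cheap for a page of hits
>>>>   }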
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar <tarun@sumologic.com>
>>>> wrote:
>>>>
>>>>> I am running Lucene 4.6.1 and trying to get the documents
>>>>> corresponding to a set of docIds. All threads get stuck (not stuck
>>>>> exactly, but they spend a LOT of time) at:
>>>>>
>>>>> java.lang.Thread.State: RUNNABLE
>>>>>         at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>>>         at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>         at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>>>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>>>         at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>>>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>>>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>>>         at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>>>         at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>>>         at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>>>         at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>>>         at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>>>         at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>>>
>>>>>
>>>>> There is no disk throttling. What can cause this?
>>>>>
>>>>> Thanks
>>>>> Tarun
>>>>>
>>>>
>>>>
>>>
>>
>
