From: Tarun Kumar
Date: Thu, 7 Jul 2016 15:18:13 +0530
Subject: Re: lucene index reader performance
To: Michael McCandless
Cc: Lucene Users
Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11c03a4e6b42c005370897dc
archived-at: Thu, 07 Jul 2016 09:48:55 -0000

--001a11c03a4e6b42c005370897dc
Content-Type:
text/plain; charset=UTF-8

Any suggestions, please?

On Mon, Jul 4, 2016 at 3:37 PM, Tarun Kumar wrote:

> Hey Michael,
>
> docIds from multiple indices (on multiple machines) need to be
> aggregated and sorted, and then the first few thousand need to be queried.
> These few thousand docs can be distributed among multiple machines. Each
> machine will search the docs that are in its own indices. So pushing the
> sort to the server side won't be enough for this use case. Is there an
> alternative way to get the documents for given docIds faster?
>
> Thanks
> Tarun
>
> On Mon, Jul 4, 2016 at 3:17 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Why not ask Lucene to do the sort on your time field, instead of pulling
>> millions of docids to the client and having it sort? You could even do
>> index-time sorting by time field if you want, which makes early termination
>> possible (faster sorted searches).
>>
>> But if, even with Lucene doing the sort, you still need to load millions
>> of documents per search request, you are in trouble: you need to
>> re-formulate your use case somehow to take advantage of what Lucene is good
>> for (getting top results for a search).
>>
>> Maybe you can use faceting to do whatever aggregation you are currently
>> doing after retrieving those millions of documents.
>>
>> Maybe you could make a custom collector, and use doc values, to do your
>> own custom aggregation.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar wrote:
>>
>>> Thanks for the reply, Michael! In my application, I need to get millions
>>> of documents per search.
>>>
>>> The use case is the following: return documents in increasing order of
>>> the time field. The client (caller) can't hold more than a few thousand
>>> docs at a time, so it gets all docIds and the corresponding time field for
>>> each doc, sorts them on time, and gets n docs at a time. To support this
>>> use case, I am:
>>>
>>> - Getting all docIds first.
>>> - Sorting docIds on the time field.
>>> - Querying n docIds at a time from the client, which makes an
>>> indexReader.document(docId) call for each of the n docs at the server,
>>> combines these docs, and returns them.
>>>
>>> indexReader.document(docId) is creating bottlenecks. What alternatives
>>> do you suggest?
>>>
>>> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <
>>> lucene@mikemccandless.com> wrote:
>>>
>>>> Are you maybe trying to load too many documents for each search request?
>>>>
>>>> The IR.document API is designed to be used to load just a few hits,
>>>> like a page's worth, or ~10 documents, per search.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar
>>>> wrote:
>>>>
>>>>> I am running Lucene 4.6.1. I am trying to get documents corresponding
>>>>> to docIds. All threads get stuck (they don't get stuck exactly, but
>>>>> spend a LOT of time) at:
>>>>>
>>>>> java.lang.Thread.State: RUNNABLE
>>>>>     at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>>>     at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>     at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>>>     at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>>>     at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>>>     at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>>>     at
>>>>>     org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>>>     at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>>>     at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>>>     at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>>>
>>>>>
>>>>> There is no disk throttling. What could cause this?
>>>>>
>>>>> Thanks
>>>>> Tarun
>>>>>
>>>>
>>>>
>>>
>>
>

--001a11c03a4e6b42c005370897dc--
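
One possible way to combine the two positions in this thread (let each index sort by time, yet serve docs spread over several machines) is a k-way merge on the caller's side. Each machine returns its own hits already ordered by time, and the caller pulls only the next n across all machines, so it never has to hold every docId. The sketch below is plain Java and uses no Lucene API. `ShardMerger`, `Hit`, and `nextPage` are hypothetical names, and it assumes each machine can expose its time-sorted hits as an `Iterator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: merges per-machine, time-sorted hit streams page by page. */
final class ShardMerger {

    /** Hypothetical hit: source shard, docId within that shard, and the time field value. */
    public static final class Hit {
        public final int shard;
        public final int docId;
        public final long time;

        public Hit(int shard, int docId, long time) {
            this.shard = shard;
            this.docId = docId;
            this.time = time;
        }
    }

    /** The smallest not-yet-returned hit of one shard, plus the rest of that shard's stream. */
    private static final class Cursor {
        final Hit head;
        final Iterator<Hit> rest;

        Cursor(Hit head, Iterator<Hit> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    // Min-heap ordered by the time of each shard's current head.
    private final PriorityQueue<Cursor> queue =
            new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head.time));

    /** Assumption: every iterator yields hits in ascending time order. */
    public ShardMerger(List<Iterator<Hit>> shards) {
        for (Iterator<Hit> shard : shards) {
            if (shard.hasNext()) {
                queue.add(new Cursor(shard.next(), shard));
            }
        }
    }

    /** Returns up to n further hits in ascending time order; empty once all shards are drained. */
    public List<Hit> nextPage(int n) {
        List<Hit> page = new ArrayList<>();
        while (page.size() < n && !queue.isEmpty()) {
            Cursor smallest = queue.poll();
            page.add(smallest.head);
            if (smallest.rest.hasNext()) {
                queue.add(new Cursor(smallest.rest.next(), smallest.rest));
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Hit> shard0 = Arrays.asList(new Hit(0, 7, 10L), new Hit(0, 3, 40L));
        List<Hit> shard1 = Arrays.asList(new Hit(1, 5, 20L), new Hit(1, 9, 30L), new Hit(1, 2, 50L));
        ShardMerger merger = new ShardMerger(Arrays.asList(shard0.iterator(), shard1.iterator()));
        for (Hit hit : merger.nextPage(3)) {
            System.out.println("shard=" + hit.shard + " docId=" + hit.docId + " time=" + hit.time);
        }
    }
}
```

The caller holds at most one pending hit per machine plus the current page. Only the hits in a returned page would then need a stored-fields lookup on their owning machine, which keeps the number of document loads per request close to the page size Mike describes.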