From: András Péteri
Date: Sun, 9 Aug 2015 17:51:54 +0200
Subject: Re: Mapping doc values back to doc ID (in decent time)
To: java-user@lucene.apache.org

If I understand it correctly, the Zoie library [1][2] implements the
"sledgehammer" approach by collecting docValues for all documents when a
segment reader is opened. If you have some RAM to throw at the problem,
this could indeed bring you an acceptable level of performance.

[1] http://senseidb.github.io/zoie/
[2] https://github.com/senseidb/zoie/blob/master/zoie-core/src/main/java/proj/zoie/api/impl/DocIDMapperImpl.java

On Sun, Aug 9, 2015 at 9:41 AM, Trejkaz wrote:

> On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand wrote:
> > Does your application actually iterate in order over dense ids, or is
> > it just for benchmarking purposes? Because if it does, you probably
> > don't actually need seeking, you could just see what the current ID in
> > the terms enum is.
>
> Both dense ID fetches and individual ID fetches exist in the
> application. I put them in a benchmark deliberately doing it as
> individual fetches to get an idea of average timing for a single
> operation.
>
> There are so many use cases of doing the individual fetches that it's
> tough to enumerate. The first one I found was "fetch the term vector
> for ID + field" but I'm sure there will be tons of them.
>
> For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
> would probably use something like DocValuesTermsQuery for that to get
> them all in one shot. I also wondered whether writing our filters as
> queries would help, but I think it would turn out to be about as fast
> as DocValuesTermsQuery even if I did that.
>
> I'm sure the only way to really improve the speed of these filters is
> to start storing these things in the text index and use query-time
> joins, but I can't do that until I solve the issue of relying on
> stable doc IDs and it seems like trying to solve two large problems in
> a single commit would be biting off more than I can chew.
>
> > If you actually need seeking, then you should try
> > to avoid MultiFields, it will call seekExact on each segment, while
> > given what I see you could just stop after you found one segment with
> > the value.
>
> Ah, I did wonder whether MultiFields had any behaviour like that, so
> that definitely means that I will avoid using it. Then I can try other
> tricks, like trying the seeks in order of segment size (the largest
> segment is most likely to contain the hit.)
>
> TX

-- 
András
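
To make the "sledgehammer" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Zoie's or Lucene's actual code: the names DocIdLookup, SegmentMap, docIdFor and NOT_FOUND are made up for illustration, a long[] stands in for a segment's doc-values column, and IDs are assumed to be unique across the index. Each segment builds its ID -> local doc ID map eagerly (the RAM cost), and a lookup probes segments largest first and stops at the first hit, as suggested in the quoted thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an eager ID -> doc ID mapper. Hypothetical names; not Lucene or Zoie code. */
final class DocIdLookup {

    /** Returned when no segment contains the requested ID. */
    static final int NOT_FOUND = -1;

    /** One segment: its doc base plus an in-memory map from application ID to local doc ID. */
    static final class SegmentMap {
        final int docBase;
        final Map<Long, Integer> idToLocalDoc = new HashMap<>();

        /** idsByDoc[i] is the ID stored for local doc i (stand-in for a doc-values column). */
        SegmentMap(int docBase, long[] idsByDoc) {
            this.docBase = docBase;
            // Built once, when the segment is "opened": this is where the RAM goes.
            for (int doc = 0; doc < idsByDoc.length; doc++) {
                idToLocalDoc.put(idsByDoc[doc], doc);
            }
        }

        int size() {
            return idToLocalDoc.size();
        }
    }

    private final List<SegmentMap> segmentsLargestFirst;

    DocIdLookup(List<SegmentMap> segments) {
        List<SegmentMap> sorted = new ArrayList<>(segments);
        // Largest segment first: it is the most likely to contain the hit.
        sorted.sort((a, b) -> Integer.compare(b.size(), a.size()));
        this.segmentsLargestFirst = sorted;
    }

    /** Returns the index-wide doc ID for the given ID, or NOT_FOUND. */
    int docIdFor(long id) {
        for (SegmentMap segment : segmentsLargestFirst) {
            Integer local = segment.idToLocalDoc.get(id);
            if (local != null) {
                // IDs are assumed unique, so there is no need to probe further segments.
                return segment.docBase + local;
            }
        }
        return NOT_FOUND;
    }

    public static void main(String[] args) {
        List<SegmentMap> segments = new ArrayList<>();
        segments.add(new SegmentMap(0, new long[] {10L, 11L}));
        segments.add(new SegmentMap(2, new long[] {20L, 21L, 22L}));
        DocIdLookup lookup = new DocIdLookup(segments);
        System.out.println(lookup.docIdFor(21L)); // local doc 1 in the segment with docBase 2
        System.out.println(lookup.docIdFor(99L)); // not present in any segment
    }
}
```

In a real index the maps would have to be rebuilt whenever a new segment reader is opened, and a primitive-keyed map would use noticeably less memory than the boxed HashMap shown here.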