From: Ravikumar Govindarajan
Date: Wed, 3 Jul 2019 11:21:57 +0530
Subject: Re: block min-max values for Sort Field with Top-N query..
To: java-user@lucene.apache.org

Thanks Mikhail & Adrien for the help

> This is the same principle that we apply for block-max WAND so
> theoretically that would work, though in practice it might be a bit
> hard to implement due to the fact that we don't have the APIs that you
> will need.

Aah, did not know block-max WAND is now in Lucene! So what I am proposing
looks identical to Bm-WAND.. The heavy lifting is already done in the Lucene
codebase.

Think it should be straightforward for us to wrap DocValues in a
CustomCodec to track block min-max ords. We shall give this a shot anyway &
see how it goes

> Directly index the field as a term frequency instead of doc
> values, e.g. using FeatureField. One downside is that you can only
> sort in one order efficiently.

Thanks for the suggestion. Sure, will try & dabble with FeatureField too!

--
Ravi

On Tue, Jul 2, 2019 at 6:52 PM Adrien Grand wrote:

> Hello,
>
> This is the same principle that we apply for block-max WAND so
> theoretically that would work, though in practice it might be a bit
> hard to implement due to the fact that we don't have the APIs that you
> will need.
>
> I have considered the idea of adding information about blocks to doc
> values a couple times, but I think it'd be better to either:
>  - Directly index the field as a term frequency instead of doc
> values, e.g. using FeatureField. One downside is that you can only
> sort in one order efficiently.
>  - Or use LongDistanceFeatureQuery if your field is also indexed
> with points, by passing the max value of your index as the "origin" if
> you want to sort in decreasing order and the min value if you want to
> sort in increasing order. This would be a bit less efficient than
> FeatureField but would allow sorting in either ascending or descending
> order.
>
> On Tue, Jul 2, 2019 at 3:01 PM Ravikumar Govindarajan wrote:
> >
> > Our Sort Fields utilize DocValues..
> >
> > Let's say I collect min-max ords of a Sort Field for a block of documents
> > (128, 256 etc..) at index-time via Codec & store it as part of DocValues
> > at a Segment level..
> >
> > During query time, could we take advantage of these stats when a Top-N
> > query with a Sort Field is requested?
> >
> > Typically, what I had in mind is a SortStats class with the following
> > method
> >
> > int seek(int maxDocSeenTillNow, int minSortOrdSeenTillNow,
> >          boolean sortDesc) {
> >     // 1. Fetch the doc-ranges that have >= minSortOrdSeenTillNow
> >     // 2. Return the least doc-range >= maxDocSeenTillNow (if sortDesc=true)
> >     //    Return the least doc-range <= maxDocSeenTillNow (if sortDesc=false)
> > }
> >
> > The Top-N Collector can keep track of the maxDocSeenTillNow &
> > minSortOrdSeenTillNow variables during query time & then call
> > SortStats.seek() for a possible skip of blocks of documents that may
> > otherwise be needlessly offered & popped out from the priority queue
> >
> > I understand this simplistic logic depends on sort-field data
> > distribution & won't work for multi-sort field queries or out-of-order
> > scoring etc..
> >
> > But, in general will this be a good idea to explore or something that is
> > best not attempted?
> >
> > Any help is much appreciated
> >
> > --
> > Ravi
>
> --
> Adrien
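
The block-skipping idea from the quoted proposal can be sketched in plain Java with no Lucene dependency. Everything here is a hypothetical illustration under stated assumptions: the class BlockMinMaxTopN, the SortStats constructor, the four-argument seek() signature and the docsVisited counter are invented for this sketch and are not Lucene APIs. It assumes a single sort field, docs visited in increasing doc order, and ties resolved in favour of the earlier doc.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch, not a Lucene API: top-N over sort ords with block min/max skipping.
final class BlockMinMaxTopN {

    // Per-block min/max ords, as a custom codec might record them at index time (assumption).
    static final class SortStats {
        final int blockSize;
        final long[] min;
        final long[] max;

        SortStats(long[] ords, int blockSize) {
            this.blockSize = blockSize;
            int blocks = (ords.length + blockSize - 1) / blockSize;
            min = new long[blocks];
            max = new long[blocks];
            for (int b = 0; b < blocks; b++) {
                int end = Math.min((b + 1) * blockSize, ords.length);
                min[b] = Long.MAX_VALUE;
                max[b] = Long.MIN_VALUE;
                for (int d = b * blockSize; d < end; d++) {
                    min[b] = Math.min(min[b], ords[d]);
                    max[b] = Math.max(max[b], ords[d]);
                }
            }
        }

        // First doc >= doc whose block could still beat `bottom`, or maxDoc if there is none.
        int seek(int doc, long bottom, boolean sortDesc, int maxDoc) {
            for (int b = doc / blockSize; b < min.length; b++) {
                boolean competitive = sortDesc ? max[b] > bottom : min[b] < bottom;
                if (competitive) {
                    return Math.max(doc, b * blockSize);
                }
            }
            return maxDoc;
        }
    }

    // Instrumentation only: how many docs the last topN() call actually looked at.
    static int docsVisited;

    // Returns the n best ords, best first. Requires n >= 1.
    static long[] topN(long[] ords, SortStats stats, int n, boolean sortDesc) {
        // The queue head is always the weakest value currently kept.
        Comparator<Long> weakestFirst =
                sortDesc ? Comparator.<Long>naturalOrder() : Comparator.<Long>reverseOrder();
        PriorityQueue<Long> queue = new PriorityQueue<>(n, weakestFirst);
        docsVisited = 0;
        int doc = 0;
        while (doc < ords.length) {
            if (queue.size() == n) {
                // Queue is full: jump over blocks that cannot contain a better value.
                doc = stats.seek(doc, queue.peek(), sortDesc, ords.length);
                if (doc >= ords.length) {
                    break;
                }
            }
            docsVisited++;
            long value = ords[doc];
            if (queue.size() < n) {
                queue.add(value);
            } else {
                long bottom = queue.peek();
                if (sortDesc ? value > bottom : value < bottom) {
                    queue.poll();
                    queue.add(value);
                }
            }
            doc++;
        }
        // Polling yields weakest first, so fill the result from the back.
        long[] result = new long[queue.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = queue.poll();
        }
        return result;
    }

    public static void main(String[] args) {
        long[] ords = {5, 3, 9, 1, 2, 2, 1, 0, 8, 7, 10, 4};
        SortStats stats = new SortStats(ords, 4);
        System.out.println(Arrays.toString(topN(ords, stats, 3, true))
                + " visited=" + docsVisited);
    }
}
```

In the demo data with a block size of 4, the middle block has a max ord of 2, so a descending top-3 never visits it once the queue already holds three larger values. As the original question notes, how much is skipped depends entirely on how the sort values are distributed across blocks.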