From: Michael Sokolov
Date: Tue, 11 Jun 2019 07:36:05 -0400
Subject: Re: Sampled Queries -- Use Cases and Feedback
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000099b79d058b0ab442"

--00000000000099b79d058b0ab442
Content-Type: text/plain; charset="UTF-8"

Atri,
in the abstract it sounds like a great idea, but in practice it will only be as good as the data that drives it. I think that to make this work it would be a good idea to write up a proposal of some sort, targeting different open-source (or commercial, although I doubt you would get much from them) projects that use Lucene-based search, and asking them to contribute their data.

Also, can we learn anything from the previous attempt? What did they try? How can this effort avoid the same pitfalls?

Even with document and query data, you still need some kind of relevance ground truth, and this is notoriously difficult to get. Click-through stats are probably the most generic proxy for that. So, as a thought experiment, maybe contact Wikipedia and ask if they would be willing to share some sample of queries and logs. Or did you have another idea for how to drive this? Then, with one pilot participant, you could maybe get others to join.

I think that once you have some commitments, or at least serious expressions of interest, from data providers, you can start to think about what to actually do with the data, but I would start there.

On Mon, Jun 10, 2019, 2:54 AM Atri Sharma wrote:

> Any thoughts on this? I am envisioning applications to machine
> learning systems, where the training dataset might be a small sample
> of the entire dataset, and the user wants scoring to be done only on
> samples of the dataset.
>
> On Fri, Jun 7, 2019 at 5:45 PM Atri Sharma wrote:
> >
> > Hi All,
> >
> > While working on a new Query type, I was inclined to think of a couple
> > of use cases where the documents being scored need not be all of the
> > data set, but a sample of them. This can be useful for very large
> > datasets, where a query is only interested in getting the "feel" of
> > the data, and other queries where the data is being aggregated over
> > time, so a wide enough sample of the data is good enough for the user
> > at the tradeoff of improved performance. Faceting already has sampling
> > mechanisms, so there are ideas to be borrowed from that part.
> >
> > I have some ideas on introducing a new query type and associated
> > semantics to allow this functionality to be present from ground up.
> > Specifically, a query type which wraps another query and "feeds"
> > offsets to the inner query, along with a limit of collection of hits.
> > I can go in more detail, but wanted to get some thoughts and feedback
> > before delving deeper.
> >
> > Atri
> >
>
> --
> Regards,
>
> Atri
> Apache Concerted
>

--00000000000099b79d058b0ab442--