From: Michael McCandless
To: dev@lucene.apache.org
Date: Fri, 1 Jul 2011 08:51:04 -0400
Subject: Re: revisit naming for grouping/join?

I think joining and grouping are two different functions, and we
should keep different modules for them...
On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir wrote:
> Hi,
>
> when looking at just a very quick glance at some of the newer
> grouping/join features, I found myself a little confused about what is
> exactly what, and I think users might too.

They are confusing!

> I discussed some of this with hossman, and it only seemed to make me
> even more totally confused about:
> * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing
undersells (it's only one specific way to use grouping).  EG, grouping
w/o collapsing is useful (eg, Best Buy grouping hits by product
category and showing the top 5 in each).

> * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can join
(either during indexing or during searching) in many different ways,
and nested docs is just one use case.  EG, maybe your docs are doctors
but during indexing you join to a city table with facts about that
city (each doctor's office is in a specific city) and then you want to
run queries like "city's avg annual temp > 60 and doctor has good
bedside manner" or something.

> * difference between index-time-join/nested documents and single-pass
> index-time grouping. Is the former only a more general case of the
> latter?

Grouping is purely a presentation concern -- you are not altering
which docs hit; you are simply changing how you pick which hits to
display ("top N by group").  So we only have collectors here.  The
"generic" (requires 2 passes) collectors can group on anything at
search time; the "doc block" collector requires that you indexed all
docs in each group as a block.

Join is both about restricting matches and also presentation of hits,
because your query needs to match fields from different [logical]
tables (so, the module has a Query and a Collector).
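To make the grouping-vs-join distinction above concrete, here is a
rough sketch in plain Python.  It is not Lucene's API: the function
names, the toy hits, and the doctor/city records (including the
temperatures) are all hypothetical, made up for illustration.
Grouping leaves the set of matching docs alone and only picks the top
N per group; join decides whether a doc matches at all by checking
fields from both logical tables.

```python
from collections import defaultdict

# Hypothetical toy data: (doc_id, category, score) for hits a query already matched.
hits = [
    (1, "tv", 0.9),
    (2, "tv", 0.7),
    (3, "laptop", 0.8),
    (4, "tv", 0.4),
    (5, "laptop", 0.3),
]


def top_n_by_group(hits, n):
    """Grouping: every group is kept; only the best n hits per group are shown."""
    groups = defaultdict(list)
    for doc_id, group, score in hits:
        groups[group].append((doc_id, score))
    return {
        group: sorted(docs, key=lambda d: d[1], reverse=True)[:n]
        for group, docs in groups.items()
    }


# Hypothetical "tables" for a join: doctors (child) and cities (parent).
doctors = [
    {"id": 10, "city": "Austin", "bedside_manner": "good"},
    {"id": 11, "city": "Boston", "bedside_manner": "good"},
    {"id": 12, "city": "Austin", "bedside_manner": "poor"},
]
cities = {"Austin": {"avg_temp_f": 69}, "Boston": {"avg_temp_f": 52}}


def join_match(doctors, cities, min_temp):
    """Join: a doctor matches only if both the doctor and its city qualify."""
    return [
        d["id"]
        for d in doctors
        if d["bedside_manner"] == "good"
        and cities[d["city"]]["avg_temp_f"] > min_temp
    ]


grouped = top_n_by_group(hits, 2)
matched = join_match(doctors, cities, 60)
```

Note that top_n_by_group never drops a group, it only trims what is
displayed per group, while join_match changes which docs are hits in
the first place.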
When you get the results back, you may or may not be interested in
retaining the table structure in your result set (ie, you may not have
selected fields from the child table).

Similarly, "generic" joining (in Solr/ElasticSearch today but I'd like
to factor into the join module) can do any join at search time, while
the "doc block" collector requires that you did the necessary join(s)
during indexing.

> * difference between the above joinish capabilities and solr's join
> impl... other than the single-pass/index-time limitation (which is
> really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join
anything at search time (even, across 2 different indexes), vs doc
block join where you must pick which joins you will ever want to use
and then build the index accordingly.

You can also mix the two.  Maybe you do certain joins while indexing,
but then at search time you do other joins "generically".  That's
fine.  (Same is true for grouping).

> I think its especially interesting since the join module depends on
> the grouping module.

The join module does currently depend on the grouping module, but for
a silly reason: just for the TopGroups, to represent the returned
hits.  We could move TopGroups/GroupDocs into common (thus justifying
its generic name!)?  Then both join and grouping modules depend on
common.

Really TopGroups is just a TopDocs that allows some recursion (ie,
each hit may in turn be another TopDocs).  But TopGroups is limited
now to only depth 2 recursion... we need to fix this for nested
grouping.  Really we just need a recursive TopDocs here....

> So I am curious if we should:
> * add docs (maybe with simple examples) in the package.html or
> otherwise that differentiate what these guys are, or at least agree on
> some consistent terminology and define it somewhere? I feel like
> people have explained to me the differences in all these things
> before, but then its easy to forget.
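On the "recursive TopDocs" point above, a minimal sketch of what such
a structure could look like, again in plain Python rather than
Lucene's API.  The Hit and Results classes and the depth() helper are
hypothetical names invented here; the only idea taken from the thread
is that each hit may in turn hold another result list, with no fixed
limit on nesting.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hit:
    doc_id: int
    score: float
    # Hypothetical field: nested results under this hit, or None for a leaf hit.
    children: Optional["Results"] = None


@dataclass
class Results:
    total_hits: int
    hits: List[Hit] = field(default_factory=list)


def depth(results):
    """Nesting depth: 1 for a flat hit list, 2 for a groups-of-docs shape."""
    nested = [depth(h.children) for h in results.hits if h.children is not None]
    return 1 + (max(nested) if nested else 0)


# A flat list, a depth-2 shape (groups of docs), and a depth-3 nested grouping.
flat = Results(2, [Hit(1, 0.9), Hit(2, 0.5)])
two_level = Results(1, [Hit(0, 0.9, children=flat)])
three_level = Results(1, [Hit(0, 0.9, children=two_level)])
```

The depth-2 case corresponds to the shape described for TopGroups
today; the depth-3 case is what nested grouping would need.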
Well, each module's package.html has a start here, but I agree we
should do more.

I think what would be best is a smallish but feature-complete demo, ie
pull together some easy-to-understand sample content and then build a
small demo app around it.  We could then show how to use grouping for
field collapsing (and for other use cases), joining for nested docs
(and for other use cases), etc.

Mike McCandless

http://blog.mikemccandless.com