Date: Fri, 1 Jul 2011 08:51:04 -0400
Message-ID: <BANLkTimdWgt_YQb2ZcV=CBPk3iXm+cRY=g@mail.gmail.com>
Subject: Re: revisit naming for grouping/join?
From: Michael McCandless <lucene@mikemccandless.com>
To: dev@lucene.apache.org
Content-Type: text/plain; charset=ISO-8859-1

I think joining and grouping are two different functions, and we
should keep different modules for them...

On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Hi,
>
> when looking at just a very quick glance at some of the newer
> grouping/join features, I found myself a little confused about what is
> exactly what, and I think users might too.

They are confusing!

> I discussed some of this with hossman, and it only seemed to make me
> even more totally confused about:
> * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing
undersells it (collapsing is only one specific way to use grouping).
E.g., grouping without collapsing is useful (Best Buy grouping hits by
product category and showing the top 5 in each).
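
To make the distinction concrete, here is a rough sketch in plain
Python (not the Lucene API; the function name, doc ids, and scores are
all made up for illustration). Collapsing falls out as the special
case of keeping one hit per group:

```python
from collections import defaultdict

def top_n_per_group(hits, group_of, n):
    """hits: list of (doc_id, score); group_of: doc_id -> group key."""
    groups = defaultdict(list)
    for doc_id, score in hits:
        groups[group_of[doc_id]].append((doc_id, score))
    # Keep the n best-scoring hits in each group, best first.
    return {
        g: sorted(docs, key=lambda h: h[1], reverse=True)[:n]
        for g, docs in groups.items()
    }

# Hypothetical hits and product categories.
hits = [("tv1", 0.9), ("tv2", 0.7), ("cam1", 0.8), ("tv3", 0.5), ("cam2", 0.3)]
category = {"tv1": "tv", "tv2": "tv", "tv3": "tv",
            "cam1": "camera", "cam2": "camera"}

grouped = top_n_per_group(hits, category, n=2)    # "top N in each category"
collapsed = top_n_per_group(hits, category, n=1)  # field collapsing
```

The same set of docs matches in both calls; only the choice of which
hits to show per group changes.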

> * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can
join (either during indexing or during searching) in many different
ways, and nested docs is just one use case.

EG, maybe your docs are doctors but during indexing you join to a city
table with facts about that city (each doctor's office is in a
specific city) and then you want to run queries like "city's avg
annual temp > 60 and doctor has good bedside manner" or something.
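
One simple way to read that example is as denormalizing at index time.
A rough sketch in plain Python, assuming that reading (not Lucene
code; the field names, doctors, and temperatures are invented sample
data):

```python
# Made-up sample data; field names are illustrative only.
cities = {
    "Miami": {"avg_annual_temp_f": 77},
    "Boston": {"avg_annual_temp_f": 52},
}
doctors = [
    {"name": "Dr. Adams", "city": "Miami", "bedside_manner": "good"},
    {"name": "Dr. Baker", "city": "Miami", "bedside_manner": "poor"},
    {"name": "Dr. Clark", "city": "Boston", "bedside_manner": "good"},
]

def index_doctors(doctors, cities):
    """Index-time join: copy each city's facts into its doctors' docs."""
    index = []
    for doc in doctors:
        joined = dict(doc)
        for key, value in cities[doc["city"]].items():
            joined["city_" + key] = value
        index.append(joined)
    return index

index = index_doctors(doctors, cities)

# "city's avg annual temp > 60 and doctor has good bedside manner"
warm_and_kind = [
    d["name"] for d in index
    if d["city_avg_annual_temp_f"] > 60 and d["bedside_manner"] == "good"
]
```

The join work is paid once while indexing, so the query itself only
touches one logical table.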

> * difference between index-time-join/nested documents and single-pass
> index-time grouping. Is the former only a more general case of the
> latter?

Grouping is purely a presentation concern -- you are not altering
which docs hit; you are simply changing how you pick which hits to
display ("top N by group").  So we only have collectors here.

The "generic" (requires 2 passes) collectors can group on anything at
search time; the "doc block" collector requires that you indexed all
docs in each group as a block.
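
A rough sketch of the two strategies in plain Python (the function
names and data are hypothetical and this is not how the actual
collectors are written; it only illustrates why one needs two passes
and the other needs each group's docs to be adjacent):

```python
import heapq
from itertools import groupby

def two_pass(hits, top_groups, docs_per_group):
    """hits: (doc_id, group, score) tuples in any order."""
    # Pass 1: find the best score per group, then pick the top groups.
    best = {}
    for _, g, s in hits:
        if g not in best or s > best[g]:
            best[g] = s
    winners = heapq.nlargest(top_groups, best, key=best.get)
    # Pass 2: collect the top docs inside the winning groups only.
    per_group = {g: [] for g in winners}
    for d, g, s in hits:
        if g in per_group:
            per_group[g].append((s, d))
    return [(g, [d for _, d in heapq.nlargest(docs_per_group, per_group[g])])
            for g in winners]

def single_pass_blocks(hits, top_groups, docs_per_group):
    """Same result in one pass, but each group's hits must be adjacent."""
    results = []
    for g, block in groupby(hits, key=lambda h: h[1]):
        top = heapq.nlargest(docs_per_group, ((s, d) for d, _, s in block))
        results.append((top[0][0], g, [d for _, d in top]))
    return [(g, ds) for _, g, ds in heapq.nlargest(top_groups, results)]

# Hypothetical hits, already laid out as one block per group.
hits = [(0, "a", 0.2), (1, "a", 0.9), (2, "b", 0.5),
        (3, "b", 0.4), (4, "b", 0.6), (5, "c", 0.1)]

generic = two_pass(hits, top_groups=2, docs_per_group=2)
blocked = single_pass_blocks(hits, top_groups=2, docs_per_group=2)
```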

Join is both about restricting matches and also presentation of hits,
because your query needs to match fields from different [logical]
tables (so, the module has a Query and a Collector).  When you get the
results back, you may or may not be interested in retaining the table
structure in your result set (ie, you may not have selected fields
from the child table).

Similarly, "generic" joining (in Solr/ElasticSearch today but I'd like
to factor into the join module) can do any join at search time, while
the "doc block" collector requires that you did the necessary join(s)
during indexing.
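
A rough sketch of the doc-block flavor in plain Python (hypothetical
names and data, not the join module's actual classes). It assumes
each block is indexed as its child docs followed by their parent, and
it shows both roles: restricting matches and optionally keeping the
table structure in the results:

```python
def block_join(index, parent_pred, child_pred):
    """Return (parent, matching_children) for each block that matches."""
    results = []
    children = []
    for doc in index:
        if doc["type"] == "child":
            if child_pred(doc):
                children.append(doc)
        else:
            # A parent doc closes its block.
            if children and parent_pred(doc):
                results.append((doc, children))
            children = []
    return results

# Hypothetical index: products (parents) with their SKUs (children).
index = [
    {"type": "child", "sku": "s1", "color": "red"},
    {"type": "child", "sku": "s2", "color": "blue"},
    {"type": "parent", "name": "shirt"},
    {"type": "child", "sku": "s3", "color": "red"},
    {"type": "parent", "name": "hat"},
    {"type": "child", "sku": "s4", "color": "blue"},
    {"type": "parent", "name": "scarf"},
]

matches = block_join(
    index,
    parent_pred=lambda p: p["name"] in {"shirt", "scarf"},
    child_pred=lambda c: c["color"] == "red",
)

# Presentation choice: drop the child table, or keep it.
parents_only = [p["name"] for p, _ in matches]
with_children = [(p["name"], [c["sku"] for c in kids]) for p, kids in matches]
```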

> * difference between the above joinish capabilities and solr's join
> impl... other than the single-pass/index-time limitation (which is
> really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join
anything at search time (even across 2 different indexes), vs doc
block join where you must pick which joins you will ever want to use
and then build the index accordingly.
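
For contrast, a rough sketch of a search-time join in plain Python.
This is only the general idea (collect join keys from one side, filter
the other side by them), not Solr's or ElasticSearch's actual
implementation; the function, field names, and data are made up:

```python
def search_time_join(from_index, from_pred, from_field, to_index, to_field):
    """Join two independent indexes on a key, entirely at search time."""
    # Step 1: run the "from" query and collect its join keys.
    keys = {doc[from_field] for doc in from_index if from_pred(doc)}
    # Step 2: keep "to" docs whose join field is one of those keys.
    return [doc for doc in to_index if doc[to_field] in keys]

# Two separate, hypothetical indexes; nothing was joined while indexing.
city_index = [
    {"city": "Miami", "avg_annual_temp_f": 77},
    {"city": "Boston", "avg_annual_temp_f": 52},
]
doctor_index = [
    {"name": "Dr. Adams", "city": "Miami"},
    {"name": "Dr. Baker", "city": "Miami"},
    {"name": "Dr. Clark", "city": "Boston"},
]

warm_city_doctors = [
    d["name"]
    for d in search_time_join(
        city_index, lambda c: c["avg_annual_temp_f"] > 60, "city",
        doctor_index, "city",
    )
]
```

No join decision was baked into either index, which is the extra
generality; the cost is doing the key collection on every search.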

You can also mix the two.  Maybe you do certain joins while indexing,
but then at search time you do other joins "generically".  That's
fine.  (Same is true for grouping).

> I think its especially interesting since the join module depends on
> the grouping module.

The join module does currently depend on the grouping module, but for
a silly reason: just for the TopGroups, to represent the returned
hits.  We could move TopGroups/GroupDocs into common (thus justifying
its generic name!)?  Then both join and grouping modules depend on
common.

Really TopGroups is just a TopDocs that allows some recursion (ie,
each hit may in turn be another TopDocs).  But TopGroups is limited
now to only depth 2 recursion... we need to fix this for nested
grouping.  Really we just need a recursive TopDocs here....
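
A rough sketch of what such a recursive structure could look like, in
plain Python. The class and field names here are hypothetical, not
the existing TopDocs/TopGroups classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Hit:
    doc_id: int
    score: float
    # A hit may carry its own nested result set (a group, child docs, ...).
    children: Optional["Hits"] = None

@dataclass
class Hits:
    total: int
    hits: List[Hit] = field(default_factory=list)

def depth(results: Hits) -> int:
    """How many levels of nesting this result set has."""
    nested = [depth(h.children) for h in results.hits if h.children is not None]
    return 1 + max(nested, default=0)

# Three levels: nothing in the structure caps recursion at depth 2.
leaf = Hits(total=1, hits=[Hit(doc_id=7, score=0.4)])
middle = Hits(total=1, hits=[Hit(doc_id=3, score=0.6, children=leaf)])
top = Hits(total=1, hits=[Hit(doc_id=1, score=0.9, children=middle)])
```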

> So I am curious if we should:
> * add docs (maybe with simple examples) in the package.html or
> otherwise that differentiate what these guys are, or at least agree on
> some consistent terminology and define it somewhere? I feel like
> people have explained to me the differences in all these things
> before, but then its easy to forget.

Well, each module's package.html has a start here, but I agree we
should do more.

I think what would be best is a smallish but feature-complete demo, ie
pull together some easy-to-understand sample content and then build a
small demo app around it.  We could then show how to use grouping for
field collapsing (and for other use cases), joining for nested docs
(and for other use cases), etc.

Mike McCandless

http://blog.mikemccandless.com
