lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Modularization
Date Tue, 31 Mar 2009 12:21:40 GMT
On Mon, Mar 30, 2009 at 7:31 PM, Chris Hostetter
<> wrote:

> code isolation (by directory hierarchy) is the best way i've seen to
> ensure modularization, and protect against inadvertent dependency
> bleeding.

OK I agree this (divorced top-level directories) is a great way to
enforce modularity and we should use that.

It seems the toplevel directory structure could still have subdirs,




And in those "leaf" subdirs above would be the package subdir
structure (src/{java,test}/org/apache/lucene/...).

Though "svn checkout" and "svn update" and "svn diff" are going to
take quite a bit longer with this switch...

> One underlying assumption that seems to have permeated the existing
> discussion (without ever being explicitly stated) is the idea that
> most of what currently lives in src/java is the "core" and would be a
> single "module" ... personally i'd like to challenge that assumption.  I'd
> like to suggest that besides obvious things that could be refactored
> out into other "modules" (span queries, queryparser) there are lots
> of additional ways that src/java could be sliced...

+1: I very much agree what is now called "core" should be refactored
as a number of modules.

So the general new proposal here seems to be: let's break up src/java/*
into separate modules (each under its own toplevel directory), just
like contrib/* is today.

And move Lucene to an "a la carte" model for what we now call core.
(what we now call contrib is already "a la carte" today).

We would then do away with the top level "core" vs "contrib", and
everything would simply be "modules", where each module has
metadata/javadocs stating:

  * JRE version required

  * What external dependencies (including dependencies to other Lucene
    modules) are needed

  * Some measure of "maturity"

  * Back-compat policy
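
As a rough illustration only, the four items above could be captured in a
small per-module descriptor. This is a sketch under assumptions: the names
ModuleInfo, Maturity and "analysis-common", the maturity levels, and the
sample values are all hypothetical, and it uses Java records (Java 16+)
purely for brevity.

```java
import java.util.List;

public class ModuleMetadataSketch {
    // Hypothetical maturity levels; the thread does not define a scale.
    enum Maturity { EXPERIMENTAL, STABLE }

    // One descriptor per module, carrying the four items listed above.
    record ModuleInfo(String name,
                      String minJre,              // JRE version required
                      List<String> dependencies,  // external and Lucene-module deps
                      Maturity maturity,          // some measure of "maturity"
                      String backCompatPolicy) {} // back-compat policy, as free text

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        ModuleInfo thai = new ModuleInfo(
            "thai-analysis", "1.4", List.of("analysis-common"),
            Maturity.EXPERIMENTAL, "may change between minor releases");
        System.out.println(thai.name() + " requires JRE " + thai.minJre()
            + " and depends on " + thai.dependencies());
    }
}
```

The same information could just as well live in a properties file or in the
module's package javadocs; the point is only that every module answers the
same four questions.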


Then during build we can package up certain combinations.  I think
there should be sub-kitchen-sink jars by area, eg a jar that contains
all analyzers/tokenstreams/filters, all queries/filters, etc.
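
To make the "jars by area" idea concrete, here is a minimal sketch of
grouping modules into aggregate jars. It is not how the build works today:
LuceneModule, jarsByArea, the area labels, and the "span-queries" name are
assumptions for illustration, and a real build would do this in the build
scripts rather than in Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JarGroupingSketch {
    // Hypothetical: each module declares the area it belongs to.
    record LuceneModule(String name, String area) {}

    // Group module names by area; each key would become one aggregate jar.
    static Map<String, List<String>> jarsByArea(List<LuceneModule> modules) {
        Map<String, List<String>> jars = new TreeMap<>();
        for (LuceneModule m : modules) {
            jars.computeIfAbsent(m.area(), k -> new ArrayList<>()).add(m.name());
        }
        return jars;
    }

    public static void main(String[] args) {
        List<LuceneModule> modules = List.of(
            new LuceneModule("thai-analysis", "analyzers"),
            new LuceneModule("snowball", "analyzers"),
            new LuceneModule("span-queries", "queries"));
        // One entry per area jar, listing the modules it would bundle.
        System.out.println(jarsByArea(modules));
    }
}
```

The kitchen-sink jar is then just the union of all the area groups.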

This does make the future decision process far easier.  Rather than
have a capricious and ill-defined "does it go into core vs contrib"
question, we now simply decide if it goes into an existing module or
makes a new one.

> Even without making radical changes to the way our source code is
> organized, a lot of improvements could be made by having better
> documentation.

Agreed. I think this is actually somewhat orthogonal, though should
follow more naturally once Lucene is simply a collection of modules.
I would think we present "all" and "per-module" sets of javadocs,
plus javadocs aggregated based on how the JARs aggregate?  (Ie I could
browse the "kitchen-sink" javadocs, the "all analyzers" javadocs, or
the "thai analyzers only" javadocs).

> (ie: a new ThaiStemmerFilter could be added to an existing
> thai-analysis module)

So, how would you refactor the various sources of
analyzers/tokenstream/tokenfilters we have today
(src/java/org/apache/lucene/analysis/*, contrib/snowball/*,
contrib/collation/* and contrib/analyzers/*)?  (Even contrib/memory
has a neat PatternAnalyzer, which operates on a string using a regexp
to get tokens out, and which I am only now discovering).

We also need to think about how this impacts our back-compat policy.
EG when are we allowed to split up modules into sub-modules, or merge

Assuming there's general consensus on this "break core into modules"
approach, I think the next step is to take an inventory of all of
Lucene's classes and roughly divide them into proposed modules, and
iterate on that?  Hoss, do you want to take a first stab at that?

