lucene-dev mailing list archives

From "Nadav Har'El" <>
Subject Re: Modularization
Date Wed, 01 Apr 2009 09:03:35 GMT
On Mon, Mar 30, 2009, Chris Hostetter wrote about "Re: Modularization":
> code isolation (by directory hierarchy) is hte best way i've seen to 
> ensure modularization, and protect against inadvertent dependency 
> bleeding.
> it's certainly possible to have "all" source code in a single directory 
> hierarchy, and then rely on the build system to ensure your don't 
> inwarranted dependencies, but that requires you do express rules in the 
> build system about what exactly the acceptible dependencies are, and it 
> relies on everyone using the buildsystem correctly (missguided users of 
> hand-holding IDEs could get very frustrated when the patches they submit 
> violate rules of an overly complicated set of ant build files)

In a project I've been involved in, we are building a library with concerns
similar to those Lucene now faces - on one hand you want to be a "kitchen sink"
providing features for everyone, but on the other hand you want to create
small jars so that people who need only a few features can pick
just those jars instead of one huge jar.

We've been doing this using just one source tree (like in Lucene), and
ensuring the separation using the build system instead. We did not, as you
suggest, find this too complicated to set up or maintain. The only snag, of
course, is that people who don't know how to write build.xml properly do
not touch it, just as people who don't know how to code properly in Java
do not touch our source code :-) Having a "hand-holding IDE"
is no replacement for knowing how to code, whether the code is Java source
code or Ant configuration.

The idea of the Ant-based approach is to have the Ant build script compile
each module's source separately, allowing it to refer only to pre-defined
dependencies. This is instead of the more usual approach of compiling all the
source code together (and thus allowing unwanted dependencies) and only
building the jars from the compiled classes at the very end.

For example, let's say that we want to build three JARs of three packages,
foo.A, foo.B, and foo.C. Let's say that foo.A is stand-alone (doesn't need
the other source code to compile), and foo.B depends on stuff from foo.A
(and must not depend on stuff from foo.C).

In that case, I would first create an Ant rule to build a jar from the sources
of foo.A, and them alone (which ensures that foo.A doesn't accidentally
depend on foo.B or foo.C). Note the "includes" argument to javac, and the
separate destdir:

        <target name="A.compile">
                <mkdir dir="${build.classes}/A"/>
                <!-- sourcepath="" stops javac from pulling in other sources under ${src} -->
                <javac srcdir="${src}" destdir="${build.classes}/A"
                        includes="foo/A/**/*.java" sourcepath="" listfiles="no"/>
        </target>

        <target name="A.jar" depends="A.compile">
                <mkdir dir="${build.jars}"/>
                <jar destfile="${build.jars}/A.jar" basedir="${build.classes}/A"/>
        </target>

Now, we do a similar thing for B.jar - when compiling it, we allow the
compiler to look at only the source code of foo.B, and at the previously
built A.jar. It cannot, for example, accidentally use stuff from foo.C:

        <target name="B.compile" depends="A.jar">
                <mkdir dir="${build.classes}/B"/>
                <javac srcdir="${src}" destdir="${build.classes}/B"
                        includes="foo/B/**/*.java" sourcepath="" listfiles="no">
                        <!-- the only thing B may compile against is A.jar -->
                        <classpath>
                                <pathelement location="${build.jars}/A.jar" />
                        </classpath>
                </javac>
        </target>

        <target name="B.jar" depends="B.compile">
                <mkdir dir="${build.jars}"/>
                <jar destfile="${build.jars}/B.jar" basedir="${build.classes}/B"/>
        </target>

Putting my money (or rather, time) where my mouth is: is there interest
in my writing a build script for Lucene to demonstrate these ideas
in action?

> FWIW: having lots/more of very small, isolated, hierarcies also wouldn't 
> hinder any attempts at having kitchen-sink or "essential" jars --
> combining the classes from lots of little isolated code trees is a lot 
> easier then extracting a few classes from one big code tree. 

But I think you've swept one issue under the rug: what happens when the
hierarchies aren't completely isolated? For example, an analyzer package
obviously depends on some Lucene core package. Or the query parser package
depends on the wildcard query package (for example). You need to specify these
dependencies somehow, and allow only them. How do you do that? Via an
Eclipse ".project" file in each of the small hierarchies? How is this any
better than having an Ant build file? How would anyone not using Eclipse
use this sort of setup?
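
With Ant, by contrast, the allowed dependencies can be written down once per
module, in plain sight. A rough sketch for the foo.A/foo.B example, and only
a sketch: the "compile-module" macro and the "*.deps" path ids are names I am
making up for illustration, not anything in Lucene's build. It assumes the
same ${src}, ${build.classes} and ${build.jars} properties as the targets
above:

```xml
<!-- Sketch only: "compile-module" and the "*.deps" ids are hypothetical names. -->

<!-- One path per module, listing the only jars it may compile against. -->
<path id="A.deps"/>                 <!-- A is stand-alone: no dependencies -->
<path id="B.deps">                  <!-- B may see A.jar and nothing else -->
        <pathelement location="${build.jars}/A.jar"/>
</path>

<!-- Compile one module in isolation and package it as its own jar. -->
<macrodef name="compile-module">
        <attribute name="module"/>
        <sequential>
                <mkdir dir="${build.classes}/@{module}"/>
                <javac srcdir="${src}" destdir="${build.classes}/@{module}"
                        includes="foo/@{module}/**/*.java" sourcepath=""
                        classpathref="@{module}.deps" listfiles="no"/>
                <mkdir dir="${build.jars}"/>
                <jar destfile="${build.jars}/@{module}.jar"
                        basedir="${build.classes}/@{module}"/>
        </sequential>
</macrodef>

<target name="A.jar">
        <compile-module module="A"/>
</target>

<target name="B.jar" depends="A.jar">
        <compile-module module="B"/>
</target>
```

Adding a module then means adding one path and one small target, and anyone
can read off what each module is allowed to depend on without opening an IDE.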

Another problem with your separate-source-hierarchies proposal is that it
requires some drastic changes to the source code tree. With my Ant-based
proposal, you don't need *any* change to the source code tree we have now
(heck, you can even keep the "contrib/" directory as is), you just need to
change one file - build.xml. Of course, if you discover unwanted dependencies
in the existing code (e.g., the indexing code accidentally depends on the
whitespace analyzer) you'll need to fix them.

> One underlying assumption that seems to have permiated the existing 
> discussion (without ever being explicitly stated) is the idea that most 
> currently lives in src/java is the "core" and would be a single "module" 
> ... personally i'd like to challege that assumption. 

I wholeheartedly agree.

> I'd like to suggest 
> that besides obvious things that could be refactored out into other 
> "modules" (span queries, queryparser) there are lots of additional ways 
> that src/java could be sliced...
>  - interfaces and abstract clases and concrete classes for reading an 
> index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader 
> but not MultiReader)

Interesting ideas. Thus one might use a RAMDirectory in an application,
and not incur the code size of FSDirectory.

However, at some point you have to wonder how fine-grained we want the division
to be. For example, if the FS-specific stuff only amounts to 20k of code
(and I'm just making this number up), how important is it to have a separate
jar for it? What do we lose (if anything) by having too many tiny jars?

> The crux of my point being that what we think of today as the lucene 
> "core" is actually kind of big and bloated, and already has *a* kitchen 
> sink thrown in -- it's just not neccessarily the kitchen sink many people 
> want.  

I agree.

> a big percentage of our users may want highlighting by default, and may 
> never care about function or span queries -- making it easier to get a 
> monolithic jar of *everything* only addresses one of those three 
> disconnects (easy access to the highlighting code) but splitting the 
> current "core" up into lots of little pieces (aka: "modules") that have 
> equal visibility to the existing contribs (now also "modules") would 
> address all three disconnects: people wouldn't overlook modules they might 
> want (like highlighting) because they are just as easy to find the "core" 
> and people wouldn't wind up with bloated jars containing a lot of code 
> they don't need. (beating a dead horse for a moment: this wouldn't 
> proclude us from offering a bloated jar containing everything under the 
> sun)

Again, I wholeheartedly agree.
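
With per-module jars built separately, the everything-under-the-sun jar is
cheap to add on top. A minimal sketch, assuming the A.jar and B.jar targets
from my example above; "all.jar" is a target and file name I am inventing
here for illustration:

```xml
<!-- Sketch only: "all.jar" is a hypothetical target and file name. -->
<target name="all.jar" depends="A.jar,B.jar">
        <jar destfile="${build.jars}/all.jar">
                <!-- zipfileset with src= copies the entries of an existing jar;
                     each module's own META-INF is skipped so that the jar task
                     writes a single fresh manifest -->
                <zipfileset src="${build.jars}/A.jar" excludes="META-INF/**"/>
                <zipfileset src="${build.jars}/B.jar" excludes="META-INF/**"/>
        </jar>
</target>
```

The modules are still compiled in isolation, so the big jar cannot hide an
unwanted dependency between them.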

Nadav Har'El                        |     Wednesday, Apr  1 2009, 7 Nisan 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |"A witty saying proves nothing." -- Voltaire
