lucene-dev mailing list archives

From Chris Hostetter
Subject Re: Modularization
Date Mon, 30 Mar 2009 23:31:09 GMT

After stirring things up, and then being off-list for ~10 days, I'm in an 
interesting position coming back to this thread and seeing the discussion 
*after* it essentially ended, with a lot of semi-consensus but no clear 
sense of a hard and fast resolution or plan of action.

FWIW, here are the notes i made while reading the thread about the 
various sentiments i noticed expressed (whether i agree with them or 
not) in order to try and get a handle on what had been discussed.  
Some of these were the opinion of a single person and i've paraphrased, 
others are my generalization of similar comments made by various 
people:
- contrib has a bad rap
- widely varying degrees of quality/stability in contrib code, hard to get 
people to rely on the "good" ones because of the "less good" ones
- many people want a good, out of the box, kitchen sink experience (ie: 
one monolithic jar containing all the "essentials")
- need easy discoverability of all things of a given type (ie: all 
queries, all filters, all analyzers, etc...) .. ie: combined javadocs.
- need easy installation of all things of a given type (ie: a jar 
containing all types of queries, a jar containing all types of analyzers, 
etc...)
- still need to deal with contribs that have external dependencies
- still need to deal with contribs that require future versions of the 
language (Java1.7 when core is still 1.5 compat)
- users need better guidance about "why" something is a contrib 
(additional functionality, alternate functionality, example of use, tool, 
etc...)
- while we should maintain/increase modularization, documentation should 
make features of contribs more prominent without stressing the isolation 
resulting from code modularization.
- we should merge all contrib & core code into a unified src/ tree, and 
make the packaging independent of the physical location in svn (ie: jars 
based on java package, not directory)

While I'm mostly in favor of all of these sentiments, and think it's 
really just a question of how to go about it, the last one is actually 
something i'm pretty strongly opposed to -- I think the best way forward 
is to have lots of small, well isolated source trees.

code isolation (by directory hierarchy) is the best way i've seen to 
ensure modularization, and protect against inadvertent dependency 
bleeding.  If we want to be able to produce small jars targeted at 
specific goals, and we want FooClass to be in foo.jar and the bar classes 
to be in bar.jar, then we shouldn't keep both in one source tree as 
src/java/o/l/a/foo/ and src/java/o/l/a/bar/ -- 
doing so makes it way too easy for inadvertent dependencies to crop up 
that make FooClass depend on a bar class, and thus make it impossible to 
use foo.jar without also using bar.jar at runtime.

it's certainly possible to have "all" source code in a single directory 
hierarchy, and then rely on the build system to ensure you don't 
introduce unwarranted dependencies, but that requires you to express 
rules in the build system about exactly what the acceptable dependencies 
are, and it relies on everyone using the build system correctly 
(misguided users of hand-holding IDEs could get very frustrated when the 
patches they submit violate the rules of an overly complicated set of ant 
build files)
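
To make "express rules in the build system" a bit more concrete, here is 
a minimal Python sketch of the kind of check a single-tree build would 
need.  It is not taken from the Lucene build: the ALLOWED table, the 
BarClass name and the reference pairs are all hypothetical, and a real 
check would have to extract the references from compiled classes.

```python
# Hypothetical sketch: none of these names come from the real Lucene build.
# ALLOWED maps each module to the set of other modules it may depend on.
ALLOWED = {
    "foo": set(),      # foo.jar must be usable on its own
    "bar": {"foo"},    # bar.jar is allowed to use foo.jar
}

def module_of(class_name):
    """Return the module of a name like 'o.l.a.foo.FooClass'.

    Assumes the module is the package segment right before the class name.
    """
    return class_name.split(".")[-2]

def find_violations(references, allowed=ALLOWED):
    """Return the (from_class, to_class) pairs that cross a module
    boundary the rules do not permit."""
    violations = []
    for src, dst in references:
        src_mod, dst_mod = module_of(src), module_of(dst)
        if src_mod != dst_mod and dst_mod not in allowed.get(src_mod, set()):
            violations.append((src, dst))
    return violations

refs = [
    ("o.l.a.bar.BarClass", "o.l.a.foo.FooClass"),  # permitted: bar -> foo
    ("o.l.a.foo.FooClass", "o.l.a.bar.BarClass"),  # the bleed: foo -> bar
]
print(find_violations(refs))
```

With isolated source trees none of this is needed: the foo tree simply 
does not have the bar classes on its compile path, so the second 
reference would fail to compile.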

FWIW: having lots/more of very small, isolated, hierarchies also wouldn't 
hinder any attempts at having kitchen-sink or "essential" jars --
combining the classes from lots of little isolated code trees is a lot 
easier than extracting a few classes from one big code tree.
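
As a rough illustration of how cheap the "combine" direction is, the 
following Python sketch merges several small jars into one.  It assumes 
nothing about the real build (which would presumably do this in ant); 
the jar names, entry names and the "first duplicate wins" policy are all 
assumptions made for the example.

```python
import os
import tempfile
import zipfile

def merge_jars(jar_paths, out_path):
    """Copy every entry of each input jar into a single output jar.

    If an entry name appears more than once (e.g. META-INF/MANIFEST.MF),
    the first occurrence wins.  Returns the sorted list of entry names.
    """
    seen = set()
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
        for path in jar_paths:
            with zipfile.ZipFile(path) as jar:
                for info in jar.infolist():
                    if info.filename in seen:
                        continue
                    seen.add(info.filename)
                    out.writestr(info, jar.read(info.filename))
    return sorted(seen)

def make_jar(path, entries):
    """Build a tiny stand-in jar from a {name: bytes} dict (demo only)."""
    with zipfile.ZipFile(path, "w") as jar:
        for name, data in entries.items():
            jar.writestr(name, data)

# Two hypothetical module jars, merged into one kitchen-sink jar.
tmp = tempfile.mkdtemp()
foo = os.path.join(tmp, "foo.jar")
bar = os.path.join(tmp, "bar.jar")
sink = os.path.join(tmp, "kitchen-sink.jar")
make_jar(foo, {"o/l/a/foo/FooClass.class": b"foo"})
make_jar(bar, {"o/l/a/bar/BarClass.class": b"bar"})
merged = merge_jars([foo, bar], sink)
print(merged)
```

Going the other way (pulling a self-contained foo.jar out of one big 
tree) requires knowing which classes FooClass transitively needs, which 
is exactly the dependency analysis the isolated trees avoid.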

One underlying assumption that seems to have permeated the existing 
discussion (without ever being explicitly stated) is the idea that most 
of what currently lives in src/java is the "core" and would be a single 
"module" ... personally i'd like to challenge that assumption.  I'd like 
to suggest that besides obvious things that could be refactored out into 
other "modules" (span queries, queryparser) there are lots of additional 
ways that src/java could be sliced...

 - interfaces and abstract classes and concrete classes for reading an 
index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader 
but not MultiReader)
 - ditto for creating/updating an index in one index-update.jar (ie: 
IndexWriter, TokenStream, Tokenizer, TokenFilter, Analyzer  but 
not any impls of the last 3)
 - ditto for searching in index-search.jar (ie: Searcher, Searchable, 
HitCollector, Query ... but not any concrete subclasses)
 - simple-analysis.jar (SimpleAnalyzer, WhitespaceAnalyzer, 
LetterTokenizer, LowerCaseFilter, etc...)
 - english-analysis.jar (StandardAnalyzer, etc...)
 - primitive-queries.jar (TermQuery, BooleanQuery, MatchAllDocsQuery, 
MultiTermQuery, etc...)
 - range-queries.jar (RangeQuery, RangeFilter, ConstantScoreRangeQuery)


The crux of my point being that what we think of today as the lucene 
"core" is actually kind of big and bloated, and already has *a* kitchen 
sink thrown in -- it's just not necessarily the kitchen sink many people 
want.
a big percentage of our users may want highlighting by default, and may 
never care about function or span queries -- making it easier to get a 
monolithic jar of *everything* only addresses one of those three 
disconnects (easy access to the highlighting code) but splitting the 
current "core" up into lots of little pieces (aka: "modules") that have 
equal visibility to the existing contribs (now also "modules") would 
address all three disconnects: people wouldn't overlook modules they might 
want (like highlighting) because they are just as easy to find as the 
"core", and people wouldn't wind up with bloated jars containing a lot of 
code they don't need. (beating a dead horse for a moment: this wouldn't 
preclude us from offering a bloated jar containing everything under the 
sun.)
Even without making radical changes to the way our source code is 
organized, a lot of improvements could be made by having better 
documentation ... could certainly 
have more info about what is included in a release, what types of things 
can be found in a contrib, etc...  Individual contrib README files should 
certainly get beefed up to describe their purpose, their level of 
maturity, and their back compat commitments.  The demo and getting 
started guides could also be expanded to reference the contrib jars that 
contain code many people may want to reuse...

   ...and those are all small improvements that could be made without 
radically changing anything about our source organization or packaging.  
splitting the core up into smaller modules would only help the situation, 
moving more things into the core seems like it would just make the 
problem worse.
: I agree, but at least we need some clear criteria so the future
: decision process is more straightforward.  Towards that... it seems
: like there are good reasons why something should be put into contrib:

I would argue that is approaching the problem from the wrong direction.  

assume for the moment that we define the list of lucene "modules" as:
   ls -d contrib/* src/java src/gcj src/demo src/jsp
...but in the future we want to split up some of the bigger "modules" and 
move each module so they have equal visibility.

i would suggest that the operating assumption be that any new code 
contribution that adds functionality (ie: not a bug fix, or an 
enhancement to an existing Impl) belongs in a new "module" unless:
 1) compilation constraints require that it be put in an existing module 
(ie: needs to introduce a bi-directional dependency with an existing 
class which can't be refactored out into the new module)
 2) it is a natural conceptual fit with *all* of the existing classes in 
that module (ie: a new ThaiStemmerFilter could be added to an existing 
thai-analysis module)
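
Spelled out as a decision procedure, the rule above might look like the 
following Python sketch.  The function, its parameter names and the 
"NEW MODULE" marker are purely illustrative; deciding whether either 
exception actually applies is still a human judgement call.

```python
NEW_MODULE = "NEW MODULE"  # marker meaning "create a fresh module"

def choose_module(kind, touched_module=None,
                  compile_bound_module=None, natural_fit_module=None):
    """Pick a home for a contribution.

    kind: "bugfix", "enhancement" (to an existing impl) or "new"
          (adds functionality).
    touched_module: module a bug fix / enhancement already lives in.
    compile_bound_module: existing module the code is forced into by an
          unavoidable bi-directional dependency (exception 1), if any.
    natural_fit_module: existing module where the code is a conceptual
          fit with *all* existing classes (exception 2), if any.
    """
    if kind != "new":
        # Bug fixes and enhancements stay where the code already is.
        return touched_module
    if compile_bound_module is not None:
        return compile_bound_module
    if natural_fit_module is not None:
        return natural_fit_module
    # Default operating assumption: new functionality gets a new module.
    return NEW_MODULE

print(choose_module("new", natural_fit_module="thai-analysis"))
print(choose_module("new"))
```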

(but equally important to the question of "when to add to an existing 
'module' vs creating a new module?" should be the question of "when to 
split an existing module?" ... something we've never really talked about 
for core or contribs.)

: But I don't think "it doesn't have to be in core" (the "software
: modularity" goal) is the right reason to put something in contrib.

Would it sound like a better reason if we stopped calling it "core"? ... 
i look at it from the point of view of: "Are classes A,B&C (which are 
tightly coupled) directly related to classes X,Y&Z (also tightly 
coupled)?"
... if the answer is "no" then A,B&C do not belong in the same module as 
X,Y&Z ... it doesn't matter which module we're talking about (src/java, 
contrib/highlighter etc...)
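
One way to picture that test is as grouping classes by their coupling 
relationships.  The Python sketch below is only an illustration under a 
strong assumption of mine (that "directly related" can be treated as 
transitive, so modules are the connected groups of coupled classes); the 
class names and coupling pairs are placeholders.

```python
def group_into_modules(classes, coupled_pairs):
    """Group classes so that coupled classes share a module and classes
    with no coupling path between them end up in different modules."""
    neighbours = {c: set() for c in classes}
    for a, b in coupled_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    modules, seen = [], set()
    for start in classes:
        if start in seen:
            continue
        # Walk everything reachable from this class via coupling links.
        seen.add(start)
        stack, group = [start], []
        while stack:
            current = stack.pop()
            group.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        modules.append(sorted(group))
    return modules

classes = ["A", "B", "C", "X", "Y", "Z"]
coupled = [("A", "B"), ("B", "C"), ("X", "Y"), ("Y", "Z")]
print(group_into_modules(classes, coupled))
```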

i don't think it makes any sense for the TrieRangeQueries to be in the 
same "module" as IndexWriter, or IndexReader ... but i also don't think it 
makes sense for the trie code to be in the same module as BoostingQuery or 
DuplicateFilter -- or for IndexWriter to be in the same module as the 
existing query parser (or for the existing query parser to be in the same 
module as the new one the IBM folks have been working on)

we can have fine grained modularity w/o having second class citizens, and 
we can achieve it without needing to make radical changes -- but putting 
more stuff into "core" isn't going to help us get there.

