lucene-dev mailing list archives

From Peter Mularien <pmular...@deploy.com>
Subject Re: Development plans for Lucene?
Date Thu, 31 Oct 2002 18:46:36 GMT
Walked away from my desk [yesterday!] and came back to lots of replies, 
so I'll try to address them all.

BG:

>>The reason why the project is not very active from the development
>>point of view is, in my opinion, in part because only 1 person knows
>>Lucene inside out, and that person is busy on other projects.  I am
>>talking about Doug Cutting.
>>The stability of the core is another reason, like Peter Carlson
>>mentioned.
>>
>One of the core values that made Lucene successful in the first place
>is its minimalism.  Lucene is very clean, compact, and amazingly
>small.  Doug has been, rightly so, very skeptical of new additions
>that are not part of the core functionality.  There are lots of good
>ideas floating around to make Lucene more useful (crawler, library of
>file-format handlers, etc), but I think that Doug and several other
>contributors would really rather see these _not_ be part of the Lucene
>core.  They are valuable, but peripheral to Lucene's core task --
>maintaining and searching the index.
>
>I think the "Doug is busy" argument basically boils down to "Doug
>thinks this project is mostly done."  And I would tend to agree.
>Could it be improved?  For sure.  Is it missing major pieces?  I don't
>think so.
>
>There are lots of great projects that have limited development
>activity -- like BCEL.  That's because they work great at what they do
>already.
>
This is an excellent point.

>>>... if we 
>>>were to take a poll, there are a lot of features outside of the core
>>>API that [potential] users would love to see implemented in the software.
>>>
>Depends who you polled!
>
Well, I guess what I am interested in is the fact that so many 
contributors are building projects on or around Lucene, yet none of that 
incredibly useful (to some) stuff ends up back in the core.

>>>Alternatively, I know there are some great projects in the sandbox
>>>and 
>>>contrib that would be great candidates for inclusion into Lucene
>>>proper.
>>>
>True, but I think some of these are better off as being their own
>projects or sub-projects, rather than part of the core.  It's more of a
>question of visibility.  One way to make them more visible/credible is
>to put them in the core, but that's not the only way to do so, and I'm 
>not sure that's the best way.
>
>I think many of these ideas would be better served by combining them
>into a Lucene-based application framework, combining finding files to
>index (crawling), digesting them (format stripping), etc.  It wouldn't
>have to be part of Lucene proper to be a very useful, flexible
>package.  In fact, the issues at the application level are very
>different from the issues at the core search level and call for a
>different set of skills and perspectives.
>
Right. The issues at the core search level are much more 
academic and must/should be confined to a few people. The issues at the 
application level could easily be solved and extended by building the 
kind of framework you advocate.

>I think that a good goal for the core would be to encourage developers
>to build searching frameworks around Lucene so that no developer
>actually has to use Lucene directly (not to suggest that Lucene is by
>any means unpleasant to use directly.)
>
Peter Carlson said:

> Just to expand on what you have already stated, Kelvin, Clemens and 
> myself are putting together an application framework proposal that we 
> are going to submit to the Lucene Dev list. The major idea is to 
> include the benefits of document acquisition and pipeline approach of 
> LARM, the indexing flexibility of Indyo and the integrated 
> search/sorting functionality of the SearchBean.
> There are a lot of issues to be worked out, but the main point is that 
> we can build on the core Lucene API to create a framework.

... and Clemens said:

>Otis, Kelvin and I have been discussing how we could leverage Lucene on a
>next level. We have some components in the sandbox (LARM Webcrawler, Indyo
>indexing framework) that have to be woven together with Lucene.
>
>This could end up in a real search engine server. I call it "Lucene Advanced
>Retrieval Machine" for myself right now, 'cause then I can stick with the
>LARM acronym... :-)
>

Great! Sounds like you two and Brian agree (almost) --

Brian said:

>>> Analyzers for more languages - Various people
>>
>I think this is a good example of where there's some differences in
>thinking about what should go in the core.  I think building a library
>of analyzers is great, similar to a library of file format decoders.
>Very useful, but I'm not sure any of the language-specific stuff
>(including the English-specific stuff, like the Porter stemmer) really
>belongs in the core.
>
>You see where I'm going here, don't you?  I want people to build on
>Lucene, I want them to have a place where people can find their work
>and use it, but I also want to keep Lucene small, focused, and
>compact, since I think those values are a big part of its success.
>
I think I'd agree more with Peter on this one, but I agree with you in 
principle. However, in keeping with the theme of allowing the core 
engine to be a "framework" upon which other applications could be built 
or the engine further extended, I can't imagine a search engine 
framework supplied without some of these (IMHO) critical components, 
such as language-specific analyzers. How would we define "core" vs. "non 
core" languages? We could supply a separate "language pack" or 
something, but the problem starts to become the complexity of 
installation. If we don't ship some of these things that people use all 
the time as core, then people will not take the time to evaluate Lucene 
and set it up.

An alternative would be to have some kind of smart packaging method that 
would allow easy addition on top of the "core" (read: language agnostic) 
engine. I'm thinking something along the lines of an RPM-like system 
where one could plug additional functionality onto the core. I think the 
point I'm trying to make is that I agree in a sense that the core should 
remain "pure"; however, the install process would need to be drastically 
improved -- most people won't care to download 10 projects out of the 
sandbox/contrib and figure out how to plug them in.

Peter C said:

> As far as core API level features, I think there are few that are 
> going to be added but some of the ones in the pipeline are
>
> Term position support - lead on by Dmitry Serebrennikov
> Analyzers for more languages - Various people
> Built-in support for a Date and Number Fields - Lead by Brian, help by 
> Peter
> Configurable QueryParser Syntax (i.e. default operator, min # 
> characters before wildcard, ...) - Still in design
>
> As far as real core information retrieval features, I don't know the 
> subject that well. 

These are examples of great additional features. Maybe there would be 
some way we could start tracking some of them and putting together some 
plans, after some needed discussion on the contention between core and 
add-on that Brian highlights above. Again, encouraging active 
development is my goal here: if people see a plan with obvious tasks 
that need accomplishing, they'll be more willing to help. Can we 
start by reformatting the TODO.txt that's in CVS, and plop that up on 
the site? I'd volunteer to take a first crack; then maybe some of the 
people that have been closer to the project than I (Otis, Brian, Peter, 
Clemens, etc.) can contribute additional things. Once we get a "wish 
list" of stuff to do, then it'll become easier to figure out the best 
way to get it all done.

> Common questions/issues to resolve:
> Ease of updating and adding documents to live index
> Better demo to show people how to use Lucene.
> More documentation

These also make good non-developer tasks.

Anyway, I hope I'm not stirring things up too much, but maybe if we can 
collect some of these ideas that have been kicking around, and some of 
the great contributions, we can figure out a way to get some forward 
momentum going on the project!

Thanks
Peter




