lucene-java-user mailing list archives

From "Steven J. Owens" <>
Subject Re: Lucene features
Date Thu, 04 Sep 2003 07:02:27 GMT
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> Lucene Users List <>
> > > I am wondering if Lucene is the way to go for my project.
> >      Probably.  Tell us a little about your project.
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> in size. They don't ever change, and are on a CD-ROM. Each file contains a
> bunch of small documents. I just create one index for all 4 of them. These
> documents are for an association that I belong to - they contain a history
> of the association's documents - and my application allows you to search
> them.

     Well, aside from your concerns about the second list, Lucene
seems perfect for your needs.  You'd parse apart the four big files
into a bunch of small documents, then parse those small documents and
create Lucene Documents, containing Fields, and add them to the index.
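
     As a rough illustration of the first step, here is a minimal
sketch in plain Java that splits one big file into small documents.
It assumes the small documents are separated by a line containing only
a delimiter such as "=====", which is a made-up convention; your files
will have their own structure.  The class and method names are
hypothetical, and the Lucene side (wrapping each chunk in a Document
with Fields and adding it to the index) is only marked by a comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one big text stream into small documents.
class BigFileSplitter {

    // Assumption: a line equal to `delimiter` (after trimming) separates documents.
    static List<String> split(Reader source, String delimiter) {
        List<String> docs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().equals(delimiter)) {
                    flush(current, docs);
                } else {
                    current.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, docs); // the last document may have no trailing delimiter
        return docs;
    }

    // Adds the buffered text as one document, skipping empty chunks.
    private static void flush(StringBuilder current, List<String> docs) {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            docs.add(text);
            // Here you would build a Lucene Document from `text` and add it to the index.
        }
        current.setLength(0);
    }
}
```

     For 100MB inputs you would probably index each chunk as soon as
it is found rather than collect them all in a list; the list just
keeps the sketch short.
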
> They are actually currently indexed by an application called
> 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> user interface - blech - so I decided to write a new interface for
> my own use. But Sonar costs some real bucks to be able to develop
> against their search API, so I found Lucene, and decided to go with
> it.
> Here are the search features that 'Sonar' has :
>   Boolean Searching
>   Proximity Searching
>   Wild Card Searching
>   Field/Block Searching

     I'm not sure what Field/Block means.  Boolean, proximity and
wildcard searches are all pretty typical in Lucene.  You should
probably take a look at the Query Parser syntax docs:

>   Relevancy Ranking / Date Ranking

     Lucene search results are typically ranked by relevance, and you
can tweak the search to adjust this (there's a fair bit of discussion
of this in the lucene-user archives; good keywords to look for are
"slop" and "boost").

     Sorting output by date might take some finesse.  I haven't played
with sorting by date, but I'd expect to handle that by directly
instantiating a query object (a TermQuery or similar) rather than
going through the query parser, to express the date constraints.
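
     One way to sidestep the question, sketched below in plain Java
with no Lucene calls: run the search as usual, copy each hit's title,
date and score into a small object, and sort that list yourself.  The
Hit class and its fields are hypothetical, and the sketch assumes each
document stores its date as a "yyyyMMdd" string, so that plain string
comparison matches chronological order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one search hit, copied out of the search results.
class Hit {
    final String title;
    final String date;   // assumption: stored as "yyyyMMdd", so it sorts as text
    final float score;   // relevance score reported by the search

    Hit(String title, String date, float score) {
        this.title = title;
        this.date = date;
        this.score = score;
    }
}

class DateSort {
    // Returns a new list ordered newest first; on equal dates the more relevant hit wins.
    static List<Hit> newestFirst(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Hit h) -> h.date).reversed()
                .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return sorted;
    }
}
```

     This only works if the result set is small enough to hold in
memory, which seems likely for a fixed collection on a CD-ROM.
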

>   List of Occurrences in Context

     I assume here that you mean displaying each result with a little
snippet of the text around the match.  There have been discussions
about how best to do this (often focused on highlighting the search
terms in the displayed text) on the lucene-users list.  Check the list
archive.
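
     To make the idea concrete, here is a minimal keyword-in-context
sketch in plain Java.  It is not taken from the list archive or from
Lucene itself; the class name, the "**" markers and the fixed
character window are all assumptions, and it does a naive
case-insensitive substring match on the stored text rather than
reusing whatever analyzer built the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows each occurrence of a term with some surrounding text.
class ContextSnippets {

    // Returns one snippet per occurrence, with up to `window` characters on
    // each side and the matched text wrapped in ** markers.
    static List<String> find(String text, String term, int window) {
        List<String> snippets = new ArrayList<>();
        if (term.isEmpty()) {
            return snippets;
        }
        int pos = 0;
        while (pos + term.length() <= text.length()) {
            // regionMatches with ignoreCase=true compares without copying the text
            if (text.regionMatches(true, pos, term, 0, term.length())) {
                int end = pos + term.length();
                int from = Math.max(0, pos - window);
                int to = Math.min(text.length(), end + window);
                snippets.add(text.substring(from, pos)
                        + "**" + text.substring(pos, end) + "**"
                        + text.substring(end, to));
                pos = end; // continue after this match
            } else {
                pos++;
            }
        }
        return snippets;
    }
}
```
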
>   Phonetic Searching

     I'd guess you need to build this one yourself, perhaps by using a
soundex algorithm when indexing the original data files.
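
     For reference, a sketch of the classic American Soundex rules in
plain Java.  Nothing here comes from Lucene; the class name is
hypothetical, and the idea would be to run each word through encode()
at index time and store the codes in an extra field (and do the same
to the query terms at search time).

```java
// Hypothetical helper: classic American Soundex, e.g. to fill a "phonetic" field.
class SimpleSoundex {

    // Digit for each letter A..Z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    // Returns a four-character code such as "R163", or "" if `word` has no letters.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // skip anything that is not a plain ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
            } else if (code != '0' && code != lastCode) {
                out.append(code); // new consonant sound
            }
            if (c != 'H' && c != 'W') {
                lastCode = code; // H and W do not separate letters with equal codes
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

     With this, "Robert" and "Rupert" both encode to "R163", so a
search on one would find the other.
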

>   Synonyms/Concepts

     Likewise... you'd need to come up with some sort of ontology of
synonyms and concepts, then parse the fields you're indexing and
generate a synonym/concept field that you'd add to the Lucene
Document.
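
     A minimal sketch of that in plain Java, assuming a small
hand-built table of words and their concept terms.  The class, its
methods and the table contents are all hypothetical; the string it
returns is what you would index as the extra field.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: builds the text of an extra "synonyms" field for one document.
class SynonymField {

    // Assumption: a hand-built table mapping a lower-case word to its concept terms.
    private final Map<String, String[]> table = new HashMap<>();

    void add(String word, String... concepts) {
        table.put(word.toLowerCase(Locale.ROOT), concepts);
    }

    // Returns the space-separated concept terms for every known word in `text`.
    String expand(String text) {
        Set<String> found = new LinkedHashSet<>(); // first-seen order, no repeats
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            String[] concepts = table.get(token);
            if (concepts != null) {
                for (String concept : concepts) {
                    found.add(concept);
                }
            }
        }
        return String.join(" ", found);
    }
}
```

     Building the table is the hard part; the code that applies it is
the easy bit.
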

>   Relational Searching
>   Associated Words
>   Drill Down Search Narrowing

     I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it
> stack up against the second group ?

     I'm afraid I haven't been too helpful here.  Perhaps if you
clarify what the above mean, folks can post about how to implement
them in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming,
> and so have invested quite a bit of time into this project. But I'm
> seeing the end of the tunnel, and want to make sure that I'm going
> down the right path before I spend too much more time on it.

     It sounds like you ought to at least seriously consider using
Lucene, if you can find or implement equivalent features, or decide
you can live without them.

Steven J. Owens

"I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt." - Me at
