lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: retrieve tokens
Date Wed, 22 Dec 2004 18:45:15 GMT
For I store the full text of web pages in Lucene, in order to
provide full-text web searches.  Nutch ( does the same.  You
can set the maximal number of tokens you want indexed via IndexWriter. 
You can also compress fields in the newest version of Lucene (or maybe
just the one in CVS), which may help you if you are considered about
disk space, although I wouldn't want to have to uncompress each hit's
200 pages worth of text in order to create a summary with KWIC. :)

Oh, and you asked about highlighter and field/query matching.  I
_think_ it won't help you with that, but I'm a bit behind on the
highlighter, so you should check the version in CVS and see if it's
capable of this.


--- "M. Smit" <> wrote:

> Erik Hatcher wrote:
> >
> > Highlighter does not mandate you store your text in the index.  It
> is 
> > just a convenient way to do it.  You're free to pull the text from 
> > anywhere and highlight it based on the query.
> >
> >> Furthermore, you are saying that the highlighter takes care of the
> >> corresponding field/words for me and pull up a context snippet?
> Ouch, 
> >> why haven't I stumpled upon the sandbox....
> >
> >
> > See a screenshot of it here: (going live
> > within a week!)
> >
> Oh bliss, Oh joy.. This is exactly what I'm looking for... I'll
> plunge 
> in to it and let you know!
> But for the other issue on 'store lucene' vs 'store db'. Does anyone
> can 
> provide me with some field experience on size?
> The system I'm developing will provide searching through some 2000 
> pdf's, say some 200 pages each. I feed the plain text into Lucene on
> a 
> Field.UnStored bases. I also store this plain text in the database
> for 
> the sole purpose of presenting a context snippet.
> If I were to use the Highlighter with a Field.Text, I will not use
> the 
> database plain part altogether. But still I'm a little worried about 
> speed/space issues. Or am I just seeing bears-on-the-road (Dutch
> saying, 
> in plain English: making a fuzz about nothing)..
>     Martijn
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message