lucene-java-user mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: Does Index have a Tokenizer Built into it
Date Mon, 16 Jul 2007 08:26:31 GMT
Hello,

> Ard,
> 
> I do have access to the URL's of the documents, but because I will be
> making short snippets for many pages (suppose it had about 20 hits per
> page and I need to make Snippets for each of them) I was worried it would
> be inefficient to open each "hit" tokenize it and then make the Snippet, of

Yes, fetching every document over HTTP just to build a snippet (for example, the first 2
lines) is really bad for performance in search result overviews.

Logically, whatever you want to show has to be stored in your index. For example, if each
search hit needs to show the title and subtitle, just store those two fields in the index.
If you want a Google-like highlighter that shows text snippets where the term occurred, you
need to store the entire text IIRC (see HighlighterTest in Lucene).
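
To illustrate why the stored text is needed for that kind of display, here is a minimal
sketch in plain Java. It is not Lucene's Highlighter, only a stand-in for the idea; the
class name StoredTextSnippet and the method snippet are hypothetical, and the sketch assumes
the full text has already been read back from a stored field as a String.

```java
class StoredTextSnippet {

    // Returns up to `context` characters on each side of the first
    // case-insensitive occurrence of `term`, with the match wrapped in [ ].
    // Returns null when the term is empty or does not occur in the text.
    // `storedText` is assumed to be the original text read from a stored field.
    static String snippet(String storedText, String term, int context) {
        if (term.isEmpty()) {
            return null;
        }
        int hit = -1;
        for (int i = 0; i + term.length() <= storedText.length(); i++) {
            if (storedText.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return null;
        }
        int end = hit + term.length();
        int from = Math.max(0, hit - context);
        int to = Math.min(storedText.length(), end + context);
        // Mark truncation on either side with "..." so the reader can see
        // that the snippet is only a window into the text.
        return (from > 0 ? "..." : "")
                + storedText.substring(from, hit)
                + "[" + storedText.substring(hit, end) + "]"
                + storedText.substring(end, to)
                + (to < storedText.length() ? "..." : "");
    }

    public static void main(String[] args) {
        String stored = "Lucene is a search library. It builds an inverted index of terms.";
        System.out.println(snippet(stored, "inverted", 10));
    }
}
```

The window is cut from the original characters, including case, spaces and punctuation,
which is exactly the information that indexed terms alone no longer carry.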

How many docs are you talking about that you cannot store the entire content? 

You could also index the content without storing it and, in a separate Lucene field, store
the first 2 or 3 lines of the document to serve as the text snippet. Extracting good text
snippets is very hard (see LingPipe, for example).
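
A rough sketch of the "first 2 or 3 lines" approach, in plain Java so it runs without
Lucene on the classpath. The class name LeadingLinesSnippet and the method firstLines are
hypothetical. The assumption is that the returned string goes into a stored field of its
own at indexing time, while the full content is indexed but not stored.

```java
import java.util.ArrayList;
import java.util.List;

class LeadingLinesSnippet {

    // Returns the first `maxLines` non-blank lines of `text`, trimmed and
    // joined by single spaces. Blank lines are skipped and do not count.
    static String firstLines(String text, int maxLines) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches any line break
            if (kept.size() >= maxLines) {
                break;
            }
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String content = "Title of the page\n\nFirst paragraph line.\nSecond line.\nMore text.";
        // Assumption: this value would be put in a stored, non-indexed field,
        // and `content` itself in an indexed, non-stored field.
        System.out.println(firstLines(content, 2));
    }
}
```

This keeps the index small, at the cost that the snippet is the same for every query and
does not necessarily contain the search term.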

Regards Ard

> course the price of this may be worth the price of the increased Index
> size.  I have been looking into storing "Field Vectors with positions" in
> the index.  It seems that by doing this I will have access to everything
> that the Tokenizer is giving me correct?   Will I need to store "term text"
> in order to be able to access the actual term instead of stemmed words?
> 
> Thanks for all your help,
> 
> --JP
> 
> On 7/13/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
> >
> > Hello,
> >
> > > I'm wondering if after opening the index I can retrieve the Tokens
> > > (not the terms) of a document, something akin to
> > > IndexReader.Document(n).getTokenizer().
> >
> > It is obviously not possible to get the original tokens of the document
> > back when you haven't stored the document, because:
> >
> > 1) the analyzer might have removed stop words in the first place
> > 2) the terms in the Lucene index are perhaps stemmed words / synonyms / etc.
> > 3) how would you expect things like spaces, commas, dots etc. to be
> > restored?
> >
> > And, I think what you want does not comply with an inverted index. When
> > you do not store the document, you always lose information about the
> > document during indexing/analyzing.
> >
> > How many documents are you talking about? They must be either somewhere on
> > the FS or accessible over http... when you need the document, why not just
> > provide a link to the original location?
> >
> > Regards Ard
> >
> > >
> > > In summary:
> > >
> > > My current (too wasteful) implementation is this:
> > >
> > > StandardTokenizer(BufferedReader(
> > > IndexReader.Document(n).getField("text")))
> > >
> > > I'm wondering if Lucene has a more efficient manner to retrieve the
> > > tokens of a document from an index, because it seems like it has
> > > information about every "term" already, since you can retrieve a
> > > TermPositions object.
> > >
> > > Thanks,
> > >
> > >
> > > --JP
> > >
> >
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

