lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Paul Sondag" <jsond...@uiuc.edu>
Subject Re: Does Index have a Tokenizer Built into it
Date Mon, 16 Jul 2007 16:39:35 GMT
Some of the data sets that will be using have about 2 TB of data (90 million
web pages).  The Snippet I will be generating I would like to include the
words that are being queried, so I don't want to simply store the first 2 or
3 lines.  I have looked at the HighlighterTest and I do believe that it
requires the entire text of the document.  However, unlike the highlighter I
know where the termOffset in the document.

The input to my Snippet will be a vector of querywords and their offsets in
the document.  (not their position in the document).  I'm reading about the
option "term vectors" I can store while indexing my data.  It seems to be
much more efficient than storing the entire document, I'm just not sure if
the "term offset" is the same as a "token offset".  Here's what I'm reading
in case I'm totally off the ball here and this is useless to me:

http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors

It seems like this has all the information that I would have if I tokenized
the document anyways, or am I missing something?

Thanks again for all the help!

--JP




On 7/16/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
>
> Hello,
>
> > Ard,
> >
> > I do have access to the URL's of the documents, but because I
> > will be making
> > short snippets for many pages (suppose it had about 20 hits
> > per page and I
> > need to make Snippets for each of them) I was worried it would be
> > inefficient to open each "hit" tokenize it and then make the
> > Snippet, of
>
> Yes, getting all the documents over http just to get the snippet, for
> example the first 2 lines, is really bad for your performance in search
> overviews.
>
> Logically, what you want to show, you need to store in your index. For
> example, if for search hits you need to show the title and subtitle, just
> store these two in the index. If you want to have a google like highlighter
> of text snippets where the term occured, you need to store the entire text
> IIRC (see HighlighterTest in lucene).
>
> How many docs are you talking about that you cannot store the entire
> content?
>
> You could also just index the content and not store it, and in another
> lucene field, store the first 2 or 3 lines of  the document, which serve as
> text snippet. Making correct extracts of text snippets is very hard (see
> lingpipe for example)
>
> Regards Ard
>
> > course the price of this may be worth the price of the increased Index
> > size.  I have been looking into storing "Field Vectors with
> > positions" in
> > the index.  It seems that by doing this I will have access to
> > everything
> > that the Tokenizer is giving me correct?   Will I need to
> > store "term text"
> > in order to be able to access the actual term instead of
> > stemmed words?
> >
> > Thanks for all your help,
> >
> > --JP
> >
> > On 7/13/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
> > >
> > > Hello,
> > >
> > > > I'm wondering if after
> > > > opening the
> > > > index I can retrieve the Tokens (not the terms) of a
> > > > document, something
> > > > akin to IndexReader.Document(n).getTokenizer().
> > >
> > > It is obviously not possible to get the original tokens of
> > the document
> > > back when you haven't stored the document, because:
> > >
> > > 1) the analyzer might have removed stop words in the first place
> > > 2) the terms in lucene index are perhaps stemmed words /
> > synonyms / etc
> > > etc
> > > 3) how would you expect things like spaces, commas, dots etc to be
> > > restored?
> > >
> > > And, I think what you want does not comply with an inverted
> > index. When
> > > you do not store the document, you always loose information
> > about the
> > > document during indexing/analyzing
> > >
> > > How many documents are you talking about? They must be
> > either somewhere on
> > > FS or accessible over http...when you need the document,
> > why not just
> > > provide a link to the original location?
> > >
> > > Regards Ard
> > >
> > > >
> > > > In summary:
> > > >
> > > > My current ( too wasteful implementation is this)
> > > >
> > > > StandardTokenizer(BufferedReader (
> > > > IndexReader.Document(n).getField("text"
> > > > )  )
> > > >
> > > > I'm wondering if Lucene has a more efficient manner to
> > > > retrieve the tokens
> > > > of a document from an index.  Because it seems like it has
> > > > information about
> > > > every "term" already, Since you can get retrieve a
> > > > TermPositions object.
> > > >
> > > > Thanks,
> > > >
> > > >
> > > > --JP
> > > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message