lucene-java-user mailing list archives

From "John Paul Sondag" <jsond...@uiuc.edu>
Subject Re: Does Index have a Tokenizer Built into it
Date Tue, 17 Jul 2007 17:21:06 GMT
Hi,

I've been looking into indexing documents with term vectors that store
positions in order to solve my problem.  However, I've run into a bit of a
snag.  After indexing, I am able to retrieve the TermPositionVector from the
index and it has all of the data, but I cannot find a way, given a position,
to retrieve the term at that position, which is how I was hoping to create
my contextual snippets.

There are methods where, given a term, you can get its positions, but I see
no method to achieve the reverse.  Is there another class I need to use for
this?
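
The closest I've gotten is to invert the vector myself, roughly like this
(untested sketch; "text" is just my field name, and docId / indexDir are the
document number and index location I already have):

    IndexReader reader = IndexReader.open(indexDir);
    // cast assumes the field was indexed with term vectors + positions
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, "text");
    String[] terms = tpv.getTerms();
    Map posToTerm = new HashMap();   // position -> term text
    for (int i = 0; i < terms.length; i++) {
      int[] positions = tpv.getTermPositions(i);
      for (int j = 0; j < positions.length; j++) {
        posToTerm.put(new Integer(positions[j]), terms[i]);
      }
    }
    // posToTerm.get(new Integer(pos)) should then give the term at pos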

--JP

On 7/16/07, John Paul Sondag <jsondag2@uiuc.edu> wrote:
>
> Some of the data sets that I will be using have about 2 TB of data (90
> million web pages).  I would like the snippets I generate to include the
> words that are being queried, so I don't want to simply store the first 2
> or 3 lines.  I have looked at the HighlighterTest and I do believe that it
> requires the entire text of the document.  However, unlike the
> highlighter, I already know the term offsets in the document.
>
> The input to my snippet code will be a vector of query words and their
> offsets in the document (not their positions in the document).  I'm
> reading about the "term vectors" option I can enable while indexing my
> data.  It seems to be much more efficient than storing the entire
> document; I'm just not sure whether a "term offset" is the same thing as a
> "token offset".  Here's what I'm reading, in case I'm totally off base
> here and this is useless to me:
>
> http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors
>
> It seems like this has all the information that I would have if I
> tokenized the document anyway, or am I missing something?
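>
> In case it matters, this is roughly how I was planning to turn term
> vectors on while indexing (just a sketch; "text" is my field name, and
> content / writer are the page text and my IndexWriter):
>
>     Document doc = new Document();
>     doc.add(new Field("text", content,
>                       Field.Store.NO, Field.Index.TOKENIZED,
>                       Field.TermVector.WITH_POSITIONS_OFFSETS));
>     writer.addDocument(doc);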
>
> Thanks again for all the help!
>
> --JP
>
>
>
>
> On 7/16/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
> >
> > Hello,
> >
> > > Ard,
> > >
> > > I do have access to the URLs of the documents, but because I will be
> > > making short snippets for many pages (suppose there are about 20 hits
> > > per page and I need to make snippets for each of them), I was worried
> > > it would be inefficient to open each "hit", tokenize it, and then make
> > > the snippet; of
> >
> > Yes, getting all the documents over http just to get the snippet, for
> > example the first 2 lines, is really bad for your performance in search
> > overviews.
> >
> > Logically, whatever you want to show, you need to store in your index.
> > For example, if for search hits you need to show the title and subtitle,
> > just store these two fields in the index.  If you want a Google-like
> > highlighter with text snippets where the term occurred, you need to
> > store the entire text IIRC (see HighlighterTest in Lucene).
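> >
> > For example, storing could look something like this (field names are
> > just an illustration):
> >
> >     doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
> >     doc.add(new Field("snippet", firstLines, Field.Store.YES, Field.Index.NO));
> >     doc.add(new Field("text", fullText, Field.Store.NO, Field.Index.TOKENIZED));
> >
> > Only the stored fields can be shown in the result list; the "text" field
> > here is only searchable.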
> >
> > How many docs are you talking about, such that you cannot store the
> > entire content?
> >
> > You could also just index the content without storing it, and store the
> > first 2 or 3 lines of the document in another Lucene field to serve as
> > the text snippet.  Making good extracts for text snippets is very hard
> > (see LingPipe, for example).
> >
> > Regards Ard
> >
> > > course the price of doing that may be worth it compared to the price
> > > of an increased index size.  I have been looking into storing "field
> > > vectors with positions" in the index.  It seems that by doing this I
> > > will have access to everything that the Tokenizer is giving me,
> > > correct?  Will I need to store "term text" in order to be able to
> > > access the actual terms instead of the stemmed words?
> > >
> > > Thanks for all your help,
> > >
> > > --JP
> > >
> > > On 7/13/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
> > > >
> > > > Hello,
> > > >
> > > > > I'm wondering if, after opening the index, I can retrieve the
> > > > > tokens (not the terms) of a document, something akin to
> > > > > IndexReader.document(n).getTokenizer().
> > > >
> > > > It is obviously not possible to get the original tokens of the
> > > > document back when you haven't stored the document, because:
> > > >
> > > > 1) the analyzer might have removed stop words in the first place
> > > > 2) the terms in a Lucene index may be stemmed words, synonyms, etc.
> > > > 3) how would you expect things like spaces, commas, dots, etc. to be
> > > > restored?
> > > >
> > > > And I think what you want does not fit well with an inverted index.
> > > > When you do not store the document, you always lose information
> > > > about the document during indexing/analyzing.
> > > >
> > > > How many documents are you talking about?  They must be either
> > > > somewhere on the filesystem or accessible over http... when you need
> > > > the document, why not just provide a link to the original location?
> > > >
> > > > Regards Ard
> > > >
> > > > >
> > > > > In summary:
> > > > >
> > > > > My current (too wasteful) implementation is this:
> > > > >
> > > > > new StandardTokenizer(new StringReader(
> > > > >     IndexReader.document(n).get("text")))
> > > > >
> > > > > I'm wondering if Lucene has a more efficient way to retrieve the
> > > > > tokens of a document from an index, because it seems like it
> > > > > already has information about every "term", since you can retrieve
> > > > > a TermPositions object.
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > > --JP
> > > > >
> > > >
> > > >
> > >
> >
> >
>
