lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lebiram <lebi...@ymail.com>
Subject Re: Extract the text that was indexed
Date Tue, 30 Dec 2008 16:13:41 GMT
Hi All, 

Thanks for the reply. 

Just wanted to reconstruct a new index based on an existing index(but turning off norms) that's
all. 

However, as it is nearly impossible to extract the terms  of unstored fields, we might think
of other ways.

Thanks for the inputs guys! 




________________________________
From: Erick Erickson <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Sent: Tuesday, December 30, 2008 3:41:46 PM
Subject: Re: Extract the text that was indexed

Actually, you can reconstruct the text, but it's a lossy process. Stop words
aren't in the index for instance. And it's very time-consuming. Luke makes
a "best guess" at this process, so you might want to take a look at that
code. But even the very bright folks who put Luke together caution that
it's not a perfect process.

What do you want to accomplish by reconstructing the text? Perhaps there's
a better way....

I'd also add parenthetically that while storing text certainly bulks up your
index, it doesn't affect searching proportionately since searching is
accomplished with indexed terms only. It may, however, substantially
increase document loading time (i.e. IndexReader.document(i)), which
*can* affect search time if you load a bunch of documents as part of
your processing. But this can be handled with lazy loading if it's an issue.

Best
Erick

On Tue, Dec 30, 2008 at 9:40 AM, Greg Shackles <gshackles@gmail.com> wrote:

> That is my understanding of it too.  Terms in the index will point to the
> position of the tokens they map to.  Since one index term can point at any
> number of tokens, this isn't a sequence map, but just a search map.  If you
> still have the text that was indexed you could run it through an analyzer
> and observe the tokens as they go through.
>
> - Greg
>
> On Tue, Dec 30, 2008 at 7:31 AM, Alexander Aristov <
> alexander.aristov@gmail.com> wrote:
>
> > I am not sure but from my understanding fields that are only indexed and
> > not
> > stored do not keep position. So even if you get back all terms for a
> field
> > for a given document you won't be able to reconstruct original words
> > sequence.
> >
> > And remember that not all words are indexed.
> >
> > Alex
> >
> > 2008/12/30 Lebiram <lebiram@ymail.com>
> >
> > > Hi All,
> > >
> > > Is it possible to extract the text that was indexed but not stored for
> a
> > > field in a document?
> > >
> > > Right now, reader.document() returns only fields that was stored.
> However
> > > I'd also want to get the text on the indexed only field...
> > >
> > > I'd appreciate your help
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
>



      
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message