lucene-java-user mailing list archives

From "Jason Polites" <jason.poli...@gmail.com>
Subject Re: updating document
Date Fri, 11 Aug 2006 06:23:31 GMT
Unfortunately yes.  It doesn't really have anything to do with the way you
access the index (I don't think).  The fact is that the data is simply not
in the document.  When you add the document again it is effectively
"re-indexed", so if the raw data of the field is empty, then it won't be
indexed.
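
To make that concrete, here is a toy model in plain Java. It is not the Lucene API: the ToyIndex and ToyField names are invented for illustration, and the "index" is just a pair of maps. It only shows the behaviour described above, assuming that a retrieved document carries its stored fields and nothing else.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: this is NOT the Lucene API. It mimics one behaviour:
// an indexed-but-not-stored field is searchable, but its raw text cannot
// be read back from a retrieved document.
class ToyIndex {
    /** A field as supplied at indexing time (hypothetical type). */
    static final class ToyField {
        final String name;
        final String text;
        final boolean store;

        ToyField(String name, String text, boolean store) {
            this.name = name;
            this.text = text;
            this.store = store;
        }
    }

    // "field:term" -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> values of the fields that were added with store=true
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();
    private int nextId = 0;

    /** Indexes every field; keeps the raw text only for stored fields. */
    int add(List<ToyField> fields) {
        int id = nextId++;
        Map<String, String> kept = new HashMap<>();
        for (ToyField f : fields) {
            for (String term : f.text.toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(f.name + ":" + term, k -> new HashSet<>()).add(id);
                }
            }
            if (f.store) {
                kept.put(f.name, f.text);
            }
        }
        stored.put(id, kept);
        return id;
    }

    /** Removes the document from the stored values and from all postings. */
    void delete(int id) {
        stored.remove(id);
        for (Set<Integer> ids : postings.values()) {
            ids.remove(id);
        }
    }

    /** Returns only the stored fields; unstored text is not recoverable. */
    List<ToyField> retrieve(int id) {
        List<ToyField> out = new ArrayList<>();
        for (Map.Entry<String, String> e : stored.get(id).entrySet()) {
            out.add(new ToyField(e.getKey(), e.getValue(), true));
        }
        return out;
    }

    boolean matches(String field, String term) {
        Set<Integer> ids = postings.get(field + ":" + term.toLowerCase());
        return ids != null && !ids.isEmpty();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.add(Arrays.asList(
                new ToyField("title", "Quarterly report", true),
                new ToyField("body", "confidential revenue figures", false)));
        System.out.println(index.matches("body", "revenue"));  // true

        // The update pattern from this thread: read, delete, add a field, re-add.
        List<ToyField> copy = index.retrieve(id);
        index.delete(id);
        copy.add(new ToyField("owner", "deepan", true));
        index.add(copy);

        System.out.println(index.matches("body", "revenue"));  // false
        System.out.println(index.matches("title", "report"));  // true
    }
}
```

The stored "title" field survives the round trip, while the unstored "body" field is gone from the copy, so it is never indexed the second time.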

I'm in no way a Lucene expert, but this has been my experience.

In my case, I too have important data which I need to preserve during
updates, but it raises an important architecture point.  If you are not
storing the data, then you obviously don't need it for display (no doubt you
have it elsewhere when it is needed for display).  Storing the data simply
for the purposes of preserving it during updates will work, but has the
secondary effect of slowing down searches.  More specifically, slowing down
the retrieval of Hits.  When you call hits.doc(n), the stored values of
fields in the document will be loaded into memory.  The larger the amount of
data stored, the larger the memory consumption.  If you are compressing the
field (using the Field.Store.COMPRESS option) you will see even slower
performance as the field data needs to be decompressed before it is
retrieved.

There are apparently plans to introduce lazy loading of fields in subsequent
releases, which will be a great feature and will solve some of this, but you
still have the issue of needing to store data in the index which doesn't
really need to be there.

In my case, I elected to store this data elsewhere.  So, I have a "document"
(actually an email) which may contain large amounts of text which I want to
search on, but don't need to display from the index.  I index this text, but
don't store it.  When I want to update a document, I retrieve this text from
wherever I put it, and re-insert it into the lucene document.

It's effectively the same process, but it just alleviates the burden of
maintaining this data within the index.  Of course you then have to maintain
two sets of information, but the benefits will probably outweigh the costs.
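
As a rough sketch of that update step, again in plain Java rather than Lucene code: plain maps stand in for the stored fields read back from the index and for the external store, and the class and method names (UpdateWithSideStore, fieldsForReindex) are made up for this example. The point is only that the field set you re-index is assembled from three sources.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: maps stand in for the index's stored fields and for the
// external store. None of these names come from Lucene.
class UpdateWithSideStore {
    /**
     * Builds the complete field set to re-index for one document:
     * the stored fields read back from the index, plus the unstored text
     * fetched from the external store, plus the new fields being added.
     */
    static Map<String, String> fieldsForReindex(
            String docId,
            Map<String, String> storedFields,
            Map<String, Map<String, String>> externalStore,
            Map<String, String> newFields) {
        Map<String, String> unstored = externalStore.get(docId);
        if (unstored == null) {
            // Failing loudly is safer than silently re-indexing a partial document.
            throw new IllegalStateException("No external text for document " + docId);
        }
        Map<String, String> all = new LinkedHashMap<>(storedFields);
        all.putAll(unstored);
        all.putAll(newFields);
        return all;
    }

    public static void main(String[] args) {
        Map<String, String> storedFields = new HashMap<>();
        storedFields.put("subject", "Re: updating document");

        Map<String, String> bodyText = new HashMap<>();
        bodyText.put("body", "the full text of the email, kept outside the index");
        Map<String, Map<String, String>> externalStore = new HashMap<>();
        externalStore.put("msg-1", bodyText);

        Map<String, String> newFields = new HashMap<>();
        newFields.put("folder", "archive");

        Map<String, String> all =
                fieldsForReindex("msg-1", storedFields, externalStore, newFields);
        System.out.println(all.keySet());  // [subject, body, folder]
    }
}
```

In a real update you would then delete the old document and add a new one built from this complete field set, indexing "body" without storing it.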

It can be as simple as storing the text content in a compressed file on your
file-system somewhere, keyed with an ID which is unique to the lucene
document it belongs to.  Or you could use a CLOB field (Text type in SQL
Server) in a database etc.
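
For the compressed-file variant, a minimal sketch using only the JDK's java.util.zip classes might look like the following. The GzipTextStore class, the ".txt.gz" naming and the assumption that IDs are safe to use as file names are all choices made for this example, not anything Lucene prescribes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of a file-based side store: one gzip file per document ID.
class GzipTextStore {
    private final Path dir;

    GzipTextStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    private Path fileFor(String id) {
        // Assumption: ids are safe file names (for example numeric keys).
        return dir.resolve(id + ".txt.gz");
    }

    /** Writes (or overwrites) the text for the given document ID. */
    void put(String id, String text) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(fileFor(id)))) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Reads the text back, decompressing it on the way. */
    String get(String id) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(fileFor(id)));
             ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        GzipTextStore store = new GzipTextStore(Files.createTempDirectory("textstore"));
        store.put("42", "full body of the email");
        System.out.println(store.get("42"));  // prints the text written above
    }
}
```

The same ID would be kept as a small stored field in the Lucene document, so a search hit can be mapped back to its file when the document needs updating.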

The real advantage of this is the speed gain from the index.  Lucene works
best when the index is light-weight.  My recommendation is to think
carefully about the "role" of the index, vs the role of your data storage
approach.

On 8/11/06, Deepan Chakravarthy <codeshepherd@gmail.com> wrote:
>
> On Fri, 2006-08-11 at 01:58 +1000, Jason Polites wrote:
> > Are you storing the contents of the fields in the index?  That is,
> > specifying Field.Store.YES when creating the field?
> >
> > In my experience fields which are not stored are not recoverable from the
> > index (well.. they can be reconstructed but it's a lossy process).  So when
> > you retrieve the document, you lose non-stored fields.
> >
>
> Yes, we have some important fields that are not stored in the index. Is
> there a way to overcome this problem while updating a document?  Will I
> face the same problem with IndexModifier? (I am currently using IndexReader
> and IndexWriter.)
> Thanks
> Deepan
> www.codeshepherd.com
>
>
> > If you are searching on these fields then it would explain why you are
> > losing results.
> >
> > On 8/10/06, Deepan Chakravarthy <codeshepherd@gmail.com> wrote:
> > >
> > > On Thu, 2006-08-10 at 09:16 -0400, Erick Erickson wrote:
> > > > You say "Those documents that we updated are not searchable now". I've got
> > > > to ask the obvious question, did you close and re-open the *searcher*
> > > > (really, the indexreader you use in your searcher)? I suspect you have, but
> > > > thought I'd ask explicitly.
> > > >
> > > > I'd also get a copy of Luke (http://www.getopt.org/luke/) and inspect my
> > > > index after you drop/re-add the data.
> > > I have Luke. When I inspect the index with Luke I find the same result,
> > > i.e. the updated documents are not searchable in the new index.
> > >
> > > I guess IndexModifier uses IndexReader and IndexWriter internally. I am
> > > adding more fields to existing documents in the index, so some of my
> > > documents will have n fields and others n+m fields after updating. Does
> > > the difference in the number of fields affect search in any manner?
> > >
> > >
> > > >
> > > > Actually, have you thought about IndexModifier (I'm using Lucene 2.0)? From
> > > > the javadoc....
> > > >
> > > > <<< A class to modify an index, i.e. to delete and add documents. This class
> > > > hides IndexReader and IndexWriter so that you do not need to care about
> > > > implementation details such as that adding documents is done via
> > > > IndexWriter and deletion is done via IndexReader.>>>
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On 8/9/06, Deepan Chakravarthy <codeshepherd@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > > We have to update a few documents in our index. We have to add an
> > > > > > additional field to them. We did as follows
> > > > >
> > > > > 1)read the documents of our interest using IndexReader
> > > > > 2)copy them to a temporary doc object (temp_doc)
> > > > > 3)delete the document in the index
> > > > > 4)close the IndexReader
> > > > > 5)open the IndexWriter
> > > > > 6)add a new field to (temp_doc)
> > > > > 7)add the (temp_doc) to the index using IndexWriter
> > > > > 8)close the IndexWriter
> > > > >
> > > > >
> > > > > The problem:
> > > > > > 1)Those documents that we updated are not searchable now. When we
> > > > > > perform a search (using IndexSearcher) we do not find any of the
> > > > > > documents we updated.
> > > > > >
> > > > > > 2)But we are still able to read the updated documents using IndexReader.
> > > > >
> > > > >
> > > > > Questions
> > > > > > 1)When I want to update a document by adding a field, should I reindex
> > > > > > all the fields again? Will copying the existing document and adding
> > > > > > the new field not be enough?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > >
> > >
> > >
> > >
>
>
>
>
