lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: What is the best way to handle the primary key case during lucene indexing
Date Mon, 16 Nov 2009 20:19:09 GMT
From your original e-mail: "if the metadata contains the primary key defined,
we have to do the search/update for every row based on the primary key".

Jake and I are both assuming that you're using primary key in the database
sense. That is, there is exactly one document in the index with that primary
key, composite or not. Your statement above seems to indicate otherwise,
in which case you should pretty much disregard anything we've said and
you're stuck running the query that assembles the list of hits and updating
each one.

If you really mean a primary key, Jake's suggestion is certainly the
easiest. If you absolutely *can't* make a single-term primary key for your
index, you could use the iw.deleteDocuments(Query) form and re-add the
document.
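
As a rough illustration of the single-term key Jake describes below (a sketch
under my own assumptions, not something taken from his message): one way to
concatenate the subkeys is to length-prefix each one, so that ("ab", "c") and
("a", "bc") cannot collapse into the same string. The class name CompositeKey,
the field name "pk", and the length-prefix scheme are all hypothetical; the
Lucene calls appear only in comments and are not compiled here.

```java
/**
 * Builds one string from several subkeys so it can be used as a single Term.
 * Hypothetical helper for illustration; subkeys are assumed to be non-null.
 */
final class CompositeKey {
    private CompositeKey() {}

    /** Length-prefixes each subkey, e.g. ("ab", "c") becomes "2:ab1:c". */
    static String of(String... subkeys) {
        StringBuilder sb = new StringBuilder();
        for (String s : subkeys) {
            // The length up to ':' says exactly how many characters follow,
            // so subkeys that contain ':' or digits stay unambiguous.
            sb.append(s.length()).append(':').append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(of("ab", "c"));
        System.out.println(of("a", "bc"));
        // Possible use with Lucene 2.9 (assumed field name "pk"):
        //   String pk = CompositeKey.of(field1Value, field2Value);
        //   doc.add(new Field("pk", pk, Field.Store.NO, Field.Index.NOT_ANALYZED));
        //   writer.updateDocument(new Term("pk", pk), doc);
    }
}
```

The resulting string would be indexed un-tokenized, so the whole composite key
is exactly one term, and that same term is what updateDocument would match on.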


Best
Erick

On Mon, Nov 16, 2009 at 3:00 PM, Jake Mannix <jake.mannix@gmail.com> wrote:

> You will want to have one Lucene field which contains this composite key.
> It could be the un-tokenized concatenation of all of the subkeys, for
> example; then one Term would hold the full composite key, and the
> updateDocument technique would work fine.
>
>  -jake
>
> On Mon, Nov 16, 2009 at 11:09 AM, java8964 java8964 <java8964@hotmail.com> wrote:
>
> >
> > But can IndexWriter.updateDocument(Term, Document) handle the composite
> > key case?
> >
> > If my primary key contains field1 and field2, can I use one Term to
> > include both field1 and field2?
> >
> > Thanks
> >
> > > Date: Mon, 16 Nov 2009 09:44:35 -0800
> > > Subject: Re: What is the best way to handle the primary key case during lucene indexing
> > > From: jake.mannix@gmail.com
> > > To: java-user@lucene.apache.org
> > >
> > > The usual way to do this is to use:
> > >
> > >    IndexWriter.updateDocument(Term, Document)
> > >
> > > This method deletes all documents containing the given Term (this would
> > > be your primary key), and then adds the Document you want to add. This
> > > is the traditional way to do updates, and it is fast.
> > >
> > >   -jake
> > >
> > >
> > >
> > > On Mon, Nov 16, 2009 at 9:15 AM, java8964 java8964 <java8964@hotmail.com> wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > In our application, we allow the user to define a primary key for the
> > > > documents. We are using Lucene 2.9.
> > > > In this case, when we index the data coming from the client, if the
> > > > metadata contains the primary key defined, we have to do the
> > > > search/update for every row based on the primary key.
> > > >
> > > > Here are our current problems:
> > > >
> > > > 1) If the metadata coming from the client defines a primary key (which
> > > >    can contain one or multiple fields), then for the data supplied by
> > > >    the client we have to make sure that a later row overrides a
> > > >    previous row if they have the same primary key.
> > > > 2) To do the above, we have to loop through the data first to check
> > > >    whether any later rows contain the same PK as previous rows, so we
> > > >    build a map in memory to override the previous rows with the latest
> > > >    ones. This is a very expensive operation.
> > > > 3) Even then, for every row that survives the above filter step, we
> > > >    still have to search the current index to see whether any data with
> > > >    the same PK exists, so that we can remove it before we add the new
> > > >    data to the index.
> > > >
> > > > I want to know if anyone has had a similar PK requirement with Lucene.
> > > > What is the best way to index data in this case?
> > > >
> > > > First, I am wondering whether it is possible to remove step 2 above.
> > > > The problem with Lucene is that when we add a document to the index,
> > > > we can NOT search for it before committing it. But we only commit
> > > > once, when the whole data file is finished. So we have to loop through
> > > > the data once to check whether any rows in the data file share the
> > > > same PK.
> > > > I am wondering if there is a way for the index writer, before it
> > > > commits anything, to merge on the PK as we add each new document. What
> > > > I mean is that if the same PK already exists in any previously added
> > > > document, the writer would just remove that document and let the newly
> > > > added one with the same PK take its place. If we can do this, then the
> > > > whole pre-checking step can be removed.
> > > >
> > > > Second, for step 3 above, if searching the existing index is NOT
> > > > avoidable, what is the fastest way to search by the PK? Of course we
> > > > have already indexed all the PK fields. When we add new data, we have
> > > > to search the existing index by the PK fields for every row to see
> > > > whether it exists. If it does, we remove it and add the new one.
> > > > We construct the query from the PK fields at run time, then search row
> > > > by row. This is also very bad for indexing performance.
> > > >
> > > > Here is what I am thinking:
> > > > 1) Can I use IndexReader.term(terms)? I heard it is much faster than
> > > >    searching with a query. Is that right?
> > > > 2) Currently we do the search row by row. Should I do it in batches?
> > > >    For example, I could combine 100 PK searches into one search using
> > > >    a Boolean query, so one search would give me back all the data for
> > > >    these 100 PKs that are in the index. Then I can remove them from
> > > >    the index using the result set. In this case, I only need 1/100 as
> > > >    many search requests as before. In theory this will be much faster
> > > >    than going row by row.
> > > >
> > > >
> > > > Please let me know if you have any feedback. If you have ever dealt
> > > > with PK data support, please share your thoughts and experience.
> > > >
> > > > Thanks for your kind help.
> > > >
> >
>
