lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 java8964 <java8...@hotmail.com>
Subject RE: What is the best way to handle the primary key case during lucene indexing
Date Mon, 16 Nov 2009 19:09:14 GMT

But can IndexWriter.updateDocument(Term, Document) handle the composite key case?

If my primary key contains field1 and field2, can I use one Term to include both field1 and
field2?

Thanks

> Date: Mon, 16 Nov 2009 09:44:35 -0800
> Subject: Re: What is the best way to handle the primary key case during lucene 	indexing
> From: jake.mannix@gmail.com
> To: java-user@lucene.apache.org
> 
> The usual way to do this is to use:
> 
>    IndexWriter.updateDocument(Term, Document)
> 
> This method deletes all documents with the given Term in it (this would be
> your primary key), and then adds the Document you want to add.  This is the
> traditional way to do updates, and it is fast.
> 
>   -jake
> 
> 
> 
> On Mon, Nov 16, 2009 at 9:15 AM, java8964 java8964 <java8964@hotmail.com>wrote:
> 
> >
> > Hi,
> >
> > In our application, we will allow the user to create a primary key defined
> > in the document. We are using lucene 2.9.
> > In this case, when we index the data coming from the client, if the
> > metadata contains the primary key defined,
> > we have to do the search/update for every row based on the primary key.
> >
> > Here is our current problems:
> >
> > 1) If the meta data coming from client defined a primary key (which can
> > contain one or multi fields),
> >    then for the data supplied from the client, we have to make sure that
> > later row will override the previous row, if they have the same primary key
> > as the data.
> > 2) To do the above, we have to loop through the data first, to check if any
> > later rows containing the same PK as the previous rows, so we will build the
> > MAP in the memory to override the previous one by the latest ones.
> > This is a very expensive operation.
> > 3) Even in this case, for every row after the above filter steps, we still
> > have to search the current index to see if any data with the same PK exist
> > or not. So we have to do the remove before we add the new data in the index.
> >
> > I want to know if anyone has the same requirement like this PK using the
> > lucene? What is the best way to index data in this case?
> >
> > First, I am thinking if it is possible to remove the above step2?
> > the problem for the lucene is that when we add document in the index, we
> > can NOT search it before commit it.
> > But we only commit once when the whole data file is finished. So we have to
> > loop through the data once to check to see if any data sharing the same PK
> > in the data file.
> > I am wondering if there is a way in the index writer, before it commits
> > anything, when we add the new document into it, it can do the merging of the
> > PK data? What I mean is that if the same PK data already exist in any
> > previous added document, just remove it and let the new added data
> > containing the same PK data take the place? If we can do this, then the
> > whole pre checking data step can be removed.
> >
> > Second, for the above step 3, if the searching the existing index is NOT
> > avoidable, what is the fast way to search by the PK? Of course we already
> > indexed all the PK fields. When we add new data, we have to search every row
> > of existing index by the PK fields, to see if it exist or not. If it does,
> > remove it and add the new one.
> > We constructor the query by the PK fields at run time, then search it row
> > by row. This is also very bad as the indexing the data for performance.
> >
> > Here is what I am thinking?
> > 1) Can I use the Indexreader.term(terms)? I heard it is much faster than
> > the query searching? Is that right?
> > 2) Currently we are do the search row by row? Should I do it in batching?
> > Like I will combine 100 PK search into one search, using Boolean term? So
> > one search will give me back all the data in this 100 PK which are in the
> > index. Then I can remove them from the index using the result set. In this
> > case, I only need to do 1/100 search requests as before? This will much
> > faster than row by row in theory.
> >
> >
> > Please let me know any feedbacks? If you ever dealed with PK data support,
> > please share some thougths and experience.
> >
> > Thanks for your kind help.
> >
> > _________________________________________________________________
> > Hotmail: Free, trusted and rich email service.
> > http://clk.atdmt.com/GBL/go/171222984/direct/01/
> >
 		 	   		  
_________________________________________________________________
Hotmail: Powerful Free email with security by Microsoft.
http://clk.atdmt.com/GBL/go/171222986/direct/01/
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message