lucene-java-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: What is the best way to handle the primary key case during lucene indexing
Date Mon, 16 Nov 2009 17:44:35 GMT
The usual way to do this is to use:

   IndexWriter.updateDocument(Term, Document)

This method deletes all documents that contain the given Term (which would be
your primary key) and then adds the Document you want to add.  This is the
traditional way to do updates, and it is fast.
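
A minimal sketch of that call against the 2.9 API (the index path, the "pk"
field name, and the flattening of a possibly composite key into a single
untokenized value are placeholders, not anything from the original schema):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PrimaryKeyUpsert {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/pk-index")),   // placeholder path
                    new StandardAnalyzer(Version.LUCENE_29),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // "pk" holds the (possibly composite) primary key as one
            // untokenized value so it can be matched as a single Term.
            Document doc = new Document();
            doc.add(new Field("pk", "order-42",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", "latest row for this key",
                    Field.Store.YES, Field.Index.ANALYZED));

            // Deletes every document whose "pk" term equals "order-42",
            // then adds the new document -- an atomic upsert per key.
            writer.updateDocument(new Term("pk", "order-42"), doc);

            writer.commit();
            writer.close();
        }
    }

Note that the delete-by-term also applies to documents added earlier through
the same IndexWriter session, even before any commit, so a later row with the
same key replaces an earlier one from the same batch without any pre-scan.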

  -jake



On Mon, Nov 16, 2009 at 9:15 AM, java8964 java8964 <java8964@hotmail.com> wrote:

>
> Hi,
>
> In our application, we allow the user to define a primary key for each
> document. We are using Lucene 2.9.
> When we index data coming from the client, if the metadata defines a
> primary key, we have to do a search/update for every row based on that
> key.
>
> Here are our current problems:
>
> 1) If the metadata coming from the client defines a primary key (which can
> consist of one or multiple fields), then within the data supplied by the
> client we have to make sure that a later row overrides any earlier row
> with the same primary key.
> 2) To do that, we currently loop through the data first to check whether
> any later rows share a PK with earlier rows, building a map in memory so
> that the latest row replaces the earlier ones. This is a very expensive
> operation.
> 3) Even then, for every row that survives that filtering step we still
> have to search the current index to see whether a document with the same
> PK already exists, so that we can remove it before adding the new data to
> the index.
>
> I want to know whether anyone has handled the same primary-key
> requirement with Lucene. What is the best way to index data in this case?
>
> First, I am wondering whether it is possible to remove step 2 above.
> The problem with Lucene is that once we add a document to the index, we
> cannot search for it before it is committed, and we only commit once the
> whole data file has been processed. That is why we have to loop through
> the data up front to check whether any rows in the file share the same PK.
> Is there a way for the index writer, before it commits anything, to merge
> on the PK as new documents are added? That is, if a document with the
> same PK has already been added, just remove it and let the newly added
> document with that PK take its place. If that were possible, the whole
> pre-checking step could be removed.
>
> Second, regarding step 3 above: if searching the existing index is NOT
> avoidable, what is the fastest way to search by PK? We do already index
> all the PK fields. When we add new data, we have to search the existing
> index by the PK fields for every row to see whether it exists; if it
> does, we remove it and add the new one.
> We construct the query from the PK fields at run time and then search
> row by row, which is also very bad for indexing performance.
>
> Here is what I am thinking:
> 1) Can I use IndexReader.terms(Term)? I heard it is much faster than
> running a query search. Is that right?
> 2) Currently we search row by row. Should I do it in batches instead?
> For example, combine 100 PK lookups into one search using a Boolean
> query, so that a single search gives me back all the documents for those
> 100 PKs that are in the index, and then remove them using the result set.
> That way I would only issue 1/100 of the search requests, which in theory
> should be much faster than row by row.
>
>
> Please let me know any feedback. If you have ever dealt with PK support,
> please share your thoughts and experience.
>
> Thanks for your kind help.
>
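
On the two side questions in the quoted mail: a raw postings lookup through
IndexReader.termDocs(Term) is indeed cheaper than building and running a
Query for a single primary-key term, and deletes can be buffered for a whole
batch of keys with no search at all via IndexWriter.deleteDocuments(Term[]).
A rough sketch against the 2.9 API (the "pk" field name and the helper names
are made up for illustration):

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class PkLookupSketch {

        // True if at least one document carries the given primary-key term.
        static boolean pkExists(IndexReader reader, String pk) throws IOException {
            TermDocs td = reader.termDocs(new Term("pk", pk));
            try {
                return td.next();   // walks the postings list; no Query parsing or scoring
            } finally {
                td.close();
            }
        }

        // Buffers deletes for a whole batch of keys in one call;
        // keys with no matching document are simply no-ops.
        static void deleteBatch(IndexWriter writer, List<String> pks) throws IOException {
            Term[] terms = new Term[pks.size()];
            for (int i = 0; i < pks.size(); i++) {
                terms[i] = new Term("pk", pks.get(i));
            }
            writer.deleteDocuments(terms);
        }
    }

With updateDocument handling the replace case, neither step is strictly
required, but the batched delete is handy when rows are removed without
being replaced.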
