lucene-java-user mailing list archives

From java8964 java8964 <>
Subject What is the best way to handle the primary key case during lucene indexing
Date Mon, 16 Nov 2009 17:15:40 GMT


In our application, we allow the user to define a primary key for the documents.
We are using Lucene 2.9.
When we index the data coming from the client, if the metadata defines a primary key, we have to do a search/update for every row based on that primary key.

Here are our current problems:

1) If the metadata coming from the client defines a primary key (which can contain one or multiple fields), then for the data supplied by the client we have to make sure that a later row overrides an earlier row when both have the same primary key.
2) To do that, we first have to loop through the data to check whether any later rows contain the same PK as earlier rows, so we build a map in memory and let the latest row override the previous one (roughly the sketch after this list). This is a very expensive operation.
3) Even then, for every row that survives the filtering step above, we still have to search the existing index to see whether any data with the same PK already exists. If it does, we have to remove it before we add the new data to the index.
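
(For reference, the pre-checking step currently looks roughly like the sketch below; the Row type and its compositeKey() helper are simplified placeholders for our real data model.)

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class PkPreCheck {
        // Placeholder for one incoming row of client data.
        public interface Row {
            String compositeKey(); // the PK field values joined into one string
        }

        // Keep only the last row seen for each composite PK; a LinkedHashMap
        // preserves the original order while letting later rows replace earlier ones.
        public static Map<String, Row> dedupeByPk(List<Row> rows) {
            Map<String, Row> latest = new LinkedHashMap<String, Row>();
            for (Row row : rows) {
                latest.put(row.compositeKey(), row);
            }
            return latest;
        }
    }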

I want to know whether anyone has had the same PK requirement with Lucene. What is the best way to index data in this case?

First, I am wondering whether it is possible to remove step 2 above.
The problem with Lucene is that when we add a document to the index, we can NOT search for it before we commit it.
But we only commit once, when the whole data file is finished, so we have to loop through the data once to check whether any rows in the file share the same PK.
Is there a way for the index writer, before it commits anything, to merge on the PK as new documents are added? What I mean is: if the same PK already exists in a previously added document, just remove that document and let the newly added document with the same PK take its place. If we could do this, the whole pre-checking step could be removed.
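
(Something like the sketch below is what I have in mind, assuming we flatten the composite PK into a single untokenized "pk" field; I am not sure whether IndexWriter.updateDocument(Term, Document), which deletes by term and then adds, already gives this behaviour for documents added earlier in the same uncommitted session.)

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class PkIndexer {
        // "pk" holds the flattened composite key, e.g. "custId=42|orderId=7",
        // indexed as a single untokenized term.
        public static void addOrReplace(IndexWriter writer, String pkValue, Document doc)
                throws IOException {
            doc.add(new Field("pk", pkValue, Field.Store.NO, Field.Index.NOT_ANALYZED));
            // Delete any document carrying the same PK term, then add the new one,
            // all buffered inside the writer before the final commit.
            writer.updateDocument(new Term("pk", pkValue), doc);
        }
    }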

Second, for step 3 above, if searching the existing index is NOT avoidable, what is the fastest way to search by the PK? Of course we already index all the PK fields. When we add new data, we have to search the existing index by the PK fields for every row to see whether it already exists; if it does, we remove it and add the new one.
We construct the query from the PK fields at run time, then search row by row. This is also very bad for indexing performance.
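
(The per-row lookup we build today looks roughly like this; the field names are made up, and we assume the PK fields are indexed NOT_ANALYZED so exact term matches work.)

    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class PkQueryBuilder {
        // One exact-match clause per PK field, e.g. {custId=42, orderId=7}
        // becomes +custId:42 +orderId:7.
        public static BooleanQuery pkQuery(Map<String, String> pkValues) {
            BooleanQuery query = new BooleanQuery();
            for (Map.Entry<String, String> e : pkValues.entrySet()) {
                query.add(new TermQuery(new Term(e.getKey(), e.getValue())),
                          BooleanClause.Occur.MUST);
            }
            return query;
        }
    }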

Here is what I am thinking:
1) Can I use IndexReader.terms()? I have heard it is much faster than a query search. Is that right?
2) Currently we do the search row by row. Should I do it in batches instead? For example, I could combine 100 PK lookups into one search using a Boolean query, so one search gives me back all of those 100 PKs that are already in the index, and then I remove them from the index using the result set. In that case I would only issue 1/100 of the search requests as before, which in theory should be much faster than row by row (see the sketch after this list).
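
(A rough sketch of the batched variant I have in mind, again assuming a single flattened "pk" field; I do not know whether handing the writer arrays of Terms like this is the idiomatic way.)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class BatchedPkDelete {
        private static final int BATCH_SIZE = 100;

        // Buffer PK terms and hand them to the writer 100 at a time,
        // instead of issuing one lookup/delete per row.
        public static void deleteInBatches(IndexWriter writer, List<String> pkValues)
                throws IOException {
            List<Term> batch = new ArrayList<Term>(BATCH_SIZE);
            for (String pk : pkValues) {
                batch.add(new Term("pk", pk));
                if (batch.size() == BATCH_SIZE) {
                    writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
            }
        }
    }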

Please let me know any feedback. If you have ever dealt with PK support like this, please share some thoughts and experience.

Thanks for your kind help.