lucene-dev mailing list archives

From Oscar Picasso <oscgoo...@yahoo.com>
Subject Transactional Directories
Date Mon, 14 Feb 2005 18:02:23 GMT
Hi,

I am currently implementing a Directory backed by a Berkeley DB that I intend to
release as an open source project.

Besides the internal implementation, it differs from the one in the sandbox in
that it is implemented with the Berkeley DB Java Edition.

Using the Java Edition allows easier distribution: you just add a single jar to
your classpath and you have a fully functional Berkeley DB embedded in your
application, without the hassle of installing the C Berkeley DB.

While initially implemented with the Java Edition this Directory can easily be
ported to a Berkeley DB C edition or a Berkeley DB XML (for example to use
Berkeley DB XML + Lucene as the base for a document management system).

This implementation works fine and I am quite happy with its speed.

There is still an important problem I face, and it has to do with how to deal
with transactions. After all, the purpose of a Berkeley DB implementation, or a
JDBC one for that matter, is its ability to use transactions.

After looking at the Andy Varga code, it seems that the implementation in the
sandbox faces the same problem (correct me if I am wrong). I have also learned
that the JDBC directory was not implemented with transactions in mind.

Here is the problem.

If I do something like this:
-- case A --
<pseudo-code>
+begin transaction
 new IndexWriter
 create/update/delete objects in the database
 indexWriter.addDocument (related to the objects)
 indexWriter.close()
+commit
</pseudo-code>
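For what it's worth, the case A pattern can be modeled with a toy in-memory
store standing in for both the object database and the index files (TxnStore,
CaseA and everything below are invented for illustration only, not real Lucene
or Berkeley DB API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of case A: object writes and index writes go through the same
// transaction, so they are committed or rolled back together.
class TxnStore {
    final Map<String, String> committed = new HashMap<>();
    Map<String, String> pending;

    void begin()               { pending = new HashMap<>(committed); }
    void put(String k, String v) { pending.put(k, v); }
    void commit()              { committed.clear(); committed.putAll(pending); pending = null; }
    void rollback()            { pending = null; }
}

public class CaseA {
    public static void main(String[] args) {
        TxnStore store = new TxnStore();

        store.begin();                    // +begin transaction
        store.put("object:1", "data");    //  create objects in the database
        store.put("segment:_1", "bytes"); //  index writes go through the same store
        store.commit();                   // +commit: both become durable together

        store.begin();
        store.put("object:2", "data");
        store.put("segment:_2", "bytes");
        store.rollback();                 // a failure: both are undone together

        System.out.println(store.committed.keySet());
        // object:1 and segment:_1 survive; object:2 and segment:_2 are gone
    }
}
```

The point of the toy is only that the database and the index can never disagree,
because they share one atomic commit.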

Everything is fine. The operations are transactionally protected. You can even
do many writes/updates. As long as everything is enclosed by the pairs
begin-transaction/new-index-writer ... index-writer.close/commit, everything is
properly undone if any operation fails inside the transaction.

For batch insertions the whole batch is rolled back, but at least your object
database stays consistent with the index.

If you do mostly batch insertions and relatively few random individual
insertions, that's fine.

However, with a relatively high number of random insertions, the cost of the
"new IndexWriter / indexWriter.close()" performed for each insertion is too high.
Unfortunately this is a common case for some kinds of applications, and it is
exactly where a transactional directory would be the most useful.

In such a case you would like to do something like this:
-- case B --
<pseudo-code>
new IndexWriter
 ...
+begin transaction-1
 create/update/delete objects in the database
 indexWriter.addDocument (related to the objects)
+commit
...
+begin transaction-2
 create/update/delete objects in the database
 indexWriter.addDocument (related to the objects)
+commit
...
indexWriter.close()
</pseudo-code>

The benefit would be to protect individual insertions while avoiding the cost
of creating a new IndexWriter each time.

It doesn't work, however. Here is my understanding.

Suppose that in case B, transaction-1 fails and transaction-2 succeeds.

In that case the underlying database system rolls back all the writes done
during transaction-1, whether they were related to the objects stored in the
database or to the index (the writes done to the IndexOutput are also undone).
From the database's point of view, consistency is maintained between the stored
objects and the index.

The problem is that after transaction-1, Lucene still 'remembers' the segment(s)
it wrote during transaction-1. Later, Lucene may try to perform some operation
based on these references (when merging segments, I think) while the underlying
segment files no longer exist. This is where an exception is thrown.

The solution would be to instruct Lucene to 'forget', or undo any reference to,
the segments created during transaction-1 in case of rollback.

I have noticed that references to the segments are stored in a segmentInfos
structure. I was thinking about removing the segmentInfos entries created during
transaction-1 in case of a rollback, but I don't know whether that is enough,
and/or whether it is potentially dangerous.
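The idea can be sketched with a toy model (again, every name below is invented
for illustration; the real segmentInfos handling inside Lucene is certainly more
involved than this): on rollback, the writer drops the in-memory segment
references recorded during the failed transaction, so a later merge never
touches files the database has already deleted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy writer: remembers segment names in memory (like Lucene's segmentInfos)
// alongside the set of "files" that actually exist on disk.
public class ToyWriter {
    final List<String> segmentInfos = new ArrayList<>(); // what the writer remembers
    final Set<String>  files       = new HashSet<>();    // what actually exists
    final List<String> txnSegments = new ArrayList<>();  // segments written in the open txn

    void addSegment(String name) {
        files.add(name);
        segmentInfos.add(name);
        txnSegments.add(name);
    }

    void commitTxn() { txnSegments.clear(); }

    void rollbackTxn() {
        files.removeAll(txnSegments);        // the database undoes the file writes...
        segmentInfos.removeAll(txnSegments); // ...and we also forget the references
        txnSegments.clear();
    }

    // A later operation that walks the remembered segments. Without the
    // segmentInfos.removeAll(...) line above, this would hit a dangling
    // reference after a rollback and throw.
    void merge() {
        for (String s : segmentInfos)
            if (!files.contains(s))
                throw new IllegalStateException("missing segment " + s);
    }

    public static void main(String[] args) {
        ToyWriter w = new ToyWriter();
        w.addSegment("_1"); w.rollbackTxn(); // transaction-1 fails
        w.addSegment("_2"); w.commitTxn();   // transaction-2 succeeds
        w.merge();                           // safe: only _2 is remembered
        System.out.println(w.segmentInfos);  // prints [_2]
    }
}
```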

I would really appreciate any comments about this idea, and also about my
understanding of the Lucene indexing process.

If I/we could find a solution, it would also benefit a JDBC Directory
implementation.

Thanks.

Oscar

P.S.: If and when my implementation is fully functional, is there a place in
the Lucene project where I could release it? (Maybe the sandbox).



		


