db-derby-user mailing list archives

From Geoffrey Hendrey <geoff_hend...@yahoo.com>
Subject Re: Lucene integration
Date Mon, 16 Mar 2009 15:06:21 GMT

Yeah, that's exactly what I had in mind. The Lucene server would receive the binary data,
unpack it, and use the Lucene API to create and modify the Lucene index.

Does Derby have utilities for unmarshalling the ".dat" format? I assume there must be. If not,
there must at least be a clear spec for the binary format.

I'd like to clarify that Lucene doesn't expect anything as "input". I think this is a common
misconception about Lucene. Although Lucene does ship some sample index builders for document
formats like HTML, in practice almost all applications that use Lucene simply parse the data
that needs to be indexed and use the Lucene IndexWriter to manually stuff the desired
information into the index. There is no need to retain the original data source. In my current
project I build a 12 GB Lucene index by parsing 1,950,000 source data records.
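
To make the "parse it yourself, then stuff it into the index" pattern concrete, here is a minimal sketch. It deliberately does not use the Lucene API: ToyIndex is a hypothetical in-memory stand-in for what IndexWriter would be fed, and the "id|free text" record format is an assumption made only for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a search index: maps each lower-cased term to the ids of
// the records that contain it. This is NOT the Lucene API.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Roughly analogous to adding one document to an index.
    void add(int recordId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(recordId);
            }
        }
    }

    // Returns the ids of all records containing the given term.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}

class IndexBuildDemo {
    // Assumed source format: "id|free text", one record per line.
    // The application parses each record itself; the index never sees the raw source.
    static ToyIndex build(List<String> lines) {
        ToyIndex index = new ToyIndex();
        for (String line : lines) {
            String[] parts = line.split("\\|", 2);
            index.add(Integer.parseInt(parts[0].trim()), parts[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        ToyIndex index = build(Arrays.asList(
            "1|Derby is an embedded database",
            "2|Lucene builds a search index"));
        System.out.println(index.search("index"));
    }
}
```

Once build() returns, the original lines can be thrown away. Only the terms and record ids that were pushed into the index remain, which is the point about not needing to retain the source data.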

On Mar 16, 2009, at 7:00 AM, Jørgen Løland <Jorgen.Loland@Sun.COM> wrote:

Geoffrey Hendrey wrote:
Would it be possible for the Derby team to implement Lucene support in the following way?
Hook into the asynchronous replication protocol to send committed transactions to a Lucene
receiver. I think it is acceptable for the free text search to only "see" committed data.
An alternative to opening the protocol would be to create an abstract ReceiverServer for
asynchronous data, so that LuceneReceiver is just a subclass. Thoughts?
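
The shape of that proposal could be sketched roughly as below. Only the names ReceiverServer and LuceneReceiver come from the proposal itself. CommittedChange, deliver and onCommit are hypothetical names invented for illustration, and none of this is an existing Derby API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type: one committed row change, already decoded to text.
final class CommittedChange {
    final String table;
    final String rowText;

    CommittedChange(String table, String rowText) {
        this.table = table;
        this.rowText = rowText;
    }
}

// Hypothetical base class: owns delivery, leaves the handling to subclasses.
abstract class ReceiverServer {
    // Called once per committed transaction with that transaction's changes.
    protected abstract void onCommit(List<CommittedChange> changes);

    // Stand-in for the network side; a real server would read from a socket.
    final void deliver(List<CommittedChange> changes) {
        if (!changes.isEmpty()) {
            onCommit(changes);
        }
    }
}

// Subclass that would pass each change on to Lucene; here it only records the text.
class LuceneReceiver extends ReceiverServer {
    final List<String> indexed = new ArrayList<>();

    @Override
    protected void onCommit(List<CommittedChange> changes) {
        for (CommittedChange c : changes) {
            indexed.add(c.table + ": " + c.rowText);
        }
    }
}
```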

What does Lucene expect as input? I doubt that the replication code can be easily integrated
with Lucene because...

1) The information replication sends from a master to a slave is a physical transaction log,
which is in a derby-internal format. It is not human readable. To get an idea of what it looks
like, you can take a look at logN.dat in one of your databases' log/ directories.
2) Replication does not distinguish between committed and uncommitted data; log for all transactions,
committed or not, is sent to the slave.

This means that before anything is fed into Lucene, the information has to be processed. This
processing is effectively Derby's crash recovery code and is non-trivial to extract.
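
As a rough illustration of the smallest piece of that processing, the sketch below buffers changes per transaction and releases them only when a commit record arrives. The LogRecord type is hypothetical and heavily simplified. Derby's real log is physical and in an internal format, so this shows only the committed-versus-uncommitted filtering, not the decoding or the rest of recovery.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified logical log record; it does not mirror Derby's format.
final class LogRecord {
    enum Kind { CHANGE, COMMIT, ABORT }

    final long txId;
    final Kind kind;
    final String payload; // null for COMMIT and ABORT

    LogRecord(long txId, Kind kind, String payload) {
        this.txId = txId;
        this.kind = kind;
        this.payload = payload;
    }
}

// Holds each transaction's changes back until its outcome is known.
class CommittedOnlyFilter {
    private final Map<Long, List<String>> pending = new HashMap<>();
    private final List<String> committed = new ArrayList<>();

    void accept(LogRecord r) {
        switch (r.kind) {
            case CHANGE:
                // Outcome unknown so far: buffer under the transaction id.
                pending.computeIfAbsent(r.txId, k -> new ArrayList<>()).add(r.payload);
                break;
            case COMMIT:
                // Transaction committed: release its buffered changes in order.
                List<String> done = pending.remove(r.txId);
                if (done != null) {
                    committed.addAll(done);
                }
                break;
            case ABORT:
                // Transaction rolled back: its changes must never reach the index.
                pending.remove(r.txId);
                break;
        }
    }

    // Changes that are safe to hand to an indexer, in commit order.
    List<String> committed() {
        return committed;
    }
}
```

Changes from a transaction that never logs a commit simply stay in pending and are never released.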

Note that I'm not familiar with Lucene.

-- 
Jørgen Løland

