lucene-dev mailing list archives

From: Giulio Cesare Solaroli <slrg...@ibn-italy.com>
Subject: Re: Indexing a database
Date: Tue, 10 Jun 2003 21:07:20 GMT
Hi Bryan,

I hope you don't mind me replying to the whole list too.

On Tuesday, Jun 10, 2003, at 22:35 Europe/Rome, Bryan LaPlante wrote:

> I would be interested in looking at your solution. I have written four
> objects so far, and my solution gives me complete control over how each
> column in each row of a result set gets indexed or not indexed,
> whichever the case may be. I am working on a solution that will create
> an index that will bring results based on business rules instead of
> text queries.

In this regard, your solution is definitely more sophisticated than ours.
Our goal was to be able to replace Intermedia for some searches we wanted
to perform; knowing in advance what kinds of requests we needed to answer
allowed us to take a few shortcuts.

We have a custom query (with the SQL coded inside our classes; I know
it's ugly, but it helped us get out of trouble much faster) that selects
all the information we want to be able to search with Lucene; we also
include foreign key values to other meaningful tables.
The only field we store in the Lucene index is the primary key of the
main record, as the search results are handled by an application that
uses the primary key to load the real record straight from the DB. In
this way we have considerably reduced the size of the Lucene index,
which at the moment is about 2.6 GB.
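
To give the idea, here is a minimal sketch of how such a document could
be built with the Lucene 1.x API; the column names (id, title, body,
customer_id) are invented for the example, not our real schema:

    import java.sql.ResultSet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class RowToDocument {
        // Build a Lucene document from one row of the custom query.
        // Only the primary key ends up stored in the index.
        public static Document toDocument(ResultSet rs) throws Exception {
            Document doc = new Document();
            // stored + indexed, untokenized: the application uses this
            // to load the real record straight from the DB
            doc.add(Field.Keyword("pk", rs.getString("id")));
            // indexed + tokenized, NOT stored: keeps the index small
            doc.add(Field.UnStored("title", rs.getString("title")));
            doc.add(Field.UnStored("body", rs.getString("body")));
            // foreign key to another meaningful table:
            // store=false, index=true, token=false
            doc.add(new Field("customer_id", rs.getString("customer_id"),
                              false, true, false));
            return doc;
        }
    }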

> My current problem to solve is how to split up the processing without
> running out of memory or disk space, or possibly just dropping the
> connection. I was thinking of creating a raw index where only the
> identifier for each column and row would be saved, but I would rather
> figure out a way to re-query the database incrementally and handle it
> that way. Any thoughts you have are welcome.

We need to keep a table of about 3 million records indexed, with 40
thousand new records arriving daily.
To handle this situation, we have created a support table with
basically only a foreign key to the table we want to index.
We have then created a trigger that, whenever a new record is inserted
into the real table, inserts its primary key value into the support
table as well (we never update these records, just delete them, so a
simple on-insert trigger is enough for the moment).
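
As a rough sketch (plain JDBC, Oracle syntax; the table and column
names are again invented for the example), the one-time setup could
look like this:

    import java.sql.Connection;
    import java.sql.Statement;

    public class QueueSetup {
        // One-time setup of the support table and the on-insert trigger.
        public static void create(Connection con) throws Exception {
            Statement st = con.createStatement();
            // the queue: just a foreign key to the table we index
            st.executeUpdate(
                "CREATE TABLE to_index (" +
                "  doc_id NUMBER NOT NULL REFERENCES documents (id))");
            // every new record also leaves its primary key in the queue
            st.executeUpdate(
                "CREATE OR REPLACE TRIGGER documents_to_index " +
                "AFTER INSERT ON documents FOR EACH ROW " +
                "BEGIN INSERT INTO to_index (doc_id) VALUES (:NEW.id); END;");
            st.close();
        }
    }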

At this point, you can use the "index" of new records that haven't
been indexed yet (provided by the support table) as a reference for
going through the whole data set. Having a support table greatly
simplifies your task, as you can limit your batch to any number of
records (say 1000 records per run) by adding a simple rownum < 1000
clause to your SQL query (this is Oracle syntax, but I am sure any DB
has its own way to provide the same feature).

While stepping through the selected data set, all you have to do is
(see the sketch after the list):
- select one document (using the reference stored in the support table);
- index the document;
- delete the reference from the support table.
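
Putting it together, one such run could look more or less like this,
reusing the invented names and the toDocument() helper from the
sketches above (error handling omitted):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class IndexBatch {
        // One run: pull up to 1000 queued ids, index each row, and
        // delete its reference from the queue once it is indexed.
        public static void run(Connection con, String indexDir)
                throws Exception {
            // false = append to an existing index
            IndexWriter writer =
                new IndexWriter(indexDir, new StandardAnalyzer(), false);
            Statement st = con.createStatement();
            ResultSet ids = st.executeQuery(
                "SELECT doc_id FROM to_index WHERE rownum < 1000");
            PreparedStatement load = con.prepareStatement(
                "SELECT id, title, body, customer_id " +
                "FROM documents WHERE id = ?");
            PreparedStatement done = con.prepareStatement(
                "DELETE FROM to_index WHERE doc_id = ?");
            while (ids.next()) {
                long id = ids.getLong(1);
                load.setLong(1, id);
                ResultSet row = load.executeQuery();
                if (row.next()) {
                    writer.addDocument(RowToDocument.toDocument(row));
                }
                row.close();
                done.setLong(1, id);   // dequeue only after indexing
                done.executeUpdate();
            }
            writer.close();
            con.commit();   // the dequeues become durable only now
        }
    }

Note that the queue rows are deleted inside a transaction that is
committed only after the IndexWriter is closed, so a crash during
indexing leaves the references in place for the next run.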

In this way, the list of documents to index is "persistent" and you can
limit the size of your batch to any given value.

This is our basic solution for indexing "write only" tables. We had to
enhance this same approach to handle deletions too, but the idea is
always the same: keep a separate list of the things to do to keep the
index in sync with the content of the DB.
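
For the deletion side, assuming a second queue of deleted primary keys
(filled by an on-delete trigger) and the stored "pk" field from the
first sketch, the Lucene part of a run could look like this:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteBatch {
        // Remove deleted records from the index by their stored
        // primary key; the IndexWriter must be closed meanwhile.
        public static void run(String indexDir, java.util.List deletedIds)
                throws Exception {
            IndexReader reader = IndexReader.open(indexDir);
            for (java.util.Iterator it = deletedIds.iterator();
                    it.hasNext();) {
                Term pk = new Term("pk", it.next().toString());
                reader.delete(pk);   // drops every doc with this pk
            }
            reader.close();
        }
    }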

Did I express myself clearly enough? Does this make any sense to you?

Regards,

Giulio Cesare Solaroli


> ----- Original Message -----
> From: "Giulio Cesare Solaroli" <slrgcsa@ibn-italy.com>
> To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
> Cc: "Bryan LaPlante" <blaplante@netwebapps.com>
> Sent: Tuesday, June 10, 2003 3:18 PM
> Subject: Re: Indexing a database
>
>
>> Hi,
>>
>> On Tuesday, Jun 10, 2003, at 22:03 Europe/Rome, Otis Gospodnetic 
>> wrote:
>>
>>> [...]
>>> On a side, somebody has already written a tool to make a Lucene index
>>> out of DB tables.  Why re-invent the wheel?
>>
>> we have built our own application to index the content of a DB.
>> Our application is not "general purpose", but is tailored exactly to
>> our current needs.
>> We have been able to replace Intermedia indexes with excellent
>> results, both in indexing and searching time.
>>
>> At the moment we have almost 3 million records indexed, with about
>> 40 thousand new documents inserted every day.
>>
>> If anybody is interested in a deeper description of how we achieved
>> these results, just let me know.
>> We used a mix of support tables / triggers on the DB side, so our
>> solution won't scale well into a general-purpose DB-to-Lucene
>> indexing engine, but someone could get some interesting ideas.
>>
>> As the full description would be quite lengthy, I don't want to
>> write it without anybody interested in reading it! ;-]
>> Otherwise, I would be very pleased to give something back to the
>> Lucene community, as Lucene helped us very much in settling our
>> search/indexing problems.
>>
>> Regards,
>>
>> Giulio Cesare Solaroli
>>
>> PS: no problem sharing our code, either.
>>
>>
>



