lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aad Nales <aad.na...@rotterdam-cs.com>
Subject Re: Configurable indexing of an RDBMS, has it been done before?
Date Mon, 07 Feb 2005 10:29:46 GMT
Yep,

This is how we do it.

We have a search.xml that maps database fields to search fields and a 
parameter part that describes the 'click for detailed result url' and 
the parameter names (based on the search fields). In this xml we also 
describe how the different fields should be stored we have for instance 
a number of large text fields for we use the unstored option.

The framework that we have build around has an element that we call 
detailer. This detailer creates a lucene Document with the fields as 
specified in the search.xml

To illustrate here is the code that specifies the detailer for a forum.


-------------------------------------------------- XML ---------------------
<documenttype id="FORUM" index="general" defaultfield="body">
<fields>
<field property="messageid" searchfield="messageid" type="unindexed" 
key="true"/>
<field property="instanceid" searchfield="instanceid" type="unindexed" />
<field property="subject" searchfield="title" type="split" maxwords="8" />
<field property="body" searchfield="default" type="split" maxwords="20" />
<field property="aka_username" searchfield="username" type="keyword" />
<field property="modifiedDateAsDate" searchfield="modifieddate" 
type="keyword" />
</fields>
<action uri="/forum/viewMessage.do" 
image="/htmlarea/images/cops_insert_threadlink.gif">
<parameter property="messageid" name="messageid"/>
<parameter property="instanceid" name="instanceid"/>
</action>
<analyzer 
classname="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</documenttype>
-------------------------------- END XML -------------------

Please note:

Messageid is the keyfield here when we search the index we use a 
combined TYPE + KEY id to filter out double hits on the same document 
(not unusual in for instance a long forum thread).

Per type of document we also specifify what picture to show in the 
result (image), and we specify in what index the result should be 
written and what the general search field is (if the user submits a 
query without search all and without a field specified).

We have added the 'split' key word which makes it possbile to search a 
long text but only store a bit in the resulting hit.

The reindex is pretty straightforward we build a series of detailers for 
all possible document types and we run through the database and call the 
right detailer from a HashMap.

We have not included the JDBC stuff since the application is always 
running in Tomcat-Struts and since we cache most of the database reads. 
(a completely differnt story).

Queries on new and changed records seem to only make sense if asked in a 
context of time. (Right?). We have not needed it yet. The mapping can be 
query from a singleton java class. (SearchConfiguration).

We are currently adding functionality to store 'user structured data' 
best imagined as user defined input forms that are described in XML and 
are then stored as XML in the database. We query these documents using 
Lucene. These documents end up in the same index but this is quite 
manageable by using specialized detailers. For these document the type 
is more important then for the 'normally' stored documents. For this 
latter situation the search logic assumes that the query is 
appropriately configured by the application.

I am not sure if this is the kind of solution that you are looking for, 
but everything we produce is 100% open source.

Cheers,
Aad


David Spencer wrote:

> Many times I've written ad-hoc code that pulls in data from an RDBMS 
> and builds a Lucene index. The use case is a typical database-driven 
> dynamic website which would be a hassle to spider (say, due to tricky 
> authentication).
>
> I had a feeling this had been done in a general manner but didn't see 
> any code in the sandbox, nor did any searches turn it up.
>
> I've spent a few mins thinking this thru - what I'd expect is to be 
> able to configure is:
>
> 1. JDBC Driver + conn params
> 2. Query to do a 1 time full index
> 3. Query to show new records
> 4. Query to show changed records
> 5. Query to show deleted records
> 6. Query columns to Lucene Field name mapping
> 7. "Type" of each field name (e.g. the equivalent of the args to the 
> Field ctr)
>
> So a simple example, taking item 2 is
>
> query: "select url, name, body from foo"
>
> (now the column to field mapping)
> col 1 => url
> col 2 => title
> col 3 => contents
>
> (now the field types for each named field)
>
> url => Field( ...store=true, index=false)
> title => Field( ...store=true, index=true)
> contents => Field( ...store=false, index=true)
>
>
>
> And voilla, nice, elegant, data driven indexing.
> Does it exist?
> Should it? :)
>
> PS
> I know in the more general form, "query" needs to be replaced by 
> "queries" above, and the updated query may need some time stamp 
> variable expansion, and possibly the queries need "paging" to deal w/ 
> lamo DBs like mysql that don't have cursors for large result sets...
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message