lucene-java-user mailing list archives

From David Spencer
Subject Re: Configurable indexing of an RDBMS, has it been done before?
Date Mon, 07 Feb 2005 19:06:06 GMT
Nice, very similar to what I was thinking of, where the most significant 
difference is probably just that I was thinking of a batch indexer, not 
one embedded in a web container. Probably a worthwhile contribution to 
the sandbox.

Aad Nales wrote:
> Yep,
> This is how we do it.
> We have a search.xml that maps database fields to search fields and a 
> parameter part that describes the 'click for detailed result url' and 
> the parameter names (based on the search fields). In this XML we also 
> describe how the different fields should be stored; for instance, we have 
> a number of large text fields for which we use the unstored option.
> The framework that we have built around it has an element that we call 
> a detailer. This detailer creates a Lucene Document with the fields as 
> specified in the search.xml.
> To illustrate, here is the code that specifies the detailer for a forum.
> -------------------------------- XML -------------------
> <documenttype id="FORUM" index="general" defaultfield="body">
>   <fields>
>     <field property="messageid" searchfield="messageid" type="unindexed" key="true"/>
>     <field property="instanceid" searchfield="instanceid" type="unindexed" />
>     <field property="subject" searchfield="title" type="split" maxwords="8" />
>     <field property="body" searchfield="default" type="split" maxwords="20" />
>     <field property="aka_username" searchfield="username" type="keyword" />
>     <field property="modifiedDateAsDate" searchfield="modifieddate" type="keyword" />
>   </fields>
>   <action uri="/forum/" image="/htmlarea/images/cops_insert_threadlink.gif">
>     <parameter property="messageid" name="messageid"/>
>     <parameter property="instanceid" name="instanceid"/>
>   </action>
>   <analyzer classname="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> </documenttype>
> -------------------------------- END XML -------------------
> Please note:
> Messageid is the key field here. When we search the index we use a 
> combined TYPE + KEY id to filter out double hits on the same document 
> (not unusual in, for instance, a long forum thread).
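A minimal sketch of the TYPE + KEY filtering described above, not the original code. The `HitDeduper` and `Hit` names are hypothetical, and a hit is reduced to three plain strings so the example runs without Lucene on the classpath. It assumes the first hit in ranking order is the one to keep.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HitDeduper {
    /** Hypothetical stand-in for a search hit: document type, key field value, title. */
    static final class Hit {
        final String type;
        final String key;
        final String title;

        Hit(String type, String key, String title) {
            this.type = type;
            this.key = key;
            this.title = title;
        }
    }

    /** Keeps only the first hit for each TYPE + KEY pair, preserving ranking order. */
    static List<Hit> dedupe(List<Hit> hits) {
        Set<String> seen = new HashSet<>();
        List<Hit> unique = new ArrayList<>();
        for (Hit hit : hits) {
            // Assumption: the type name never contains ':', so the combined id is unambiguous.
            String combinedId = hit.type + ":" + hit.key;
            if (seen.add(combinedId)) {
                unique.add(hit);
            }
        }
        return unique;
    }
}
```

Two hits with the same key but different types are both kept, since the type is part of the combined id.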
> Per type of document we also specify what picture to show in the 
> result (image), and we specify in what index the result should be 
> written and what the general search field is (if the user submits a 
> query without search all and without a field specified).
> We have added the 'split' type, which makes it possible to search a 
> long text but only store a bit in the resulting hit.
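One way to read the 'split' type, sketched under assumptions: the full text is indexed but not stored, and only the first `maxwords` words are stored for display. The message does not say which part of the text is kept, so "the first words" is a guess, and `SplitField.excerpt` is a hypothetical helper name. In Lucene 1.4 terms this might pair a `Field.UnStored` for the full text with a `Field.UnIndexed` for the excerpt, though the original implementation is not shown.

```java
import java.util.Arrays;

final class SplitField {
    /**
     * Returns at most maxWords leading words of text, joined by single spaces.
     * This is the part that would be stored with the hit; the full text would
     * be indexed separately.
     */
    static String excerpt(String text, int maxWords) {
        String trimmed = text.trim();
        if (trimmed.isEmpty() || maxWords <= 0) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        int count = Math.min(words.length, maxWords);
        return String.join(" ", Arrays.copyOfRange(words, 0, count));
    }
}
```

With `maxwords="8"` as in the forum configuration, a long subject line would be searchable in full while only its first eight words are kept for the result list.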
> The reindex is pretty straightforward: we build a series of detailers for 
> all possible document types, run through the database, and call the 
> right detailer from a HashMap.
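The reindex loop above could look roughly like this. It is a sketch, not the framework's code: the `Reindexer` class, the `Detailer` interface, and the `"type"` row entry are all hypothetical, and a document is modelled as a plain `Map` of field names to values rather than a Lucene `Document`, to keep the example free of dependencies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Reindexer {
    /** Hypothetical detailer: turns one database row into field name -> value pairs. */
    interface Detailer {
        Map<String, String> toDocument(Map<String, Object> row);
    }

    // One detailer per document type, e.g. "FORUM".
    private final Map<String, Detailer> detailers = new HashMap<>();

    void register(String documentType, Detailer detailer) {
        detailers.put(documentType, detailer);
    }

    /** Builds one document per row, dispatching on the row's "type" entry. */
    List<Map<String, String>> reindex(List<Map<String, Object>> rows) {
        List<Map<String, String>> documents = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Detailer detailer = detailers.get(String.valueOf(row.get("type")));
            if (detailer == null) {
                continue; // no detailer configured for this type, so skip the row
            }
            documents.add(detailer.toDocument(row));
        }
        return documents;
    }
}
```

Skipping rows with an unknown type is a design choice made here for brevity; a real indexer might prefer to log or fail instead.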
> We have not included the JDBC stuff since the application is always 
> running in Tomcat-Struts and since we cache most of the database reads 
> (a completely different story).
> Queries on new and changed records seem to only make sense if asked in a 
> context of time. (Right?) We have not needed it yet. The mapping can be 
> queried from a singleton Java class (SearchConfiguration).
> We are currently adding functionality to store 'user structured data' 
> best imagined as user defined input forms that are described in XML and 
> are then stored as XML in the database. We query these documents using 
> Lucene. These documents end up in the same index but this is quite 
> manageable by using specialized detailers. For these documents the type 
> is more important than for the 'normally' stored documents. For this 
> latter situation the search logic assumes that the query is 
> appropriately configured by the application.
> I am not sure if this is the kind of solution that you are looking for, 
> but everything we produce is 100% open source.
> Cheers,
> Aad
> David Spencer wrote:
>> Many times I've written ad-hoc code that pulls in data from an RDBMS 
>> and builds a Lucene index. The use case is a typical database-driven 
>> dynamic website which would be a hassle to spider (say, due to tricky 
>> authentication).
>> I had a feeling this had been done in a general manner but didn't see 
>> any code in the sandbox, nor did any searches turn it up.
>> I've spent a few mins thinking this thru - what I'd expect to be 
>> able to configure is:
>> 1. JDBC Driver + conn params
>> 2. Query to do a 1 time full index
>> 3. Query to show new records
>> 4. Query to show changed records
>> 5. Query to show deleted records
>> 6. Query columns to Lucene Field name mapping
>> 7. "Type" of each field name (e.g. the equivalent of the args to the 
>> Field ctr)
>> So a simple example, taking item 2 is
>> query: "select url, name, body from foo"
>> (now the column to field mapping)
>> col 1 => url
>> col 2 => title
>> col 3 => contents
>> (now the field types for each named field)
>> url => Field(, index=false)
>> title => Field(, index=true)
>> contents => Field(, index=true)
>> And voilà, nice, elegant, data driven indexing.
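The column-to-field example above can be sketched as a small mapping step. Everything named here (`ColumnMapper`, `FieldSpec`, `MappedField`) is hypothetical, and the row is passed in as a list of strings in column order; in practice the values would presumably come from a JDBC `ResultSet` via `getString(i)`, and the `indexed` flag would decide which kind of Lucene field gets created.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnMapper {
    /** Hypothetical config entry: target field name and whether it is indexed. */
    static final class FieldSpec {
        final String name;
        final boolean indexed;

        FieldSpec(String name, boolean indexed) {
            this.name = name;
            this.indexed = indexed;
        }
    }

    /** One mapped value, ready to be handed to whatever builds the index document. */
    static final class MappedField {
        final String name;
        final String value;
        final boolean indexed;

        MappedField(String name, String value, boolean indexed) {
            this.name = name;
            this.value = value;
            this.indexed = indexed;
        }
    }

    /** Pairs each column value with the field spec at the same position. */
    static List<MappedField> mapRow(List<FieldSpec> specs, List<String> row) {
        if (specs.size() != row.size()) {
            throw new IllegalArgumentException(
                "expected " + specs.size() + " columns, got " + row.size());
        }
        List<MappedField> fields = new ArrayList<>();
        for (int i = 0; i < specs.size(); i++) {
            FieldSpec spec = specs.get(i);
            fields.add(new MappedField(spec.name, row.get(i), spec.indexed));
        }
        return fields;
    }
}
```

For the query "select url, name, body from foo", the specs would be url (not indexed), title (indexed), and contents (indexed), in that order.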
>> Does it exist?
>> Should it? :)
>> PS
>> I know in the more general form, "query" needs to be replaced by 
>> "queries" above, and the updated query may need some time stamp 
>> variable expansion, and possibly the queries need "paging" to deal w/ 
>> lamo DBs like mysql that don't have cursors for large result sets...
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
