Return-Path: Delivered-To: apmail-jakarta-lucene-user-archive@www.apache.org Received: (qmail 36954 invoked from network); 7 Feb 2005 10:29:54 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur-2.apache.org with SMTP; 7 Feb 2005 10:29:54 -0000 Received: (qmail 9542 invoked by uid 500); 7 Feb 2005 10:29:52 -0000 Delivered-To: apmail-jakarta-lucene-user-archive@jakarta.apache.org Received: (qmail 8941 invoked by uid 500); 7 Feb 2005 10:29:50 -0000 Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Users List" Reply-To: "Lucene Users List" Delivered-To: mailing list lucene-user@jakarta.apache.org Received: (qmail 8928 invoked by uid 99); 7 Feb 2005 10:29:50 -0000 X-ASF-Spam-Status: No, hits=0.1 required=10.0 tests=FORGED_RCVD_HELO X-Spam-Check-By: apache.org Received-SPF: neutral (hermes.apache.org: local policy) Received: from otto.isd-holland.nl (HELO otto2.isd-holland.nl) (62.221.254.21) by apache.org (qpsmtpd/0.28) with ESMTP; Mon, 07 Feb 2005 02:29:50 -0800 Received: from [192.168.0.50] (dsl-82-199-135-164.dutchdsl.nl [82.199.135.164]) by otto2.isd-holland.nl (Postfix) with ESMTP id 0AEDF3640C5 for ; Mon, 7 Feb 2005 11:29:52 +0100 (CET) Message-ID: <4207431A.2040202@rotterdam-cs.com> Date: Mon, 07 Feb 2005 11:29:46 +0100 From: Aad Nales User-Agent: Mozilla Thunderbird 1.0 (X11/20041206) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Lucene Users List Subject: Re: Configurable indexing of an RDBMS, has it been done before? References: <420721D0.6060906@tropo.com> In-Reply-To: <420721D0.6060906@tropo.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N Yep, This is how we do it. We have a search.xml that maps database fields to search fields and a parameter part that describes the 'click for detailed result url' and the parameter names (based on the search fields). In this xml we also describe how the different fields should be stored we have for instance a number of large text fields for we use the unstored option. The framework that we have build around has an element that we call detailer. This detailer creates a lucene Document with the fields as specified in the search.xml To illustrate here is the code that specifies the detailer for a forum. -------------------------------------------------- XML --------------------- -------------------------------- END XML ------------------- Please note: Messageid is the keyfield here when we search the index we use a combined TYPE + KEY id to filter out double hits on the same document (not unusual in for instance a long forum thread). Per type of document we also specifify what picture to show in the result (image), and we specify in what index the result should be written and what the general search field is (if the user submits a query without search all and without a field specified). We have added the 'split' key word which makes it possbile to search a long text but only store a bit in the resulting hit. The reindex is pretty straightforward we build a series of detailers for all possible document types and we run through the database and call the right detailer from a HashMap. We have not included the JDBC stuff since the application is always running in Tomcat-Struts and since we cache most of the database reads. (a completely differnt story). Queries on new and changed records seem to only make sense if asked in a context of time. (Right?). We have not needed it yet. The mapping can be query from a singleton java class. (SearchConfiguration). We are currently adding functionality to store 'user structured data' best imagined as user defined input forms that are described in XML and are then stored as XML in the database. We query these documents using Lucene. These documents end up in the same index but this is quite manageable by using specialized detailers. For these document the type is more important then for the 'normally' stored documents. For this latter situation the search logic assumes that the query is appropriately configured by the application. I am not sure if this is the kind of solution that you are looking for, but everything we produce is 100% open source. Cheers, Aad David Spencer wrote: > Many times I've written ad-hoc code that pulls in data from an RDBMS > and builds a Lucene index. The use case is a typical database-driven > dynamic website which would be a hassle to spider (say, due to tricky > authentication). > > I had a feeling this had been done in a general manner but didn't see > any code in the sandbox, nor did any searches turn it up. > > I've spent a few mins thinking this thru - what I'd expect is to be > able to configure is: > > 1. JDBC Driver + conn params > 2. Query to do a 1 time full index > 3. Query to show new records > 4. Query to show changed records > 5. Query to show deleted records > 6. Query columns to Lucene Field name mapping > 7. "Type" of each field name (e.g. the equivalent of the args to the > Field ctr) > > So a simple example, taking item 2 is > > query: "select url, name, body from foo" > > (now the column to field mapping) > col 1 => url > col 2 => title > col 3 => contents > > (now the field types for each named field) > > url => Field( ...store=true, index=false) > title => Field( ...store=true, index=true) > contents => Field( ...store=false, index=true) > > > > And voilla, nice, elegant, data driven indexing. > Does it exist? > Should it? :) > > PS > I know in the more general form, "query" needs to be replaced by > "queries" above, and the updated query may need some time stamp > variable expansion, and possibly the queries need "paging" to deal w/ > lamo DBs like mysql that don't have cursors for large result sets... > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org > For additional commands, e-mail: lucene-user-help@jakarta.apache.org > > > --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-user-help@jakarta.apache.org