lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Monsur Hossain" <mon...@monsur.com>
Subject RE: Splitting index into indexed fields and stored fields for performance
Date Wed, 11 May 2005 06:00:09 GMT


> -----Original Message-----
> From: Chris Lamprecht [mailto:clamprecht@gmail.com] 
> Sent: Thursday, April 28, 2005 7:53 PM
>
> Since the "stored fields" index would basically just be a 
> database, perhaps this is better served using a traditional 
> relational database (or even use the OS's file system).  
> However, a traditional database doesn't know what the docids 
> are (and docids change over time), -- may be one could store 
> a single stored field in the "search index", containing a 
> permanent identifier for the DB lookup.


Is it too late to revisit this?  I just ran across it today and
conicidentally have been thinking about the same issue.  I think you've
hit on a great point.

Lucene is optimized for searching, while a relational database is
optimized for ID lookups, so we're trying the scheme you've described
above.  We are doing this to strike a right balance between Lucene index
size (keeping it small) and performance (keeping db hits to a minimum).

We are developing a web paging application, which only displays 10
results at a time.  Suppose in Lucene we have WeblogID indexed as a
keyword and WeblogContent indexed as UnStored.  We completely ignore
Lucene's internal docid.  The searching process would look like this:

1) Search and get back a Hits collection

2) Loop through the current page of results, grabbing the WeblogIDs for
the 10 results to be displayed.

3) Incorporate those WeblogIDs into an sql query like so: SELECT
WeblogContent, OtherFields FROM Weblogs, OtherTables WHERE WeblogID IN
(ID1, ID2, ID3...).  

4) Use the results of this sql query to format the output.

An IN query with 10 parameters will be fast, especially if there's an
index on those IDs.  This works will since I'm paging 10 results at a
time; a sql query with 1000 parameters wouldn't be as friendly.  You'll
also notice the query above does a join; when the data is decoupled from
the Lucene index, you're free to grab it from various sources (which may
or may not be available at index time).  This method gives you that
flexibility.

Monsur


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message