lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: What *is* a lucene document?
Date Sun, 05 Jun 2005 10:21:36 GMT

On Jun 5, 2005, at 1:11 AM, Phillip Rhodes wrote:
> I understand that  "Documents are the primary retrievable units  
> from a Lucene query"  But I don't know if I want to have 12  
> documents in the lucene index that represent the same business  
> object, or if I should place 12 different business documents within  
> the lucene index.

Deciding how to slice a domain into Documents is one of the most  
important decisions you make when using Lucene, and not one that  
Lucene itself answers.  There are established precedents, and users  
here can offer advice, but ultimately how to represent your domain  
in Lucene is up to you.

> Here is the background:
> I want to index a product catalog (some data in database and some  
> data on the filesystem, I have cross-reference between the two).
> Each product is associated to attributes, categories and one or  
> more PDF/MS Word documents, HTML descriptions, images, etc...
> A product could have 12 different files associated to it.
> Is it okay if I create as many documents as assets that I want to  
> return from a search and add information to each document tying it  
> back to the product that it is associated with?  Is that the right  
> approach?

Do users of your search system need to know about the PDF/Word/HTML  
documents?  Or should they simply know about "products"?  If all you  
need back is the product, then the simplest approach would be to  
create one Lucene Document per product, parse all the files and data  
associated with it, and add the extracted text to fields.  If the  
search system is simple enough that fielded search is not needed,  
create just two fields per Document: id and text.  Field "id" is the  
product id, and "text" is an aggregation of all the text associated  
with the product, regardless of where it came from (if you build it  
by string concatenation, be careful to put whitespace between the  
pieces so that words don't blur together).
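
As a rough sketch of the two-field idea, something like the following  
would build the "id" and "text" values for one product.  The class and  
method names here are hypothetical, not part of Lucene, and the fields  
are held in a plain Map so the example stands alone; in a real indexer  
you would add the same two values as fields on a Lucene Document using  
whatever field API your Lucene version provides.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; none of these names come from Lucene.
final class ProductDocSketch {

    // Join the text extracted from each asset with a single space, so the
    // last word of one asset and the first word of the next are not fused.
    static String aggregateText(List<String> assetTexts) {
        StringBuilder sb = new StringBuilder();
        for (String t : assetTexts) {
            if (t == null) continue;            // asset with no extracted text
            String trimmed = t.trim();
            if (trimmed.isEmpty()) continue;    // skip blank extractions
            if (sb.length() > 0) sb.append(' ');
            sb.append(trimmed);
        }
        return sb.toString();
    }

    // One entry per field of the product's single Document: "id" and "text".
    static Map<String, String> toFields(String productId, List<String> assetTexts) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", productId);
        fields.put("text", aggregateText(assetTexts));
        return fields;
    }
}
```

The assumption is that each PDF/Word/HTML asset has already been  
converted to plain text by some parser of your choosing before it  
reaches aggregateText.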

There are many other ways to approach this and my recommendation is  
just the simplest one based on the description of your needs.

