lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Using more than one index
Date Tue, 13 Jun 2006 06:54:51 GMT

A couple of suggestions...

1) don't use multiple indexes.  create one index, with one document per
"thing" you want to return (in this case it sounds like books) and index
all of the relevent data about each thing in that doc.  If multiple people
worked on a book, add all of their names to the same field.  addd all of
the dates to the book doc -- if you need to distibguish the differnet
types of dates, make a seaprete field for each type.

If you *must* cross refrence...

2) make sure you aren't useing the Hits API to iterate over all the
results when gathering IDs -- use a lower level api (like a HitCollector)

3) use the FieldCache to get the IDs instead of he stored Document fields.

4) don't extract full ID lists from all of then indexes and then search
on one of the indexes again with the ID list ... use the ID lists
generated from the supporting indexes (people and dates) to build a Filter
that you can use when searching the main index.



: Date: Mon, 12 Jun 2006 12:22:30 +0300
: From: Mile Rosu <mile.rosu@level7.ro>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Using more than one index
:
: Hello,
:
: We have an application dealing with historical books. The books have
: metadata consisting of event dates, and person names among others.
: The FullText, Person and Date indexes were split until we realized that
: for a larger number of documents (400K) the combination of the
: sequential search hits took a way too long time to complete (15 min).
: The date index was built using the suggestion found at:
: http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing (big
: thanks for the hint)
:
: Is there a recommended approach to combining results from different
: indexes (with different fields)?
:
: The indexes structure:
: MainIndex:
: 	Fields:
:       	@ID@ - keyword (document id)
: 		@FULLTEXT@ - tokenized (used for full text6 search)
: 		Ptitle - tokenized (used for full text publication title
: search)
: 		Dtitle - tokenized (used for full text document title
: search)
: 		Type - keyword - (used for document type)
:
: PersonIndex:
: 		@ID@ - keyword (document id == mainIndex.@ID@)
: 		Person - tokenized (full text person name search)
: DateIndex:
: 		@ID@ - keyword (document id == mainIdex.@ID@)
: 		Date - date as YYYYMMDD - keyword
: 		Type - type of date (document date, birth day, etc...)
: 		@YYYY@ - year of date
: 		@YYYYMM@ - year and month of date
: 		@DDD@ - decade
: 		@CC@ - century of date
:
:
: Eg:
: If I want to search for documents that contain: person "John", full text
: "book" and date: before 06/12/2005
: Step 1:  search in personIndex for John - retrieve all @ID@ from the hit
: list
: Step 2: search in DateIndex for documents that have dates before
: 06/12/2005 - retrieve id from the hit list
: Step 3: search in mainIndex for "book" - retrieve all @ID@
: Step 4: combine all the lists
: Step 5: search mainIndex for documents with the @ID@ from the combined
: id list
:
: Each search takes less then 1 second, but retrieving @ID@ from the index
: takes a lot more - the time increases by the number of hits. This is
: because when retrieving a field value from a document hit, the Lucene
: engine loads all the fields from the index (the entire document). So if
: in one search I get 300.000 hits cont, I have to iterate through all and
: retrieve the @ID@ field value - this takes a lot of time.
:
: Regards,
: Mile Rosu
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message