lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From 1world1love <jd_co...@yahoo.com>
Subject retrieve all docs efficiently - just one field
Date Tue, 10 Jun 2008 22:35:33 GMT

Greetings all. I have read many posts concerning similar use cases, but I am
still a little hazy on the best way to achieve what I need to do. Here is
the background:

2 million documents with multiple sections, some sections contain structured
data, some unstructured.

We parse the docs and place the structured stuff in oracle where each
section is a table and one master table to relate them all.

We index the unstructured sections with lucene where each section is a
document (meaning a total of about ~30 million documents) with extra fields
including one for the primary key of the master table and then some meta
fields to describe the section - type, date, etc.

For a common use case, say we have a table called demographics with a number
field that represents age (overly simplistic but gets the point across).

So say we want all people over the age of 50 who may have visited Panama: 

--
We have our lucene index and we want to search the section text for the word
"panama" 

AND

We want to select from the demographics table where age > 50.
--

Now I need to intersect the master table IDs from my lucene hits and my
table results. 

I have a java stored procedure that runs the lucene query and creates a
temporary table with a single column where I insert the master id from the
hits of my lucene query. I then can do a join with my structured query
results.

The problem here is obviously the speed of iterating through the hits to
extract the single field that I need.

Notes: 
- I must be able to get a full set of results, though I only need the one id
field
- We originally went with Oracle text which was simple, but limited and
quite slow for most queries


I have read a little about the hitcollector class and the fieldselector api,
but I am still not sure how they may help me or even if they can.

I have also tooled around with the idea of using termdocs, but the queries
may get a little complex with various ors/ands/nots, though probably not
spans and so forth.

Any suggestions will be greatly apreciated.

Thanks,

J

-- 
View this message in context: http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17766268.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message