lucene-solr-user mailing list archives

From "Mike O'Leary" <tmole...@uw.edu>
Subject SolrJ updating indexed documents?
Date Mon, 02 Apr 2012 20:07:00 GMT
I am working on a component for indexing documents from a database that contains medical records.
The information is organized across several tables, and I am supposed to index records for
patient sets of varying sizes so that others can run IR experiments on them. Each patient record
has one or more main documents associated with it, and each main document has zero or more
addenda associated with it. (The main documents and addenda are treated alike for the most
part, except for a parent record field that is null for main documents and holds the number
of a main document for addenda. Addenda cannot have addenda.) Also, each main document has
one or more diagnosis records. I am trying to figure out the best-performing way to select
all of the records for each patient: the main documents, the addenda and the diagnoses.
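
To make that concrete, the relevant part of the schema has roughly this shape (the table
and column names here are made up, not our real ones):

-- Rough shape of the relevant tables (names are placeholders):
CREATE TABLE documents (
  id            INT PRIMARY KEY,
  patient_id    INT NOT NULL,
  parent_record INT,           -- NULL for a main document; the id of its main
                               -- document for an addendum (addenda have no addenda)
  text          CLOB
);

CREATE TABLE diagnoses (
  doc_id INT NOT NULL,         -- the main document this diagnosis belongs to
  code   VARCHAR(20)
);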

I tried indexing sets of these records using DataImportHandler and nested entity blocks, in
a way similar to the Full Import example on the http://wiki.apache.org/solr/DataImportHandler
page: an outer select for all of the patients and main records in a data set, and nested
selects that get all of the addenda and all of the diagnoses for each main document. It
didn't run very fast, though. A database resource person who looked into it with me pointed
out that this issues a million SQL queries for addenda and another million for diagnoses,
one of each for the million patient documents in a typical set of 10,000 patients, which is
very inefficient, and suggested I look for a different way of getting the data.
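
For reference, my data-config.xml was shaped roughly like this (again with made-up table
and column names); the inner entities are what generate the per-document queries:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="..." url="..." user="..." password="..."/>
  <document>
    <!-- one outer query for all main documents in the patient set -->
    <entity name="doc"
            query="SELECT id, patient_id, text FROM documents WHERE parent_record IS NULL">
      <!-- each inner query runs once per row of the outer query -->
      <entity name="addendum"
              query="SELECT text AS addendum FROM documents WHERE parent_record = '${doc.id}'"/>
      <entity name="diagnosis"
              query="SELECT code AS diagnosis FROM diagnoses WHERE doc_id = '${doc.id}'"/>
    </entity>
  </document>
</dataConfig>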

I switched to using SolrJ, and I am trying to decide between two ways of indexing this data.
One would be to use one large SQL statement to get all of the data for a patient set. The
results would contain duplication, because of the way the tables are joined together, that
I would need to sort out in the Java code, but that is doable.
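
A sketch of what I mean, with made-up table, column and field names: the join repeats the
main document's columns on every addendum/diagnosis row, so the loop collapses rows by
document id and uses sets to drop the duplicates before sending anything to Solr.

import java.sql.*;
import java.util.*;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OneQueryIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    Connection db = DriverManager.getConnection("jdbc:...", "user", "password");
    Statement st = db.createStatement();
    ResultSet rs = st.executeQuery(
        "SELECT d.id, d.patient_id, d.text, a.text AS addendum, x.code AS diagnosis "
      + "FROM documents d "
      + "LEFT JOIN documents a ON a.parent_record = d.id "
      + "LEFT JOIN diagnoses x ON x.doc_id = d.id "
      + "WHERE d.parent_record IS NULL ORDER BY d.id");

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    SolrInputDocument doc = null;
    String currentId = null;
    Set<String> addenda = new LinkedHashSet<String>();    // sets absorb the
    Set<String> diagnoses = new LinkedHashSet<String>();  // join duplication
    while (rs.next()) {
      String id = rs.getString("id");
      if (!id.equals(currentId)) {             // first row of a new main document
        flush(doc, addenda, diagnoses, batch);
        doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("patient_id", rs.getString("patient_id"));
        doc.addField("text", rs.getString("text"));
        currentId = id;
      }
      if (rs.getString("addendum") != null) addenda.add(rs.getString("addendum"));
      if (rs.getString("diagnosis") != null) diagnoses.add(rs.getString("diagnosis"));
      if (batch.size() >= 1000) { solr.add(batch); batch.clear(); }
    }
    flush(doc, addenda, diagnoses, batch);
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();
  }

  // Attach the collected addenda and diagnoses to a completed document
  // and queue it for indexing.
  static void flush(SolrInputDocument doc, Set<String> addenda,
                    Set<String> diagnoses, List<SolrInputDocument> batch) {
    if (doc == null) return;
    for (String a : addenda) doc.addField("addendum", a);
    for (String d : diagnoses) doc.addField("diagnosis", d);
    addenda.clear();
    diagnoses.clear();
    batch.add(doc);
  }
}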

The other way would be to:

1. Get all of the main document data with one SQL query, create index documents from it,
and store them in the index.

2. Issue another SQL query that gets all of the addenda for all of the patients in the data
set, along with an id number that tells which main document each addendum belongs to;
retrieve those main documents from the index, add the addenda fields to them, and put them
back in the index (sketched below).

3. Do the same with the diagnosis data.
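
For step 2, the only way I can see to "update" a document with SolrJ is to query its stored
fields back out, rebuild it, and re-add it, since re-adding a document with the same
uniqueKey replaces the old one. Roughly like this (field names are made up, and it assumes
every field in the schema is stored, because Solr can only return stored fields):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

class IndexUpdater {
  // Read one main document back from the index, append one field value, and
  // re-add it under the same uniqueKey, which replaces the old version.
  static void appendField(SolrServer solr, String docId, String field, String value)
      throws Exception {
    SolrDocument stored =
        solr.query(new SolrQuery("id:" + docId)).getResults().get(0);
    SolrInputDocument updated = new SolrInputDocument();
    for (String name : stored.getFieldNames()) {
      updated.addField(name, stored.getFieldValue(name));
    }
    updated.addField(field, value);
    solr.add(updated);
  }
}
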
It would be great to be able to keep the main document data retrieved from the database in
a hash table, update each of those objects with its addenda and diagnoses, and write
completely filled-out documents to the index just once. But I don't have enough memory
available to do this for the patient sets I am working with now, and they want this indexing
process to scale up to patient sets that are ten times as large, and eventually much larger
than that.
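
In other words, what I can't afford memory-wise is roughly this (made-up names again):

import java.util.*;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

class InMemoryIndexer {
  // Build every document completely in memory, then write each one exactly once.
  // Fine for small patient sets; the map is what blows past available memory.
  static void indexInOnePass(SolrServer solr) throws Exception {
    Map<String, SolrInputDocument> docsById = new HashMap<String, SolrInputDocument>();
    // pass 1: fill docsById from the main-document query
    // pass 2: for each addendum row:  docsById.get(parentId).addField("addendum", text);
    // pass 3: for each diagnosis row: docsById.get(docId).addField("diagnosis", code);
    solr.add(docsById.values());
    solr.commit();
  }
}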

Essentially, for the second approach, I am wondering whether a Lucene index can be made to
serve as a hash table for storing intermediate results, and whether SolrJ has an API for
retrieving individual index documents so they can be updated (as in the step 2 sketch
above). Basically it would be shifting from iterating over SQL queries to iterating over
Lucene index updates. If this way of doing things is also likely to be slow, or the SolrJ
API doesn't provide a way to do it, or there are other problems with it, I can go with
selecting all of the data in one large query and dealing with the duplication.
Thanks,
Mike
