From: "Burton-West, Tom" <tburtonw@umich.edu>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Date: Fri, 15 Oct 2010 14:59:26 -0400
Subject: RE: filter query from external list of Solr unique IDs

Hi Jonathan,

The advantages of the obvious approach you outline are that it is simple,
it fits into the existing Solr model, and it doesn't require any
customization or modification of the Solr/Lucene Java code. Unfortunately,
it does not scale well. We originally tried just what you suggest for our
implementation of Collection Builder. For a user's personal collection we
had a table that maps the collection id to the unique Solr ids.
Then, when they wanted to search their collection, we just took their
search and added a filter query of the form fq=(id:1 OR id:2 OR ....). I
seem to remember running into a limit on the number of OR clauses allowed.
Even if you can set that limit larger, there are a number of efficiency
issues.
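
A minimal Python sketch of how such a filter query gets assembled, and
where the clause limit comes in. The function names are hypothetical, and
the limit of 1024 is an assumption based on the usual default of Solr's
maxBooleanClauses setting; check solrconfig.xml for the actual value.

```python
from urllib.parse import urlencode

# Assumption: the common default for Solr's maxBooleanClauses setting.
ASSUMED_MAX_BOOLEAN_CLAUSES = 1024


def build_id_filter(ids, field="id"):
    """Return a filter query string such as '(id:1 OR id:2)'."""
    clauses = [f"{field}:{doc_id}" for doc_id in ids]
    return "(" + " OR ".join(clauses) + ")"


def exceeds_clause_limit(ids, limit=ASSUMED_MAX_BOOLEAN_CLAUSES):
    """True when one OR clause per id would exceed the clause limit."""
    return len(ids) > limit


# A small personal collection: one OR clause per unique Solr id.
collection_ids = [1, 2, 3]
fq = build_id_filter(collection_ids)

# The filter travels as an ordinary request parameter next to the search.
params = urlencode({"q": "whale", "fq": fq})
```

The query string grows linearly with the size of the collection, so a
collection of a few thousand documents already exceeds the assumed limit.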

We ended up constructing a separate Solr index where we have a
multi-valued collection number field. Unfortunately, until incremental
field updating gets implemented, this means that every time someone adds a
document to a collection, the entire document (including 700KB of OCR)
needs to be re-indexed just to update the collection number field. This
approach has allowed us to scale up to a total of something under 100,000
documents, but we don't think we can scale it much beyond that for various
reasons.

I was actually thinking of some kind of custom Lucene/Solr component that
would, for example, take a query parameter such as &lookitUp=123. The
component might do a JDBC query against a database or key-value store and
return results in some form that would be efficient for Solr/Lucene to
process. (Of course, this assumes that a JDBC query would be more efficient
than just sending a long list of ids to Solr.) The other part of the
equation is mapping the unique Solr ids to internal Lucene ids in order to
implement a filter query. I was wondering if something like the unique id
to Lucene id mapper in zoie might be useful, or if that is too specific to
zoie. This may be totally off-base, since I haven't looked at the zoie code
at all yet.
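
The lookup half of that idea can be sketched in Python, with an in-memory
sqlite3 database standing in for the JDBC source. The table name, column
names, and function names are all hypothetical; a real component would
live inside Solr as Java code.

```python
import sqlite3


def create_collection_store():
    """Build a toy table mapping collection ids to unique Solr ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collection_items (collection_id INTEGER, solr_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO collection_items VALUES (?, ?)",
        [(123, "doc-a"), (123, "doc-c"), (456, "doc-b")],
    )
    return conn


def look_up_solr_ids(conn, collection_id):
    """Return the unique Solr ids for one collection, e.g. for lookitUp=123."""
    rows = conn.execute(
        "SELECT solr_id FROM collection_items "
        "WHERE collection_id = ? ORDER BY solr_id",
        (collection_id,),
    )
    return [row[0] for row in rows]


conn = create_collection_store()
solr_ids = look_up_solr_ids(conn, 123)
```

With this arrangement the request carries only the collection key, and the
id list never has to be serialized into the query string.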

In our particular use case, we might be able to build some kind of
in-memory map after we optimize an index and before we mount it in
production. In our workflow, we update the index and optimize it before we
release it, and once it is released to production there is no
indexing/merging taking place on the production index (so the internal
Lucene ids don't change).
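
A minimal sketch of such a map, assuming the index is frozen so that each
document's position can stand in for its internal Lucene id. The function
names are hypothetical, and a Python integer is used as a simple bitset in
place of whatever structure a real Lucene filter would use.

```python
def build_id_map(unique_ids_in_index_order):
    """Map each unique Solr id to its position, standing in for the Lucene id."""
    return {uid: doc_id for doc_id, uid in enumerate(unique_ids_in_index_order)}


def build_filter_bits(id_map, solr_ids):
    """Return an int used as a bitset: bit n set means internal doc n passes."""
    bits = 0
    for uid in solr_ids:
        doc_id = id_map.get(uid)
        if doc_id is not None:  # ids missing from this index are skipped
            bits |= 1 << doc_id
    return bits


def passes_filter(bits, doc_id):
    """True when the given internal doc id is set in the bitset."""
    return (bits >> doc_id) & 1 == 1


# Built once, after optimizing and before the index goes to production.
id_map = build_id_map(["doc-a", "doc-b", "doc-c", "doc-d"])

# Built per request from the ids belonging to the user's collection.
bits = build_filter_bits(id_map, ["doc-a", "doc-c", "doc-missing"])
```

The map is only valid while the internal ids stay fixed, so it would have
to be rebuilt for every newly released index.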

Tom


-----Original Message-----
From: Jonathan Rochkind [mailto:rochkind@jhu.edu]
Sent: Friday, October 15, 2010 1:07 PM
To: solr-user@lucene.apache.org
Subject: RE: filter query from external list of Solr unique IDs

Definitely interested in this.

The naive obvious approach would be just putting all the ID's in the query.
Like fq=(id:1 OR id:2 OR....).  Or making it another clause in the 'q'.

Can you outline what's wrong with this approach, to make it more clear
what's needed in a solution?
________________________________________