Return-Path: X-Original-To: apmail-lucene-solr-user-archive@minotaur.apache.org Delivered-To: apmail-lucene-solr-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D098C1029C for ; Mon, 22 Apr 2013 08:54:59 +0000 (UTC) Received: (qmail 40623 invoked by uid 500); 22 Apr 2013 08:54:55 -0000 Delivered-To: apmail-lucene-solr-user-archive@lucene.apache.org Received: (qmail 40578 invoked by uid 500); 22 Apr 2013 08:54:55 -0000 Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: solr-user@lucene.apache.org Delivered-To: mailing list solr-user@lucene.apache.org Received: (qmail 40562 invoked by uid 99); 22 Apr 2013 08:54:55 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 Apr 2013 08:54:55 +0000 X-ASF-Spam-Status: No, hits=0.3 required=5.0 tests=MSGID_MULTIPLE_AT,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of maciej.lizewski@3e.pl designates 62.181.5.76 as permitted sender) Received: from [62.181.5.76] (HELO mx.3e.pl) (62.181.5.76) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 Apr 2013 08:54:48 +0000 Received: from krycekdell (83-144-77-250.static.chello.pl [83.144.77.250]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: maciej.lizewski@3e.pl) by mx.3e.pl (Postfix) with ESMTPSA id EF4C21FEC029 for ; Mon, 22 Apr 2013 10:54:27 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=3e.pl; s=dkim; t=1366620868; bh=EpY9lk6X8VFMztIOl+7zJpBYs9ODj3DbM6CM0xQ+aLQ=; h=From:To:References:In-Reply-To:Subject:Date: Content-Transfer-Encoding; b=ZrQX868Z9hpcPvIpyLg+bFzHw8LwYynXMZsQhUmmU1N24Skvs20glUr1m+WO2w1L0 K1IKHeIw5mS+Z5sKbMNtlKAAQ+ENYL6PHJfYKqv7POc8oS26XVR3WtZ2xU3s4tCGfc B3uS9k/+Uf6mkYV9EUlhZQ+jrrfbAxh9aIu8yb6o= From: =?UTF-8?Q?Maciej_Li=C5=BCewski?= To: References: <5171bf7e.2a87440a.1b1a.7114SMTPIN_ADDED_BROKEN@mx.google.com> In-Reply-To: Subject: RE: external values source Date: Mon, 22 Apr 2013 10:54:34 +0200 Message-ID: <015a01ce3f37$03415ed0$09c41c70$@lizewski@3e.pl> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Office Outlook 12.0 Thread-Index: Ac49WmWG+UAcHIQeSX+P+MW9ujh8AwB2tX0A Content-Language: pl X-Virus-Checked: Checked by ClamAV on apache.org Hi Timothy, Thank you for your answer - it is really helpful. Just to clarify - when = using ValueSource then flow is something like this: - user sends query - solr calls ValueSource to prepare values for every document (this part = is cached in ExternalFileField implementation I guess) - solr runs query And above flow is valid in every use case of ValueSource? There are no = pre-calculated values, etc (just asking to make it clear)? What caching = scenario is recommended here to make sure you won't end up with = different cached entry for every query (I think I would follow the = example of ExternalFileField)? Another thing is that in most cases array of values created in this = process is rather sparse.. so I was thinking if there are no other = solutions to store them with associatnion to documents index... -- Maciej Li=C5=BCewski -----Original Message----- From: Timothy Potter [mailto:thelabdude@gmail.com]=20 Sent: Saturday, April 20, 2013 2:02 AM To: solr-user@lucene.apache.org Subject: Re: external values source Hi Maciek, I think a custom ValueSource is definitely what you want because you = need to compute some derived value based on an indexed field and some = external value. The trick is figuring how to make the lookup to the external data very, = very fast. Here's a rough sketch of what we do: We have a table in a database that contains a numeric value for a user = and an organization, such as query: select num from table where userId=3D'bob' and orgId=3D123 (similar to = what you stated in question #4) On the Solr side, documents are indexed with user_id_s field, which is = half of what I need to do my lookup. The orgId is determined by the Solr = client at query construction time, so is passed to my custom ValueSource = (aka function) in the query. In our app, users can be associated with = many different orgIds and changes frequently so we can't index the = association. To do the lookup to the database, we have a custom ValueSource, = something like: dbLookup(user_id_s, 123) (note: user_id_s is the name of the field holding my userID values in = the index and 123 is the orgId) Behind the scenes, the ValueSource will have access to the user_id_s = field values using FieldCache, something like: final BinaryDocValues dv =3D FieldCache.DEFAULT.getTerms(reader.reader(), "user_id_s"); This gives us fast access to the user_id_s value for any given doc = (question #1 above) So now we can return an IntDocValues instance by doing: @Override public FunctionValues getValues(Map context, AtomicReaderContext reader) throws IOException { final BytesRef br =3D new BytesRef(); final BinaryDocValues dv =3D FieldCache.DEFAULT.getTerms(reader.reader(), fieldName); return new IntDocValues(this) { @Override public int intVal(int doc) { dv.get(doc,br); if (br.length =3D=3D 0) return 0; final String user_id_s =3D br.utf8ToString(); // the = indexed userID for doc int val =3D 0; // todo: do custom lookup with orgID and user_id_s to = compute int value for doc return val; } } ... } In this code, fieldName is set in the constructor (not shown) by parsing = it out of the parameters, something like: this.fieldName =3D ((org.apache.solr.schema.StrFieldSource)source).getField(); The user_id_s field comes into your ValueSource as a StrFieldSource (or = whatever type you use) ... here is how the ValueSource gets constructed = at query time: public class MyValueSourceParser extends ValueSourceParser { public void init(NamedList namedList) {} public ValueSource parse(FunctionQParser fqp) throws SyntaxError { return new MyValueSource(fqp.parseValueSource(), = fqp.parseArg()); } } There is one instance of your ValueSourceParser created per core. The = parse method gets called for every query that uses the ValueSource. At query time, I might use the ValueSource to return this computed value = in my fl list, such as: fl=3Did,looked_up:dbLookup(user_id_l,123),... Or to sort by: sort=3DdbLookup(user_id_s,123) desc The data in our table doesn't change that frequently, so we export it to = a flat file in S3 and our custom ValueSource downloads from S3, = transforms it into an in-memory HashMap for fast lookups. We thought = about just issuing a query to load the data from the db directly but we = have many nodes and the query is expensive and result set is large so we = didn't want to hammer our database with N Solr nodes querying for the = same data at roughly the same time. So we do it once and post the = compressed results to a shared location. The data in the table is = "sparse" as compared to the number of documents and userIds we have. We simply poll S3 for changes every few minutes, which is good enough = for us. This happens from many nodes in a large Solr Cloud cluster = running in EC2 so S3 works well for us as a distribution mechanism. Admittedly polling kind of sucks so we tried using Zookeeper to notify = our custom watchers when a znode changes but a ValueSource doesn't get = notified when a core is reloaded so we ended up having many weird issues = with Zookeeper watchers in our custom ValueSource. For example, new = ValueSourceParsers get created when a core is reloaded but the previous = instance doesn't get notified that it's going out of service. So this gives you an idea of how we load external data into a fast = lookup data structure in Solr (~question #2) When filtering, we use PostFilter to tell Solr that our filter is = expensive so should be applied last (after all other criteria have run), = something like: fq=3D{!frange l=3D2 u=3D8 cost=3D200 = cache=3Dfalse}dbLookup(user_id_s,123) This computes a function range query using our custom ValueSource but = tells Solr that it is expensive (cost >=3D 100) so apply it after all = other filters have been applied. http://yonik.wordpress.com/tag/post-filter/ Lastly, as for speed, the user_id_s field gets loaded into FieldCache = and the lookup happens in a HashMap, albeit a large one. We're pretty = happy with the performance of this function and the only rub is that our = lookup table has to fit in memory. Hope this helps or at least gives you some info for further questions. Cheers, Tim On Fri, Apr 19, 2013 at 4:03 PM, Maciej Li=C5=BCewski = wrote: > I need some explanation on how ValuesSource and related classes work. > > There are already implemented ExternalFileField, example on how to=20 > load data from database (=20 > = html> > http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-exte > rnal.h > tml) > > But they all fetch ALL data into memory which may consume large=20 > amounts of this resource. Also documents are referenced by 'doc' = integer value. > > > > My questions: > > 1) Is the 'doc' value pointing to document in whole index? If so = - how > to get value of such documents field (for example: field named 'id')? > > 2) Is there possibility to create ValuesSource, FieldType (or = similar > interface which will provide external data to sort and in query=20 > results) which will work only on some subset of documents and use=20 > external source capabilities to fetch document related data? > > 3) How does it all work (memory consumption, hashtable access = speed, > etc), when there is a lot of documents in index (tens of millions for=20 > example)? > > 4) Are there any other examples on loading external data from = database > (I want to have numerical 'rate' from simple table having two columns: > 'document unique key' string, 'rate' integer/float) which are not just = > proof of concept but real-life examples? > > > > Any help and hints appreciated > > TIA > > > > -- > > Maciek >