Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of maciej.lizewski@3e.pl
 designates 62.181.5.76 as permitted sender)
From: =?UTF-8?Q?Maciej_Li=C5=BCewski?= <maciej.lizewski@3e.pl>
To: <solr-user@lucene.apache.org>
References: <5171bf7e.2a87440a.1b1a.7114SMTPIN_ADDED_BROKEN@mx.google.com>
 <CAJt9Wngw3aDE+Fa6ZT1wUyhGRSBxqvD90BikrKG5onxkg57wjw@mail.gmail.com>
In-Reply-To: 
 <CAJt9Wngw3aDE+Fa6ZT1wUyhGRSBxqvD90BikrKG5onxkg57wjw@mail.gmail.com>
Subject: RE: external values source
Date: Mon, 22 Apr 2013 10:54:34 +0200
Message-ID: <015a01ce3f37$03415ed0$09c41c70$@lizewski@3e.pl>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Thread-Index: Ac49WmWG+UAcHIQeSX+P+MW9ujh8AwB2tX0A
Content-Language: pl

Hi Timothy,

Thank you for your answer - it is really helpful. Just to clarify - when =
using ValueSource then flow is something like this:
- user sends query
- solr calls ValueSource to prepare values for every document (this part =
is cached in ExternalFileField implementation I guess)
- solr runs query

And above flow is valid in every use case of ValueSource? There are no =
pre-calculated values, etc (just asking to make it clear)? What caching =
scenario is recommended here to make sure you won't end up with =
different cached entry for every query (I think I would follow the =
example of ExternalFileField)?

Another thing is that in most cases array of values created in this =
process is rather sparse.. so I was thinking if there are no other =
solutions to store them with associatnion to documents index...

--
Maciej Li=C5=BCewski


-----Original Message-----
From: Timothy Potter [mailto:thelabdude@gmail.com]=20
Sent: Saturday, April 20, 2013 2:02 AM
To: solr-user@lucene.apache.org
Subject: Re: external values source

Hi Maciek,

I think a custom ValueSource is definitely what you want because you =
need to compute some derived value based on an indexed field and some =
external value.

The trick is figuring how to make the lookup to the external data very, =
very fast. Here's a rough sketch of what we do:

We have a table in a database that contains a numeric value for a user =
and an organization, such as query:

select num from table where userId=3D'bob' and orgId=3D123 (similar to =
what you stated in question #4)

On the Solr side, documents are indexed with user_id_s field, which is =
half of what I need to do my lookup. The orgId is determined by the Solr =
client at query construction time, so is passed to my custom ValueSource =
(aka function) in the query. In our app, users can be associated with =
many different orgIds and changes frequently so we can't index the =
association.

To do the lookup to the database, we have a custom ValueSource, =
something like: dbLookup(user_id_s, 123)

(note: user_id_s is the name of the field holding my userID values in =
the index and 123 is the orgId)

Behind the scenes, the ValueSource will have access to the user_id_s =
field values using FieldCache, something like:

final BinaryDocValues dv =3D
FieldCache.DEFAULT.getTerms(reader.reader(), "user_id_s");

This gives us fast access to the user_id_s value for any given doc =
(question #1 above) So now we can return an IntDocValues instance by
doing:

    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext
reader) throws IOException {
        final BytesRef br =3D new BytesRef();
        final BinaryDocValues dv =3D
FieldCache.DEFAULT.getTerms(reader.reader(), fieldName);
        return new IntDocValues(this) {
            @Override
            public int intVal(int doc) {
                dv.get(doc,br);
                if (br.length =3D=3D 0)
                    return 0;

                final String user_id_s =3D br.utf8ToString(); // the =
indexed userID for doc
                int val =3D 0;
                // todo: do custom lookup with orgID and user_id_s to =
compute int value for doc
                return val;
            }
        }
        ...
    }

In this code, fieldName is set in the constructor (not shown) by parsing =
it out of the parameters, something like:

            this.fieldName =3D
((org.apache.solr.schema.StrFieldSource)source).getField();

The user_id_s field comes into your ValueSource as a StrFieldSource (or =
whatever type you use) ... here is how the ValueSource gets constructed =
at query time:

public class MyValueSourceParser extends ValueSourceParser {
    public void init(NamedList namedList) {}

    public ValueSource parse(FunctionQParser fqp) throws SyntaxError {
        return new MyValueSource(fqp.parseValueSource(), =
fqp.parseArg());
    }
}

There is one instance of your ValueSourceParser created per core. The =
parse method gets called for every query that uses the ValueSource.

At query time, I might use the ValueSource to return this computed value =
in my fl list, such as:

fl=3Did,looked_up:dbLookup(user_id_l,123),...

Or to sort by:

sort=3DdbLookup(user_id_s,123) desc

The data in our table doesn't change that frequently, so we export it to =
a flat file in S3 and our custom ValueSource downloads from S3, =
transforms it into an in-memory HashMap for fast lookups. We thought =
about just issuing a query to load the data from the db directly but we =
have many nodes and the query is expensive and result set is large so we =
didn't want to hammer our database with N Solr nodes querying for the =
same data at roughly the same time. So we do it once and post the =
compressed results to a shared location. The data in the table is =
"sparse" as compared to the number of documents and userIds we have.

We simply poll S3 for changes every few minutes, which is good enough =
for us. This happens from many nodes in a large Solr Cloud cluster =
running in EC2 so S3 works well for us as a distribution mechanism.
Admittedly polling kind of sucks so we tried using Zookeeper to notify =
our custom watchers when a znode changes but a ValueSource doesn't get =
notified when a core is reloaded so we ended up having many weird issues =
with Zookeeper watchers in our custom ValueSource. For example, new =
ValueSourceParsers get created when a core is reloaded but the previous =
instance doesn't get notified that it's going out of service.
So this gives you an idea of how we load external data into a fast =
lookup data structure in Solr (~question #2)

When filtering, we use PostFilter to tell Solr that our filter is =
expensive so should be applied last (after all other criteria have run), =
something like:

fq=3D{!frange l=3D2 u=3D8 cost=3D200 =
cache=3Dfalse}dbLookup(user_id_s,123)

This computes a function range query using our custom ValueSource but =
tells Solr that it is expensive (cost >=3D 100) so apply it after all =
other filters have been applied.
http://yonik.wordpress.com/tag/post-filter/

Lastly, as for speed, the user_id_s field gets loaded into FieldCache =
and the lookup happens in a HashMap, albeit a large one. We're pretty =
happy with the performance of this function and the only rub is that our =
lookup table has to fit in memory.

Hope this helps or at least gives you some info for further questions.

Cheers,
Tim

On Fri, Apr 19, 2013 at 4:03 PM, Maciej Li=C5=BCewski =
<maciej.lizewski@3e.pl> wrote:
> I need some explanation on how ValuesSource and related classes work.
>
> There are already implemented ExternalFileField, example on how to=20
> load data from database (=20
> =
<http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-extern=
al.
> html>
> http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-exte
> rnal.h
> tml)
>
> But they all fetch ALL data into memory which may consume large=20
> amounts of this resource. Also documents are referenced by 'doc' =
integer value.
>
>
>
> My questions:
>
> 1)      Is the 'doc' value pointing to document in whole index? If so =
- how
> to get value of such documents field (for example: field named 'id')?
>
> 2)      Is there possibility to create ValuesSource, FieldType (or =
similar
> interface which will provide external data to sort and in query=20
> results) which will work only on some subset of documents and use=20
> external source capabilities to fetch document related data?
>
> 3)      How does it all work (memory consumption, hashtable access =
speed,
> etc), when there is a lot of documents in index (tens of millions for=20
> example)?
>
> 4)      Are there any other examples on loading external data from =
database
> (I want to have numerical 'rate' from simple table having two columns:
> 'document unique key' string, 'rate' integer/float) which are not just =

> proof of concept but real-life examples?
>
>
>
> Any help and hints appreciated
>
> TIA
>
>
>
> --
>
> Maciek
>