lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maciej Liżewski <maciej.lizew...@3e.pl>
Subject RE: external values source
Date Mon, 22 Apr 2013 08:54:34 GMT
Hi Timothy,

Thank you for your answer - it is really helpful. Just to clarify - when using ValueSource
then flow is something like this:
- user sends query
- solr calls ValueSource to prepare values for every document (this part is cached in ExternalFileField
implementation I guess)
- solr runs query

And above flow is valid in every use case of ValueSource? There are no pre-calculated values,
etc (just asking to make it clear)? What caching scenario is recommended here to make sure
you won't end up with different cached entry for every query (I think I would follow the example
of ExternalFileField)?

Another thing is that in most cases array of values created in this process is rather sparse..
so I was thinking if there are no other solutions to store them with associatnion to documents
index...

--
Maciej Liżewski


-----Original Message-----
From: Timothy Potter [mailto:thelabdude@gmail.com] 
Sent: Saturday, April 20, 2013 2:02 AM
To: solr-user@lucene.apache.org
Subject: Re: external values source

Hi Maciek,

I think a custom ValueSource is definitely what you want because you need to compute some
derived value based on an indexed field and some external value.

The trick is figuring how to make the lookup to the external data very, very fast. Here's
a rough sketch of what we do:

We have a table in a database that contains a numeric value for a user and an organization,
such as query:

select num from table where userId='bob' and orgId=123 (similar to what you stated in question
#4)

On the Solr side, documents are indexed with user_id_s field, which is half of what I need
to do my lookup. The orgId is determined by the Solr client at query construction time, so
is passed to my custom ValueSource (aka function) in the query. In our app, users can be associated
with many different orgIds and changes frequently so we can't index the association.

To do the lookup to the database, we have a custom ValueSource, something like: dbLookup(user_id_s,
123)

(note: user_id_s is the name of the field holding my userID values in the index and 123 is
the orgId)

Behind the scenes, the ValueSource will have access to the user_id_s field values using FieldCache,
something like:

final BinaryDocValues dv =
FieldCache.DEFAULT.getTerms(reader.reader(), "user_id_s");

This gives us fast access to the user_id_s value for any given doc (question #1 above) So
now we can return an IntDocValues instance by
doing:

    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext
reader) throws IOException {
        final BytesRef br = new BytesRef();
        final BinaryDocValues dv =
FieldCache.DEFAULT.getTerms(reader.reader(), fieldName);
        return new IntDocValues(this) {
            @Override
            public int intVal(int doc) {
                dv.get(doc,br);
                if (br.length == 0)
                    return 0;

                final String user_id_s = br.utf8ToString(); // the indexed userID for doc
                int val = 0;
                // todo: do custom lookup with orgID and user_id_s to compute int value for
doc
                return val;
            }
        }
        ...
    }

In this code, fieldName is set in the constructor (not shown) by parsing it out of the parameters,
something like:

            this.fieldName =
((org.apache.solr.schema.StrFieldSource)source).getField();

The user_id_s field comes into your ValueSource as a StrFieldSource (or whatever type you
use) ... here is how the ValueSource gets constructed at query time:

public class MyValueSourceParser extends ValueSourceParser {
    public void init(NamedList namedList) {}

    public ValueSource parse(FunctionQParser fqp) throws SyntaxError {
        return new MyValueSource(fqp.parseValueSource(), fqp.parseArg());
    }
}

There is one instance of your ValueSourceParser created per core. The parse method gets called
for every query that uses the ValueSource.

At query time, I might use the ValueSource to return this computed value in my fl list, such
as:

fl=id,looked_up:dbLookup(user_id_l,123),...

Or to sort by:

sort=dbLookup(user_id_s,123) desc

The data in our table doesn't change that frequently, so we export it to a flat file in S3
and our custom ValueSource downloads from S3, transforms it into an in-memory HashMap for
fast lookups. We thought about just issuing a query to load the data from the db directly
but we have many nodes and the query is expensive and result set is large so we didn't want
to hammer our database with N Solr nodes querying for the same data at roughly the same time.
So we do it once and post the compressed results to a shared location. The data in the table
is "sparse" as compared to the number of documents and userIds we have.

We simply poll S3 for changes every few minutes, which is good enough for us. This happens
from many nodes in a large Solr Cloud cluster running in EC2 so S3 works well for us as a
distribution mechanism.
Admittedly polling kind of sucks so we tried using Zookeeper to notify our custom watchers
when a znode changes but a ValueSource doesn't get notified when a core is reloaded so we
ended up having many weird issues with Zookeeper watchers in our custom ValueSource. For example,
new ValueSourceParsers get created when a core is reloaded but the previous instance doesn't
get notified that it's going out of service.
So this gives you an idea of how we load external data into a fast lookup data structure in
Solr (~question #2)

When filtering, we use PostFilter to tell Solr that our filter is expensive so should be applied
last (after all other criteria have run), something like:

fq={!frange l=2 u=8 cost=200 cache=false}dbLookup(user_id_s,123)

This computes a function range query using our custom ValueSource but tells Solr that it is
expensive (cost >= 100) so apply it after all other filters have been applied.
http://yonik.wordpress.com/tag/post-filter/

Lastly, as for speed, the user_id_s field gets loaded into FieldCache and the lookup happens
in a HashMap, albeit a large one. We're pretty happy with the performance of this function
and the only rub is that our lookup table has to fit in memory.

Hope this helps or at least gives you some info for further questions.

Cheers,
Tim

On Fri, Apr 19, 2013 at 4:03 PM, Maciej Liżewski <maciej.lizewski@3e.pl> wrote:
> I need some explanation on how ValuesSource and related classes work.
>
> There are already implemented ExternalFileField, example on how to 
> load data from database ( 
> <http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.
> html>
> http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-exte
> rnal.h
> tml)
>
> But they all fetch ALL data into memory which may consume large 
> amounts of this resource. Also documents are referenced by 'doc' integer value.
>
>
>
> My questions:
>
> 1)      Is the 'doc' value pointing to document in whole index? If so - how
> to get value of such documents field (for example: field named 'id')?
>
> 2)      Is there possibility to create ValuesSource, FieldType (or similar
> interface which will provide external data to sort and in query 
> results) which will work only on some subset of documents and use 
> external source capabilities to fetch document related data?
>
> 3)      How does it all work (memory consumption, hashtable access speed,
> etc), when there is a lot of documents in index (tens of millions for 
> example)?
>
> 4)      Are there any other examples on loading external data from database
> (I want to have numerical 'rate' from simple table having two columns:
> 'document unique key' string, 'rate' integer/float) which are not just 
> proof of concept but real-life examples?
>
>
>
> Any help and hints appreciated
>
> TIA
>
>
>
> --
>
> Maciek
>


Mime
View raw message