lucene-solr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jon Gifford <jon.giff...@gmail.com>
Subject SolrCloud - Using collections, slices and shards in the wild
Date Wed, 10 Feb 2010 05:02:39 GMT
I've been following the progress of the SolrCloud branch closely, and
wanted to explain how I intend to use it, and what that means for how
the collections, slices and shards could work.

I should say up front that Mark, Yonik and I have exchanged a few
emails on this already, and Mark suggested that I switch to this list
to "drum up more interest in others playing with the branch and
chiming in with thoughts." I realize the code is still at a very early
stage, so this is really intended to be more grist for the mill, not a
criticism of the current implementation, whih seems to me to be a very
nice base for what I need to do. Also, I'll apologize up front for the
length of this email, but I wanted to paint as clear a picture as I
could of how I intend to use this stuff.

The system I'll be building needs to be able to:

1) Support one index per customer, and many customers (thus, many
independent indices)

2) Share the same schema across all indicies

3) Allow for time-based shards within a single customers index.

4) as an added twist, some customers will be sending data faster than
can be indexed in a single core, so we'll also need to split the input
stream to multiple cores. Thus, for a given time-based shard, we're
likely to have multiple parallel indexers building independent shards.

Mapping these requirements to the current state of SolrCloud, I could
use a single collection (i.e. a single schema) that all customer
indicies are part of, then create slices of that collection to
represent an individual customers index, each made up of a set of time
based shards, which may themselves be built in parallel on independent
cores..

Alternatively, I could create a collection per customer, which removes
the need for slices, but means duplicating the schema many times. From
an operational standpoint, a single collection makes more sense to me.

The current state of the branch allows me to do some, but not all, of
what I need to do, and I wanted to walk through how I could see myself
using it

Firstly, I'd like to be able to use the REST interface to create new
cores/shards - I'm not going to bet that this is what the final system
will do, but for the stage I'm at now, its the simplest, quickest way
to get going. The current code uses the core name as the collection
name, which won't work for me if I use a single collection. For
example, if I want to create a new core for customer_1 for todays
index, I'd do the following:

    http://localhost:8983/solr/admin/cores?action=CREATE&instanceDir=.&name=collection_1&dataDir=data/customer_1_20100209

This approach is going to lead to a lot of solr instances ;-)

Revising the code to use the core name as a slice, I'd get:

    http://localhost:8983/solr/admin/cores?action=CREATE&instanceDir=.&name=customer_1&dataDir=data/customer_1_20100209

but would need to explicitly add a collection=collection_1 parameter
to the call to make sure it uses the correct collection. The problem
with this approach is that I'm now limited to only being able deliver
one shard per customer from each Solr instance.

Revising again, to use the core name as the shard name, I'd get:

    http://localhost:8983/solr/admin/cores?action=CREATE&instanceDir=.&name=customer_1_20100209&dataDir=data/customer_1_201-0209

and would need explicit collection= and slice= parameters. This is the
ideal situation, because I can run as many hards from the same
customer as I like on a single Solr instance.

So, essentially what I'm saying is that cores and shards really are
identical, and when a core is created, we should be able to specify
the collection and slice that they belong to, via the REST interface.

Here's Marks' comments on this...

> I think we simply haven't thought much about creating cores dynamically
> with http requests yet. You can set a custom shard id initially in the
> solr.xml, or using the CloudDescriptor on the CoreDescriptor when doing
> it programmatically.
>
> Its a good issue to bring up  - I think we will want support to handle
> this stuff with the core admin handler. I can add the basics pretty soon
> I think.
>
> The way things default now (core name is collection) is really only for
> simple bootstrap situations.

and

> Yeah, I think you can do quite a bit with it now, but there is def still
> a lot planned. We are actually working on polishing off what we have as
> a first plateau now. We have mostly been working from either static
> configuration and/or java code in building it up though, so personally
> it hadn't even yet hit me to take care of the HTTP CoreAdmin side of
> things. From a dev side, I just havn't had to use it much, so when I
> think dynamic cores I'm usually thinking java code style.

The second part of what I need is to be able to search a single
customers index, which I'm assuming will be a slice. Something like:

    http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1

would do the trick, assuming we have the slice => shards mapping available.

Reading over some of the previous discussions, slices seem to be
somewhat contentious, and I wanted to chime in on them a bit here. It
seems to me that slices are loosely defined, and I think thats a good
thing. If you think of slices as being similar to tags, then its easy
to imagine that any given shard can belong to many different slices.
For example,

*  I'm going to create a slice per customer, but I don't think there
should be anything stopping me from creating subslices of that slice -
maybe I have a "customer_1_mostRecentWeek" slice, that is the last 7
days worth of daily shards from the customer_1 slice.

*  I may want to build combined indices for some set of customers
(customer_2 through customer_10, say). If I do that, it may still be
desirable (to simplify the front end system that is using Solr)  to
create individual customer_2, customer_3, ..., customer_10 slices,
even though they all map to the same shards

* I may want to combine results from two different customers. It may
be easier to just define a third slice that is the union of the two
customers slices.

One advantage of using slices in this fairly promiscuous fashion is
that it may simplify some of the more interesting use-cases. If we
know that there is an automatically maintained "mostRecentWeek" slice
available, for example, then some of the more complex shard selection
logic can be moved out of the query-time processing entirely.

Enough writing for now, on to some experimentation :-)

Mime
View raw message