lucene-solr-user mailing list archives

From Michal Krajňanský <michal.krajnan...@gmail.com>
Subject Re: Lucene to Solrcloud migration
Date Tue, 11 Nov 2014 15:27:56 GMT
Hi Erick, Michael,

thank you both for your comments.

2014-11-11 5:05 GMT+01:00 Erick Erickson <erickerickson@gmail.com>:

> bq: - the documents are organized in "shards" according to date (integer)
> and language (a possibly extensible discrete set)
>
> bq: - the indexes are disjunct
>
> OK, I'm having a hard time getting my head around these two statements.
>
> If the indexes are disjunct in the sense that you only search one at a
> time, then they are different "collections" in SolrCloud jargon.
>
>
I just meant that every document is contained in exactly one of the
indexes. I have a lot of Lucene indexes for various [language X timespan]
combinations, but logically we are talking about a single huge index. That
is why I thought it would be natural to represent it as a single SolrCloud
collection.

> If, on the other hand, these are a big collection and you want to search
> them all with a single query, I suggest that in SolrCloud land you don't
> want them to be discrete shards. My reasoning here is that let's say you
> have a bunch of documents for October, 2014 in Spanish. By putting these
> all on a single shard, your queries all have to be serviced by that one
> shard. You don't get any parallelism.
>
>
That is right. Actually, parallelization is not the main issue right
now. The queries are very sparse, and currently our system does not support
load balancing at all. I imagined that in the future it could be achieved
via SolrCloud replication.

The main consideration is to be able to plug the indexes in and out on
demand. The total size of the data is in terabytes. We usually want to
search only the latest indexes, but occasionally we need to plug in
one of the older ones.
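
As a rough sketch of what plugging indexes in and out might look like, assuming
each [language X timespan] index were loaded as its own collection and a
collection alias selected the active set (along the lines of Michael's alias
suggestion). The host, alias name, and collection names below are hypothetical;
only the Collections API CREATEALIAS action itself is taken from Solr:

```python
from urllib.parse import urlencode


def create_alias_url(base_url, alias, collections):
    """Build a Collections API CREATEALIAS request URL.

    Re-issuing CREATEALIAS for an existing alias name redefines it, so
    "plugging" an index in or out means sending a new member list.
    """
    if not collections:
        raise ValueError("an alias needs at least one collection")
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    # Keep commas unescaped so the member list stays readable.
    return f"{base_url}/admin/collections?{urlencode(params, safe=',')}"


# Hypothetical names: one collection per [language X timespan] index.
active = ["es_201410", "es_201411"]
print(create_alias_url("http://localhost:8983/solr", "search_all", active))

# Plug in an older index on demand by redefining the alias.
active.append("es_201301")
print(create_alias_url("http://localhost:8983/solr", "search_all", active))
```

This only builds the request URLs; actually sending them, and loading or
unloading the underlying collections, is left out.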

Maybe (probably) I still have some misconceptions about the uses of
SolrCloud...

> If it really does make sense in your case to route all the docs to a
> single shard,
> then Michael's comment is spot-on: use the compositeId router.
>
>
You confuse me here. I was not thinking about a single shard; on the
contrary, each [language X timespan] index would itself be a shard. I agree
that the compositeId router seems natural for what I need. I am currently
looking for a way to convert my indexes so that my document IDs have the
composite format. Currently these are just unique integers, so I would like
to prefix all the document IDs of an index with its language and timespan.
I do not know how, but I believe this should be possible, as it is a
constant operation that would not change the structure of the index.
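
To make the intended ID format concrete, here is a minimal sketch of the
prefixing, assuming the shard key is simply language and timespan joined by an
underscore (that key layout and the function names are my own illustration; the
`key!id` form and the `_route_` query parameter are what the compositeId router
uses):

```python
def composite_id(language, timespan, doc_id):
    """Prefix a document ID with a shard key in compositeId form: key!id."""
    shard_key = f"{language}_{timespan}"
    if "!" in shard_key:
        # '!' is the separator between shard key and ID, so it cannot
        # appear inside the key itself.
        raise ValueError("shard key must not contain '!'")
    return f"{shard_key}!{doc_id}"


def route_param(language, timespan):
    """Value for the _route_ query parameter targeting one shard key."""
    return f"{language}_{timespan}!"


print(composite_id("es", 201410, 42))
print(route_param("es", 201410))
```

Note that the compositeId router hashes the key to pick a shard, so documents
sharing a prefix land together, but which shard that is follows from the hash
rather than from the existing index layout.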

Best,

Michal



> Best,
> Erick
>
> On Mon, Nov 10, 2014 at 11:50 AM, Michael Della Bitta
> <michael.della.bitta@appinions.com> wrote:
> > Hi Michal,
> >
> > Is there a particular reason to shard your collections like that? If it
> > was mainly for ease of operations, I'd consider just using CompositeId to
> > prevent specific types of queries hotspotting particular nodes.
> >
> > If your ingest rate is fast, you might also consider making each
> > "collection" an alias that points to many actual collections, and
> > periodically closing off a collection and starting a new one. This
> > prevents cache churn and the impact of large merges.
> >
> > Michael
> >
> >
> >
> > On 11/10/14 08:03, Michal Krajňanský wrote:
> >>
> >> Hi All,
> >>
> >> I have been working on a project that has long employed a Lucene indexer.
> >>
> >> Currently, the system implements proprietary document routing and index
> >> plugging/unplugging on top of Lucene and of course contains a great
> >> body of indexes. Recently an idea came up to migrate from Lucene to
> >> Solrcloud, which appears to be more powerful than our proprietary system.
> >>
> >> Could you suggest the best way to seamlessly migrate the system to use
> >> Solrcloud, when reindexing is not an option?
> >>
> >> - all the existing indexes represent a single collection in terms of
> >> Solrcloud
> >> - the documents are organized in "shards" according to date (integer)
> >> and language (a possibly extensible discrete set)
> >> - the indexes are disjunct
> >>
> >> I have been able to convert the existing indexes to the newest Lucene
> >> version and plug them individually into the Solrcloud. However, there is
> >> the question of routing, sharding etc.
> >>
> >> Any insight appreciated.
> >>
> >> Best,
> >>
> >>
> >> Michal Krajnansky
> >>
> >
>
