From: "Darren Govoni"
Date: Thu, 3 Jan 2013 14:25:04 +0000
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Ah, ok. Good. Makes sense. I think I will draw all this up in a UML that
includes the distinction between the "logical" terms and the "physical" terms
(and their mapping) as they do get intertwined.
I'll post it here when I'm done.

------- Original Message -------
On 1/3/2013 09:19 AM Jack Krupansky wrote:
A single shard MAY exist on a single core, but only if it is not replicated.
Generally, a single shard will exist on multiple cores, each a replica of
the source data as it comes into the update handler.

-- Jack Krupansky
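
The logical/physical mapping discussed above (collection, shard, replica, core) can be pictured with a small data model. This is only a minimal sketch, not Solr code: the class names, field names, core names, and host:port values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Physical: one Solr core holding a copy of one shard's index."""
    core_name: str
    node: str  # hypothetical host:port of the Solr instance serving the core


@dataclass
class Shard:
    """Logical: a subset ("slice") of a collection's documents."""
    name: str
    replicas: List[Replica] = field(default_factory=list)


@dataclass
class Collection:
    """Logical: the full document set, divided into shards."""
    name: str
    shards: List[Shard] = field(default_factory=list)

    def cores(self) -> List[str]:
        """Names of all cores that physically back this collection."""
        return [r.core_name for s in self.shards for r in s.replicas]


# Hypothetical layout: two shards, each replicated on two cores.
books = Collection("books", [
    Shard("shard1", [Replica("books_shard1_replica1", "host1:8983"),
                     Replica("books_shard1_replica2", "host2:8983")]),
    Shard("shard2", [Replica("books_shard2_replica1", "host1:8983"),
                     Replica("books_shard2_replica2", "host2:8983")]),
])
print(books.cores())
```

In this picture an unreplicated shard is simply a `Shard` with one entry in `replicas`, which is the "single shard on a single core" case.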

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 9:10 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks. I got that part.

A group of shards (and therefore cores) represents a collection, yes. But
does a single shard exist only on a single core?

------- Original Message -------
On 1/3/2013 09:03 AM Jack Krupansky wrote:
No, a shard is a subset (or "slice") of the collection. Sharding is a way of
"slicing" the original data, before we talk about how the shards get stored
and replicated on actual Solr cores. Replicas are instances of the data for
a shard.

Sometimes people may loosely speak of a replica as being "a shard", but
that's just loose use of the terminology.

So, we're not "sharding shards", but we are "replicating shards".

-- Jack Krupansky
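
The idea of "slicing" a collection before any replication happens can be sketched as follows. This is an assumption-laden illustration, not Solr's actual document router: a CRC32 hash modulo the shard count is swapped in as the simplest stand-in, and the function names are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index for a document id (illustrative hash, not Solr's)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def slice_documents(doc_ids: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Assign every document to exactly one shard."""
    shards: Dict[int, List[str]] = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


docs = [f"doc{i}" for i in range(10)]
slices = slice_documents(docs, 3)
for index, members in slices.items():
    print(index, members)
```

Each document lands in exactly one slice. Replication would then copy each slice as a whole; it never subdivides a slice further.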

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with each
core being a replica of the other cores within that shard of that
collection.

A collection is sharded, meaning it is distributed across cores. A shard
itself is not distributed across cores in the same sense. Rather, a shard
exists on a single core and is replicated on other cores. Is that right? The
way it's worded above, it sounds like a shard can also be sharded...

------- Original Message -------
On 1/3/2013 08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real
machine or a virtualized machine. Technically, you could have multiple
virtual nodes on the same physical "box". Each Solr replica would be on a
different node.

Technically, you could have multiple Solr instances running on a single
hardware node, each with a different port. They are simply instances of
Solr, although you could consider each Solr instance a node in a Solr cloud
as well, a "virtual" node. So, technically, you could have multiple replicas
on the same node, but that sort of defeats most of the purpose of having
replicas in the first place - to distribute the data for performance and
fault tolerance. But, you could have replicas of different shards on the
same node/box for a partial improvement of performance and fault
tolerance.

A Solr "cloud" is really a cluster.

-- Jack Krupansky
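
The placement point above (replicas of the same shard should sit on different nodes, while replicas of different shards may share one) can be checked mechanically. The following is a minimal sketch under the assumption that a layout is given as (shard, node) pairs, one per replica; the function name and the sample layouts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def colocated_replicas(placement: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return {shard: [nodes]} for nodes hosting more than one replica of that shard."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for shard, node in placement:
        counts[shard][node] += 1
    return {
        shard: sorted(node for node, n in nodes.items() if n > 1)
        for shard, nodes in counts.items()
        if any(n > 1 for n in nodes.values())
    }


# Replicas of different shards share nodes: acceptable.
good = [("shard1", "nodeA"), ("shard1", "nodeB"),
        ("shard2", "nodeA"), ("shard2", "nodeB")]
# Both replicas of shard1 on the same node: little fault tolerance gained.
bad = [("shard1", "nodeA"), ("shard1", "nodeA"), ("shard2", "nodeB")]

print(colocated_replicas(good))
print(colocated_replicas(bad))
```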

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about "node"?

I think there needs to be an official glossary of terms that is sanctioned
by the Solr team, and some terms still in use may need to be labeled
"deprecated". After so many years, it's still confusing.

------- Original Message -------
On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with each
core being a replica of the other cores within that shard of that
collection.

Instance is a general term, but is commonly used to refer to a running Solr
server, each of which can service any number of cores. A sharded collection
would typically require multiple instances of Solr, each with a shard of the
collection.

Multiple collections can be supported on a single instance of Solr. They
don't have to be sharded or replicated. But if they are, each Solr instance
will have a copy or replica of the data (index) of one shard of each sharded
collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and
giving the collection name. Some operations will refer only to the portion
of a multi-shard collection on that Solr instance, but typically Solr will
"distribute" the operation, whether it be an update or a query, to all of
the shards of the named collection. In the case of update, the update will
be distributed to all replicas as well, but in the case of query only one
replica of each shard of the collection is needed.

Before SolrCloud, Solr had master and slave, and the slaves were replicas
of the master, but with SolrCloud there is no master and all the replicas of
the shard are peers, although at any moment in time one of them will be
considered the "leader" for coordination purposes, but not in the sense that
it is a master of the other replicas in that shard. A SolrCloud replica is a
replica of the data, in an abstract sense, for a single shard of a
collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr
core will have a Lucene index, but collectively the Lucene indexes for the
shards of a collection can be referred to as the index of the collection.
Each replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the
cores/replicas of a single shard, or sometimes to a single replica as it
contains only a "slice" of the full collection data.

-- Jack Krupansky
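
The API-level description above (talk to one Solr instance by host and port, name the collection, and let Solr distribute the request) can be sketched as a URL builder. This is a hedged illustration only: the helper name, host, port, and collection name are hypothetical, the `/solr/<collection>/select` path and the `distrib=false` parameter are assumed from standard Solr usage, and nothing here actually sends a request.

```python
from urllib.parse import urlencode


def select_url(host: str, port: int, collection: str, query: str,
               local_only: bool = False) -> str:
    """Build a query URL against one Solr instance for a named collection."""
    params = {"q": query}
    if local_only:
        # Assumed Solr parameter: restrict the query to the index held locally
        # instead of distributing it to every shard of the collection.
        params["distrib"] = "false"
    return f"http://{host}:{port}/solr/{collection}/select?{urlencode(params)}"


distributed = select_url("localhost", 8983, "books", "title:solr")
local = select_url("localhost", 8983, "books", "title:solr", local_only=True)
print(distributed)
print(local)
```

The first URL would leave distribution to Solr; the second corresponds to the "portion of a multi-shard collection on that Solr instance" case.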

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for the
correct rather than the loose meaning, as I am trying to teach an example
that starts from an easy scenario and may scale to a multi-core,
multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over in
my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology
drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a
clarification.

Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)