lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "SolrCloudPlanning" by Mark Miller
Date Fri, 03 Feb 2012 16:07:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "SolrCloudPlanning" page has been changed by Mark Miller:
http://wiki.apache.org/solr/SolrCloudPlanning?action=diff&rev1=44&rev2=45

  ## page was renamed from SolrCloud
  = SolrCloud =
- SolrCloud is the set of Solr features that take Solr's distributed search to the next level,
enabling and simplifying the creation and use of Solr clusters.
  
-  * Central configuration for the entire cluster
-  * Automatic load balancing and fail-over for queries
-  * Cluster state and layout stored in a central system.
- 
- Zookeeper is integrated and used to coordinate and store the configuration of the cluster.
- 
- That is what has been done so far on trunk.
- 
- A second initiative has recently begun to finish the distributed indexing side of SolrCloud.
See https://issues.apache.org/jira/browse/SOLR-2358
- 
- <<TableOfContents(3)>>
- 
- == Getting Started ==
- Check out the trunk: https://svn.apache.org/repos/asf/lucene/dev/trunk and build the example server with {{{cd solr; ant example}}}.
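- 
- For example (a rough sketch; the checkout directory name {{{lucene_trunk}}} is just an arbitrary choice):
- 
- {{{
- svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk
- cd lucene_trunk/solr
- ant example
- }}}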
- 
- If you haven't yet, go through the simple [[http://lucene.apache.org/solr/tutorial.html|Solr
Tutorial]] to familiarize yourself with Solr.
- 
- Solr embeds and uses Zookeeper as a repository for cluster configuration and coordination
- think of it as a distributed filesystem that contains information about all of the Solr
servers.
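- 
- If you want to poke around that data directly, the {{{zkCli.sh}}} shell that ships with a standalone Apache ZooKeeper download can connect to the embedded server once it is running (optional, and just a sketch: the exact znodes you see will depend on the version you are running):
- 
- {{{
- # run from an unpacked ZooKeeper release; 9983 is the embedded server started in Example A below
- bin/zkCli.sh -server localhost:9983
- # inside the shell, list the top-level znodes
- ls /
- }}}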
- 
- === Example A: Simple two shard cluster ===
- This example simply creates a cluster consisting of two solr servers representing two different
shards of a collection.
- 
- Since we'll need two solr servers for this example, simply make a copy of the example directory
for the second server.
- 
- {{{
- cp -r example example2
- }}}
- The following command starts up a Solr server and bootstraps a new Solr cluster.
- 
- {{{
- cd example
- java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
- }}}
-  * {{{-DzkRun}}} causes an embedded zookeeper server to be run as part of this Solr server.
-  * {{{-Dbootstrap_confdir=./solr/conf}}} Since we don't yet have a config in zookeeper,
this parameter causes the local configuration directory {{{./solr/conf}}} to be uploaded as
the "myconf" config.  The name "myconf" is taken from the "collection.configName" param below.
-  * {{{-Dcollection.configName=myconf}}} sets the config to use for the new collection.
- 
- Browse to http://localhost:8983/solr/collection1/admin/zookeeper.jsp to see the state of
the cluster (the zookeeper distributed filesystem).
- 
- You can see from the zookeeper browser that the Solr configuration files were uploaded under
"myconf", and that a new document collection called "collection1" was created.  Under collection1
is a list of shards, the pieces that make up the complete collection.
- 
- Now we want to start up our second server, assigning it a different shard, or piece of the
collection. Simply change the shardId parameter for the appropriate solr core in solr.xml:
- 
- {{{
- cd example2
- sed -ibak 's/shard1/shard2/g' solr/solr.xml
- #OR perl -pi -e 's/shard1/shard2/g' solr/solr.xml
- #note: if you don't have sed or perl installed, you can simply hand edit solr.xml, changing
shard1 to shard2
- }}}
- Then start the second server, pointing it at the cluster:
- 
- {{{
- java -Djetty.port=7574 -DhostPort=7574 -DzkHost=localhost:9983 -jar start.jar
- }}}
-  * {{{-Djetty.port=7574}}}  is just one way to tell the Jetty servlet container to use a
different port.
-  * {{{-DhostPort=7574}}} tells Solr what port the servlet container is running on.
-  * {{{-DzkHost=localhost:9983}}} points to the Zookeeper ensemble containing the cluster
state.  In this example we're running a single Zookeeper server embedded in the first Solr
server.  By default, an embedded Zookeeper server runs at the Solr port plus 1000, so 9983.
- 
- If you refresh the zookeeper browser, you should now see both shard1 and shard2 in collection1.
- 
- Next, index some documents to each server:
- 
- {{{
- cd exampledocs
- java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar ipod_video.xml
- java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar monitor.xml
- }}}
- And now, a request to either server with "distrib=true" results in a distributed search
that covers the entire collection:
- 
- http://localhost:8983/solr/collection1/select?distrib=true&q=*:*
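- 
- The same query can be issued from the command line with curl; since one document was indexed into each shard, the response should report {{{numFound="2"}}}:
- 
- {{{
- curl 'http://localhost:8983/solr/collection1/select?distrib=true&q=*:*'
- }}}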
- 
- If at any point you wish to start over fresh or experiment with different configurations,
you can delete all of the cloud state contained within zookeeper by simply deleting the solr/zoo_data
directory after shutting down the servers.
- 
- === Example B: Simple two shard cluster with shard replicas ===
- This example simply builds on the previous one by creating another copy of shard1 and shard2.  Extra shard copies can be used for high availability and fault tolerance, or simply to increase the query capacity of the cluster.
- 
- Note: This setup uses copy/paste to set up two cores per shard, and distributed searches validate successful completion of this example/exercise.  But DO NOT assume that any new data you index will be distributed across and indexed at each core of a given shard.  That will not happen.  Distributed indexing is not part of SolrCloud yet.  You may, however, adapt a basic implementation of distributed indexing by referring to [[https://issues.apache.org/jira/browse/SOLR-2355|SOLR-2355]].
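- 
- If you do index new documents while working through this example, one manual workaround (just a sketch, not real distributed indexing) is to send the same update to every copy of a shard yourself.  For example, to add another file from exampledocs (here {{{sd500.xml}}}) to both copies of shard1 once both of its servers are running (ports 8983 and 8900, started below):
- 
- {{{
- cd exampledocs
- java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar sd500.xml
- java -Durl=http://localhost:8900/solr/collection1/update -jar post.jar sd500.xml
- }}}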
- 
- First, run through the previous example so we already have two shards and some documents
indexed into each.  Then simply make a copy of those two servers:
- 
- {{{
- cp -r example exampleB
- cp -r example2 example2B
- }}}
- Then start the two new servers on different ports, each in its own window:
- 
- {{{
- cd exampleB
- java -Djetty.port=8900 -DhostPort=8900 -DzkHost=localhost:9983 -jar start.jar
- }}}
- {{{
- cd example2B
- java -Djetty.port=7500 -DhostPort=7500 -DzkHost=localhost:9983 -jar start.jar
- }}}
- Refresh the zookeeper browser page http://localhost:8983/solr/admin/zookeeper.jsp and verify
that 4 solr nodes are up, and that each shard is present at 2 nodes.
- 
- Now send a query to any of the servers to query the cluster:
- 
- http://localhost:7500/solr/collection1/select?distrib=true&q=*:*
- 
- Send this query multiple times and observe the logs from the solr servers.  From your web browser, you may need to hold down CTRL while clicking the refresh button to bypass your browser's HTTP cache.  You should be able to observe Solr load balancing the requests (done via LBHttpSolrServer) across the shard replicas, using different servers to satisfy each request.  There will be a log statement for the top-level request in the server the browser sends the request to, and then a log statement for each sub-request that is merged to produce the complete response.
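- 
- If you would rather drive this from the command line (which also sidesteps browser caching), a simple shell loop sends the query repeatedly while you watch the server logs:
- 
- {{{
- for i in 1 2 3 4; do
-   curl -s 'http://localhost:7500/solr/collection1/select?distrib=true&q=*:*' > /dev/null
- done
- }}}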
- 
- To demonstrate fail-over for high availability, go ahead and kill any one of the Solr servers (just press CTRL-C in the window running the server) and send another query request to any of the remaining servers that are up.
- 
- === Example C: Two shard cluster with shard replicas and zookeeper ensemble ===
- The problem with example B is that while there are enough Solr servers to survive any one
of them crashing, there is only one zookeeper server that contains the state of the cluster.
 If that zookeeper server crashes, distributed queries will still work since the solr servers
remember the state of the cluster last reported by zookeeper.  The problem is that no new
servers or clients will be able to discover the cluster state, and no changes to the cluster
state will be possible.
- 
- Running multiple zookeeper servers in concert (a zookeeper ensemble) allows for high availability
of the zookeeper service.  Every zookeeper server needs to know about every other zookeeper
server in the ensemble, and a majority of servers are needed to provide service.  For example,
a zookeeper ensemble of 3 servers allows any one to fail with the remaining 2 constituting
a majority to continue providing service.  5 zookeeper servers are needed to allow for the
failure of up to 2 servers at a time.
- 
- For production, it's recommended that you run an external zookeeper ensemble rather than
having Solr run embedded zookeeper servers.  For this example, we'll use the embedded servers
for simplicity.
- 
- First, stop all 4 servers and then clean up the zookeeper data directories for a fresh start.
- 
- {{{
- rm -r example*/solr/zoo_data
- }}}
- We will be running the servers again at ports 8983, 7574, 8900, and 7500.  The default is to run an embedded zookeeper server at hostPort+1000, so if we run an embedded zookeeper on the first three servers, the ensemble address will be {{{localhost:9983,localhost:8574,localhost:9900}}}.
- 
- As a convenience, we'll have the first server upload the solr config to the cluster.  You will notice that it blocks until you have actually started the second server.  This is because zookeeper needs a quorum of servers before it can operate.
- 
- {{{
- cd example
- java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900
 -jar start.jar
- }}}
- {{{
- cd example2
- java -Djetty.port=7574 -DhostPort=7574 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900
-jar start.jar
- }}}
- {{{
- cd exampleB
- java -Djetty.port=8900 -DhostPort=8900 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900
-jar start.jar
- }}}
- {{{
- cd example2B
- java -Djetty.port=7500 -DhostPort=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900
-jar start.jar
- }}}
- Now since we are running three embedded zookeeper servers as an ensemble, everything can keep working even if a server is lost. To demonstrate this, kill the exampleB server by pressing CTRL+C in its window and then browse to http://localhost:8983/solr/admin/zookeeper.jsp to verify that the zookeeper service still works.
- 
- == ZooKeeper ==
- Multiple Zookeeper servers running together for fault tolerance and high availability is
called an ensemble.  For production, it's recommended that you run an external zookeeper ensemble
rather than having Solr run embedded servers.  See the [[http://zookeeper.apache.org/|Apache
ZooKeeper]] site for more information on downloading and running a zookeeper ensemble.
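- 
- As a rough idea of what that involves (a sketch only; directory names are placeholders, and a real ensemble needs {{{server.N}}} entries in zoo.cfg on every node as described in the ZooKeeper docs):
- 
- {{{
- cd zookeeper-x.y.z
- cp conf/zoo_sample.cfg conf/zoo.cfg   # then edit dataDir and clientPort as needed
- bin/zkServer.sh start
- }}}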
- 
- When Solr runs an embedded zookeeper server, it defaults to using the solr port plus 1000 for the zookeeper client port.  In addition, it defaults to adding one to the client port for the zookeeper server port, and two for the zookeeper leader election port.  So in the first example with Solr running at 8983, the embedded zookeeper server used port 9983 for the client port and ports 9984 and 9985 for the server and election ports.
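- 
- Applying the same defaults to the other ports used in these examples gives:
- 
- {{{
- hostPort=8983  ->  client port 9983, server port 9984, election port 9985
- hostPort=7574  ->  client port 8574, server port 8575, election port 8576
- hostPort=8900  ->  client port 9900, server port 9901, election port 9902
- }}}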
- 
- == Creating cores via CoreAdmin ==
- New Solr cores may also be created and associated with a collection via CoreAdmin.
- 
- Additional cloud related parameters for the CREATE action:
- 
-  * '''collection''' - the name of the collection this core belongs to.  Default is the name
of the core.
-  * '''shard''' - the shard id this core represents
-  * '''collection.<param>=<value>''' - causes a property of <param>=<value>
to be set if a new collection is being created.
-   * Use  collection.configName=<configname> to point to the config for a new collection.
- 
- Example:
- 
- {{{
- curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=collection1&shard=shard2'
- }}}
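- 
- To create a core for a brand-new collection you can also pass {{{collection.configName}}} (the core and collection names below are made up for illustration; {{{myconf}}} is the config uploaded in the earlier examples):
- 
- {{{
- curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=newcore&collection=newcollection&shard=shard1&collection.configName=myconf'
- }}}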
- == Distributed Requests ==
- Explicitly specify the addresses of shards you want to query:
- 
- {{{
- shards=localhost:8983/solr,localhost:7574/solr
- }}}
- Explicitly specify the addresses of shards you want to query, giving alternatives (delimited
by `|`) used for load balancing and fail-over:
- 
- {{{
- shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- }}}
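- 
- For a concrete request, the shards value is simply appended to a normal query URL. For example, using the ports from Example B (the single quotes keep the shell from interpreting the {{{|}}} characters; percent-encode them as %7C if your client requires it):
- 
- {{{
- curl 'http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr'
- }}}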
- Query all shards of a collection (the collection is implicit in the URL):
- 
- {{{
- http://localhost:8983/solr/collection1/select?distrib=true
- }}}
- Query specific shard ids. In this example, the user has partitioned the index by date, creating
a new shard every month.
- 
- {{{
- http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001&distrib=true
- }}}
- NOT IMPLEMENTED: Query all shards of a compatible collection, explicitly specified:
- 
- {{{
- http://localhost:8983/solr/collection1/select?collection=collection1_recent
- }}}
- NOT IMPLEMENTED: Query all shards of multiple compatible collections, explicitly specified:
- 
- {{{
- http://localhost:8983/solr/collection1/select?collection=collection1_NY,collection1_NJ,collection1_CT
- }}}
  = Developer Section =
  
  == Tools ==
