spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Missing Spark URL after starting the master
Date Tue, 04 Mar 2014 17:59:09 GMT
I have set it up on the Cloudera VM:
http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
Which version of Spark are you trying to set up on Cloudera, and which
Cloudera version are you using?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang <binwang.cu@gmail.com> wrote:

> Hi Ognen/Mayur,
>
> Thanks for the reply; it is good to know how easy it is to set up Spark
> on an AWS cluster.
>
> My situation is a bit different from yours: our company already has a
> cluster, and it really doesn't make much sense not to use it. That is
> why I have been "going through" this. I really wish there were some
> tutorials on how to set up a Spark cluster on a bare-metal CDH cluster,
> or some way to tweak the CDH Spark distribution so it is up to date.
>
> Ognen, it would of course be very helpful if you could run 'history | grep
> spark... ' and document the work you have done, since you have already
> got it working!
>
> Bin
>
>
>
> On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski <
> ognen@plainvanillagames.com> wrote:
>
>>  I should add that in this setup you really do not need to look for the
>> printout of the master node's IP - you set it yourself a priori. If anyone
>> is interested, let me know, I can write it all up so that people can follow
>> some set of instructions. Who knows, maybe I can come up with a set of
>> scripts to automate it all...
>>
>> Ognen
>>
>>
>>
>> On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
>>
>> I have a standalone Spark cluster running in an Amazon VPC that I set up
>> by hand. All I did was provision the machines from a common AMI (my
>> underlying distribution is Ubuntu) and create a "sparkuser" on each
>> machine, with a /home/sparkuser/spark folder where I downloaded Spark. I
>> did this on the master only. There I ran sbt/sbt assembly and set up
>> conf/spark-env.sh to point to the master by IP address (in my case
>> 10.10.0.200; the port is the default, 7077). I also set up the slaves file
>> in the same subdirectory to list the IP addresses of all 16 worker nodes
>> (in my case 10.10.0.201-216). After sbt/sbt assembly was done on the
>> master, I ran cd ~/; tar -czf spark.tgz spark/ and copied the resulting
>> .tgz file to each worker using the same "sparkuser" account, then unpacked
>> it on each slave (this effectively replicates everything from the master
>> to all slaves - you can script this so you don't do it by hand).
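
The by-hand steps above (spark-env.sh, the slaves file, copying the tarball) could be scripted roughly like this. It is only a sketch, not Ognen's actual script: the function names are made up for illustration, the addresses and the "sparkuser" account are taken from the message, and SPARK_MASTER_IP / SPARK_MASTER_PORT are the standalone-mode settings I would expect in conf/spark-env.sh for 0.9. The last function only prints the scp/ssh commands so they can be reviewed before anything is run.

```shell
# Sketch only: helper names are hypothetical; values come from the message above.

# Lines one might put in conf/spark-env.sh on the master (assumed variable names).
spark_env_lines() {
  echo "export SPARK_MASTER_IP=10.10.0.200"
  echo "export SPARK_MASTER_PORT=7077"
}

# Print the 16 worker addresses, one per line (10.10.0.201 .. 10.10.0.216).
worker_ips() {
  i=201
  while [ "$i" -le 216 ]; do
    echo "10.10.0.$i"
    i=$((i + 1))
  done
}

# Print (do not run) the copy-and-unpack commands for every worker.
# ssh -n keeps ssh from reading stdin if the output is later fed to a shell.
distribute_commands() {
  worker_ips | while read -r ip; do
    echo "scp ~/spark.tgz sparkuser@${ip}:"
    echo "ssh -n sparkuser@${ip} 'tar -xzf spark.tgz'"
  done
}
```

For example, worker_ips > ~/spark/conf/slaves would write the slaves file, and distribute_commands > copy.sh followed by sh copy.sh would do the copy once the printed commands look right. Both assume key-based SSH for sparkuser, which the original message does not mention.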
>>
>> Your AMI should have the distribution's version of Java and git installed
>> by the way.
>>
>> All you have to do then is sparkuser@spark-master>
>> spark/sbin/start-all.sh (that is for 0.9; in 0.8.1 it is
>> spark/bin/start-all.sh) and it will all automagically start :)
>>
>> All my Amazon nodes come with 4x400 GB of ephemeral space, which I have
>> set up into a 1.6TB RAID0 array on each node and I am pooling this into an
>> HDFS filesystem which is operated by a namenode outside the spark cluster
>> while all the datanodes are the same nodes as the spark workers. This
>> enables replication and extremely fast access since ephemeral is much
>> faster than EBS or anything else on Amazon (you can do even better with SSD
>> drives on this setup but it will cost ya).
>>
>> If anyone is interested I can document our pipeline setup. I came up
>> with it myself and do not have a clue as to what the industry standards are,
>> since I could not find any written instructions anywhere online about how
>> to set up a whole data analytics pipeline from the point of ingestion to
>> the point of analytics (people don't want to share their secrets? or am I
>> just in the dark and incapable of using Google properly?). My requirement
>> was that this run within a VPC for added security and simplicity; the
>> Amazon security groups get really old quickly. An added bonus is that you
>> can use a VPN as an entry into the whole system, and your cluster instantly
>> becomes "local" to you in terms of IPs etc. I use OpenVPN since I like
>> neither Cisco nor Juniper (the only two options Amazon provides for their
>> VPN gateways).
>>
>> Ognen
>>
>>
>> On 3/3/14, 1:00 PM, Bin Wang wrote:
>>
>> Hi there,
>>
>>  I have a CDH cluster set up, and I tried using the Spark parcel that
>> comes with Cloudera Manager, but it turned out it doesn't even have the
>> run-example shell command in the bin folder. I then removed it from the
>> cluster, cloned incubator-spark onto the name node of my cluster, and
>> built from source there successfully with all the defaults.
>>
>>  I ran a few examples and everything seems to work fine in local mode.
>> Now I am thinking about scaling it out to my cluster, which is what the
>> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all
>> the datanodes to the slaves and think I should run Spark in standalone
>> mode.
>>
>>  Say I am trying to set up Spark in standalone mode following these
>> instructions:
>> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
>> However, it says "Once started, the master will print out a
>> spark://HOST:PORT URL for itself, which you can use to connect workers
>> to it, or pass as the “master” argument to SparkContext. You can also
>> find this URL on the master’s web UI, which is http://localhost:8080 by
>> default."
>>
>>  After I started the master, no URL was printed on the screen, and the
>> web UI is not running either.
>> Here is the output:
>>  [root@box incubator-spark]# ./sbin/start-master.sh
>> starting org.apache.spark.deploy.master.Master, logging to
>> /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
>>
>>  First Question: am I even in the ballpark in running Spark in standalone
>> mode if I want to fully utilize my cluster? I saw there are four ways to
>> launch Spark on a cluster: AWS EC2, Spark standalone, Apache Mesos, and
>> Hadoop YARN. I guess standalone mode is the way to go?
>>
>>  Second Question: how do I get the Spark URL of the cluster, and why is
>> the output not what the instructions say?
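
On the second question, here is a small sketch for looking up the URL. start-master.sh only reports which file it logs to, so if the master came up, the spark://HOST:PORT string should be in that .out file rather than on the screen. That is my assumption about where it ends up, and spark_url_from_log is a made-up helper, not something shipped with Spark.

```shell
# Sketch only: spark_url_from_log is a hypothetical helper, not part of Spark.
# Print the first spark://HOST:PORT string found in the given log file.
spark_url_from_log() {
  grep -o 'spark://[^[:space:]]*' "$1" | head -n 1
}
```

Run from the incubator-spark directory, this would be spark_url_from_log logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out. If it prints nothing, the master probably did not start cleanly, and the same file is the place to look for the error.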
>>
>>  Best regards,
>>
>>  Bin
>>
>>
>> --
>> Some people, when confronted with a problem, think "I know, I'll use
>> regular expressions." Now they have two problems.
>> -- Jamie Zawinski
>>
>>
>
