Subject: Re: Using google cloud storage for spark big data
From: Aureliano Buendia <buendia360@gmail.com>
To: user@spark.apache.org
Date: Tue, 22 Apr 2014 14:42:54 +0100

On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth
<andras.nemeth@lynxanalytics.com> wrote:

> We don't have anything fancy. It's basically a very thin layer of
> Google specifics on top of a standalone cluster. We created two disk
> snapshots, one for the master and one for the workers. The snapshots
> contain initialization scripts, so the master/worker daemons are
> started on boot. So if I want a cluster, I just create a new instance
> (with a fixed name) from the master snapshot. Once it is up, I start
> as many slave instances as I need from the slave snapshot. By the time
> the machines are up, the cluster is ready to be used.

This sounds a lot simpler than the existing spark-ec2 script. Does the
Google Compute Engine API make this easier than the EC2 API does? Does
your script do everything spark-ec2 does?
Also, any plans to make this open source?

> Andras
>
> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi
> <mayur.rustagi@gmail.com> wrote:
>
>> Okay, I just commented on another thread :)
>> I have one that I use internally. I can give it out, but I'll need
>> some support from you to fix bugs etc. Let me know if you are
>> interested.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi
>>
>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia
>> <buendia360@gmail.com> wrote:
>>
>>> Thanks, Andras. What approach did you use to set up a Spark cluster
>>> on Google Compute Engine? Currently, there is no production-ready
>>> official support for an equivalent of spark-ec2 on GCE. Did you roll
>>> your own?
>>>
>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth
>>> <andras.nemeth@lynxanalytics.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia
>>>> <buendia360@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Google has published a new connector for Hadoop: Google Cloud
>>>>> Storage, which is an equivalent of Amazon S3:
>>>>>
>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>
>>>> This is actually about Cloud Datastore, not Cloud Storage (yeah,
>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>> connector for a while already, also linked from your article:
>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>
>>>>> How can Spark be configured to use this connector?
>>>>
>>>> It can, but in a somewhat hacky way. The problem is that, for some
>>>> reason, Google does not officially publish the library jar on its
>>>> own; you get it installed as part of a Hadoop on Google Cloud
>>>> installation. So the official way, which we did not try, would be
>>>> to set up Hadoop on Google Cloud and run Spark on top of that.
>>>>
>>>> The other option, which we did try and which works fine for us, is
>>>> to snatch the jar:
>>>> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar,
>>>> and make sure it's shipped to your workers (e.g. with setJars on
>>>> SparkConf when you create your SparkContext). Then create a
>>>> core-site.xml file and make sure it is on the classpath both in
>>>> your driver and on your cluster (e.g. you can make sure it ends up
>>>> in one of the jars you send with setJars above), with this content
>>>> (with YOUR_* replaced):
>>>>
>>>> <configuration>
>>>>   <property>
>>>>     <name>fs.gs.impl</name>
>>>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.gs.project.id</name>
>>>>     <value>YOUR_PROJECT_ID</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.gs.system.bucket</name>
>>>>     <value>YOUR_FAVORITE_BUCKET</value>
>>>>   </property>
>>>> </configuration>
>>>>
>>>> From this point on, you can simply use gs://... filenames to
>>>> read/write data on Cloud Storage.
>>>>
>>>> Note that you should run your cluster and driver program on Google
>>>> Compute Engine for this to work as is. It is probably possible to
>>>> configure access from outside as well, but we didn't do that.
>>>>
>>>> Hope this helps,
>>>> Andras
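
For anyone wiring this up, here is a minimal, untested driver sketch of
the approach Andras describes, in Scala. The master URL, jar paths, and
bucket name are placeholders of mine, not values from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object GcsExample {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setMaster("spark://YOUR_MASTER_HOST:7077") // placeholder master URL
          .setAppName("gcs-example")
          // Ship the snatched connector jar, plus a jar containing your
          // core-site.xml, to every worker. Both paths are hypothetical.
          .setJars(Seq(
            "/path/to/gcs-connector-1.2.4.jar",
            "/path/to/jar-containing-core-site.jar"))
        val sc = new SparkContext(conf)

        // With fs.gs.impl configured, gs:// paths behave like any other
        // Hadoop filesystem.
        val lines = sc.textFile("gs://YOUR_FAVORITE_BUCKET/input.txt")
        lines.map(_.toUpperCase)
          .saveAsTextFile("gs://YOUR_FAVORITE_BUCKET/output")

        sc.stop()
      }
    }

Note the driver also needs core-site.xml on its own classpath, so the
gs:// scheme resolves on the driver side as well as on the executors.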