Subject: Re: Using google cloud storage for spark big data
From: Aureliano Buendia <buendia360@gmail.com>
To: user@spark.apache.org
Date: Tue, 22 Apr 2014 14:42:54 +0100

On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth
<andras.nemeth@lynxanalytics.com> wrote:

> We don't have anything fancy. It's basically a very thin layer of
> Google specifics on top of a standalone cluster. We created two disk
> snapshots, one for the master and one for the workers. The snapshots
> contain initialization scripts, so the master/worker daemons are
> started on boot. So if I want a cluster, I just create a new instance
> (with a fixed name) from the master snapshot. Once it is up, I start
> as many slave instances as I need from the slave snapshot. By the time
> the machines are up, the cluster is ready to be used.

This sounds a lot simpler than the existing spark-ec2 script. Does the
Google Compute Engine API make this easier than the EC2 API does? Does
your script do everything spark-ec2 does?
Also, any plans to make this open source?

> Andras
>
> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi
> <mayur.rustagi@gmail.com> wrote:
>
>> Okay, I just commented on another thread :)
>> I have one that I use internally. I can give it out, but I'll need
>> some support from you to fix bugs etc. Let me know if you are
>> interested.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi
>>
>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia
>> <buendia360@gmail.com> wrote:
>>
>>> Thanks, Andras. What approach did you use to set up a Spark cluster
>>> on Google Compute Engine? Currently, there is no production-ready
>>> official support for an equivalent of spark-ec2 on GCE. Did you roll
>>> your own?
>>>
>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth
>>> <andras.nemeth@lynxanalytics.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia
>>>> <buendia360@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Google has published a new connector for Hadoop: Google Cloud
>>>>> Storage, which is an equivalent of Amazon S3:
>>>>>
>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>
>>>> This is actually about Cloud Datastore, not Cloud Storage (yeah,
>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>> connector for a while already, also linked from your article:
>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>
>>>>> How can Spark be configured to use this connector?
>>>>
>>>> It can, but in a somewhat hacky way. The problem is that, for some
>>>> reason, Google does not officially publish the library jar on its
>>>> own; you get it installed as part of a Hadoop on Google Cloud
>>>> installation. So the official way, which we did not try, would be
>>>> to set up Hadoop on Google Cloud and run Spark on top of that.
>>>>
>>>> The other option, which we did try and which works fine for us, is
>>>> to snatch the jar:
>>>> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar,
>>>> and make sure it's shipped to your workers (e.g. with setJars on
>>>> SparkConf when you create your SparkContext). Then create a
>>>> core-site.xml file and make sure it is on the classpath both in
>>>> your driver and on your cluster (e.g. you can make sure it ends up
>>>> in one of the jars you send with setJars above), with this content
>>>> (with YOUR_* replaced):
>>>>
>>>> <configuration>
>>>>   <property>
>>>>     <name>fs.gs.impl</name>
>>>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.gs.project.id</name>
>>>>     <value>YOUR_PROJECT_ID</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.gs.system.bucket</name>
>>>>     <value>YOUR_FAVORITE_BUCKET</value>
>>>>   </property>
>>>> </configuration>
>>>>
>>>> From this point on, you can simply use gs://... filenames to
>>>> read/write data on Cloud Storage.
>>>>
>>>> Note that you should run your cluster and driver program on Google
>>>> Compute Engine for this to work as is. It is probably possible to
>>>> configure access from outside as well, but we didn't do that.
>>>>
>>>> Hope this helps,
>>>> Andras
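
For anyone wiring this up, here is a minimal, untested driver sketch of
the approach Andras describes, in Scala. The master URL, jar paths, and
bucket name are placeholders of mine, not values from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object GcsExample {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setMaster("spark://YOUR_MASTER_HOST:7077") // placeholder master URL
          .setAppName("gcs-example")
          // Ship the snatched connector jar, plus a jar containing your
          // core-site.xml, to every worker. Both paths are hypothetical.
          .setJars(Seq(
            "/path/to/gcs-connector-1.2.4.jar",
            "/path/to/jar-containing-core-site.jar"))
        val sc = new SparkContext(conf)

        // With fs.gs.impl configured, gs:// paths behave like any other
        // Hadoop filesystem.
        val lines = sc.textFile("gs://YOUR_FAVORITE_BUCKET/input.txt")
        lines.map(_.toUpperCase)
          .saveAsTextFile("gs://YOUR_FAVORITE_BUCKET/output")

        sc.stop()
      }
    }

Note the driver also needs core-site.xml on its own classpath, so the
gs:// scheme resolves on the driver side as well as on the executors.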