spark-user mailing list archives

From Aureliano Buendia <buendia...@gmail.com>
Subject Re: Using google cloud storage for spark big data
Date Tue, 22 Apr 2014 13:42:54 GMT
On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth <
andras.nemeth@lynxanalytics.com> wrote:

> We don't have anything fancy. It's basically some very thin layer of
> google specifics on top of a stand alone cluster. We basically created two
> disk snapshots, one for the master and one for the workers. The snapshots
> contain initialization scripts so that the master/worker daemons are
> started on boot. So if I want a cluster I just create a new instance (with
> a fixed name) using the master snapshot for the master. When it is up I
> start as many slave instances as I need using the slave snapshot. By the
> time the machines are up the cluster is ready to be used.
>
>
This sounds a lot simpler than the existing spark-ec2 script.
Does the Google Compute Engine API make this simpler to do than the
EC2 API does? Does your script do everything spark-ec2 does?

Also, any plans to make this open source?


> Andras
>
>
>
> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>
>> Okay just commented on another thread :)
>> I have one that I use internally. Can give it out but will need some
>> support from you to fix bugs etc. Let me know if you are interested.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia <buendia360@gmail.com> wrote:
>>
>>> Thanks, Andras. What approach did you use to set up a Spark cluster on
>>> Google Compute Engine? Currently, there is no production-ready, official
>>> equivalent of spark-ec2 for GCE. Did you roll your own?
>>>
>>>
>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
>>> andras.nemeth@lynxanalytics.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <
>>>> buendia360@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Google has published a new connector for Hadoop: Google Cloud
>>>>> Storage, which is an equivalent of Amazon S3:
>>>>>
>>>>>
>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>>
>>>> This is actually about Cloud Datastore and not Cloud Storage (yeah,
>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>> connector for a while, which is also linked from your article:
>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>
>>>>
>>>>>
>>>>>
>>>>> How can spark be configured to use this connector?
>>>>>
>>>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>>>> reason Google does not officially publish the library jar alone; you get
>>>> it installed as part of a Hadoop on Google Cloud installation. So, the
>>>> official way would be (we did not try that) to have a Hadoop on Google
>>>> Cloud installation and run Spark on top of that.
>>>>
>>>> The other option - that we did try and which works fine for us - is to
>>>> snatch the jar:
>>>> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar
>>>> and make sure it's shipped to your workers (e.g. with setJars on SparkConf
>>>> when you create your SparkContext). Then create a core-site.xml file and
>>>> make sure it is on the classpath both in your driver and your cluster (e.g.
>>>> you can make sure it ends up in one of the jars you send with setJars
>>>> above) with this content (with YOUR_* replaced):
>>>> <configuration>
>>>>   <property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value></property>
>>>>   <property><name>fs.gs.project.id</name><value>YOUR_PROJECT_ID</value></property>
>>>>   <property><name>fs.gs.system.bucket</name><value>YOUR_FAVORITE_BUCKET</value></property>
>>>> </configuration>
>>>>
>>>> From this point on you can simply use gs://... filenames to read/write
>>>> data on Cloud Storage.
>>>>
>>>> Note that you should run your cluster and driver program on Google
>>>> Compute Engine for this to work as is. Probably it's possible to configure
>>>> access from the outside too but we didn't do that.
>>>>
>>>> Hope this helps,
>>>> Andras
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
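
For reference, the core-site.xml step quoted above could be scripted roughly
as follows. This is only a sketch, not something from the thread: it builds
the three properties Andras lists and places the file at the root of a small
jar (a jar is a zip archive) so that it can be shipped alongside the connector
jar. The function names and the output file name gcs-conf.jar are made up for
illustration, and the YOUR_* placeholders still need to be replaced.

```python
import xml.etree.ElementTree as ET
import zipfile

# File system class name exactly as given in the quoted core-site.xml.
GCS_FS_CLASS = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"


def build_core_site(project_id, system_bucket):
    """Return core-site.xml content as a string (hypothetical helper)."""
    root = ET.Element("configuration")
    properties = [
        ("fs.gs.impl", GCS_FS_CLASS),
        ("fs.gs.project.id", project_id),
        ("fs.gs.system.bucket", system_bucket),
    ]
    for name, value in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def write_conf_jar(jar_path, xml_text):
    """Store core-site.xml at the root of a jar (hypothetical helper)."""
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        jar.writestr("core-site.xml", xml_text)


if __name__ == "__main__":
    # Placeholders from the quoted message; replace with real values.
    xml_text = build_core_site("YOUR_PROJECT_ID", "YOUR_FAVORITE_BUCKET")
    write_conf_jar("gcs-conf.jar", xml_text)
```

The resulting jar would then be listed next to gcs-connector-1.2.4.jar in the
setJars call; whether that alone is enough to put the file on the driver's
classpath depends on how the driver is launched.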
