spark-user mailing list archives

From Gary Malouf <malouf.g...@gmail.com>
Subject Re: Sourcing data from RedShift
Date Sat, 15 Nov 2014 02:29:19 GMT
I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng <meng@databricks.com> wrote:

> For each node, if the CSV reader is implemented efficiently, you should be
> able to hit at least half of the theoretical network bandwidth, which is
> about 60MB/second/node. So if you just do counting, the expected time
> should be within 3 minutes.
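>
> (Back-of-the-envelope with the numbers from this thread: 100GB across 12
> nodes at 60MB/second/node is 102,400MB / 720MB/s ~= 142 seconds, or
> roughly 2.4 minutes.)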
>
> Note that your cluster has 15GB * 12 = 180GB of RAM in total. With the
> default spark.storage.memoryFraction (0.6), it can barely cache 100GB of
> data, not counting the overhead. So if your operation needs to cache the
> data to be efficient, you may need a larger cluster or change the storage
> level to MEMORY_AND_DISK.
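>
> A minimal sketch in Scala of that storage-level change (assuming a
> spark-shell sc; the path is a placeholder):
>
>   import org.apache.spark.storage.StorageLevel
>
>   val records = sc.textFile("s3n://my-bucket/unload/") // placeholder path
>   records.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit to disk
>   records.count() // first action materializes the cache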
>
> -Xiangrui
>
> On Nov 14, 2014, at 5:32 PM, Gary Malouf <malouf.gary@gmail.com> wrote:
>
> Hmm, we actually read the CSV data from S3 now and were looking to avoid
> that.  Unfortunately, we've experienced dreadful performance reading 100GB
> of text data for a job directly from S3 - our hope had been that connecting
> directly to Redshift would provide some boost.
>
> We had been using 12 m3.xlarges, but increasing the default parallelism (to
> 2x the number of CPUs across the cluster) and increasing the number of
> partitions during reading did not seem to help.
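>
> For reference, a sketch of that tuning in Scala (the core count assumes 4
> vCPUs per m3.xlarge, i.e. 48 total; app name and path are placeholders):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val conf = new SparkConf()
>     .setAppName("csv-read") // placeholder
>     .set("spark.default.parallelism", "96") // 2x the 48 cores in the cluster
>   val sc = new SparkContext(conf)
>   val text = sc.textFile("s3n://my-bucket/data/", 96) // minPartitions hint for the read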
>
> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <meng@databricks.com>
> wrote:
>
>> Michael is correct. Using a direct connection to dump data would be slow
>> because there is only a single connection. Please use UNLOAD with the
>> ESCAPE option to dump the table to S3. See instructions at
>>
>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
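>>
>> A minimal sketch of issuing that UNLOAD over JDBC from Scala (the
>> endpoint, credentials, table, and bucket are all placeholders):
>>
>>   import java.sql.DriverManager
>>
>>   val conn = DriverManager.getConnection(
>>     "jdbc:postgresql://my-cluster.redshift.amazonaws.com:5439/mydb", // placeholder endpoint
>>     "user", "pass")
>>   val stmt = conn.createStatement()
>>   stmt.execute("""UNLOAD ('select * from my_table')
>>     TO 's3://my-bucket/unload/my_table_'
>>     CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
>>     ESCAPE""")
>>   conn.close()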
>>
>> And then load the files back using the Redshift input format we wrote:
>> https://github.com/databricks/spark-redshift (we moved the
>> implementation to github/databricks). Right now all columns are loaded as
>> string columns, and you need to do the type casting manually. We plan to
>> add a parser that can translate a Redshift table schema directly to a
>> Spark SQL schema, but there is no ETA yet.
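>>
>> A sketch of reading the unloaded files back (the key/value classes here
>> are assumptions - check the project README for the exact types):
>>
>>   import com.databricks.spark.redshift.RedshiftInputFormat
>>
>>   val records = sc.newAPIHadoopFile(
>>     "s3n://my-bucket/unload/", // placeholder path
>>     classOf[RedshiftInputFormat],
>>     classOf[java.lang.Long], // assumed key: byte offset into the file
>>     classOf[Array[String]]) // assumed value: one record's columns, all strings
>>   // everything arrives as strings, so cast manually, e.g.:
>>   val firstColAsLong = records.map { case (_, row) => row(0).toLong }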
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <michael@databricks.com>
>> wrote:
>>
>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>> command used to produce the data.  Xiangrui can correct me if I'm wrong
>> though.
>>
>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <malouf.gary@gmail.com>
>> wrote:
>>
>>> We have a bunch of data in Redshift tables that we'd like to pull into
>>> Spark during job runs.  What is the path/URL format one uses to pull
>>> data from there?  (This is in reference to using
>>> https://github.com/mengxr/redshift-input-format)
>>>
