Subject: Re: Sourcing data from RedShift
From: Gary Malouf <malouf.gary@gmail.com>
To: Xiangrui Meng <meng@databricks.com>
Cc: Michael Armbrust <michael@databricks.com>, user@spark.apache.org
Date: Fri, 14 Nov 2014 21:29:19 -0500

I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng wrote:

> For each node, if the CSV reader is implemented efficiently, you should be
> able to hit at least half of the theoretical network bandwidth, which is
> about 60MB/second/node. So if you just do counting, the expected time
> should be within 3 minutes.
>
> Note that your cluster has 15GB * 12 = 180GB of RAM in total. If you use
> the default spark.storage.memoryFraction, it can barely cache 100GB of
> data, not counting the overhead. So if your operation needs to cache the
> data to be efficient, you may need a larger cluster or change the storage
> level to MEMORY_AND_DISK.
>
> -Xiangrui
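[Editor's note: a minimal sketch (Scala, Spark 1.x) of the storage-level change suggested above. The S3 path, app name, and RDD name are placeholders, not details from the thread.]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("csv-count"))

    // Placeholder path for the ~100GB of CSV text currently read from S3.
    val records = sc.textFile("s3n://<KEY>:<SECRET>@my-bucket/csv-data/")

    // With spark.storage.memoryFraction at its 1.x default (0.6), 12 x 15GB
    // of executor memory leaves roughly 100GB for cached blocks, so persist
    // with a level that spills partitions to disk instead of dropping them.
    records.persist(StorageLevel.MEMORY_AND_DISK)
    println(records.count())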
> On Nov 14, 2014, at 5:32 PM, Gary Malouf wrote:
>
> Hmm, we actually read the CSV data from S3 now and were looking to avoid
> that. Unfortunately, we've experienced dreadful performance reading 100GB
> of text data for a job directly from S3 - our hope had been that connecting
> directly to Redshift would provide some boost.
>
> We had been using 12 m3.xlarges, but increasing default parallelism (to 2x
> the number of CPUs across the cluster) and increasing partitions during
> reading did not seem to help.
>
> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng wrote:
>
>> Michael is correct. Using a direct connection to dump data would be slow
>> because there is only a single connection. Please use UNLOAD with the
>> ESCAPE option to dump the table to S3. See the instructions at
>>
>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>>
>> And then load the files back using the Redshift input format we wrote:
>> https://github.com/databricks/spark-redshift (we moved the
>> implementation to github/databricks). Right now all columns are loaded as
>> string columns, and you need to do the type casting manually. We plan to
>> add a parser that can translate a Redshift table schema directly to a
>> Spark SQL schema, but there is no ETA yet.
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust wrote:
>>
>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>> command used to produce the data. Xiangrui can correct me if I'm wrong
>> though.
>>
>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf wrote:
>>
>>> We have a bunch of data in Redshift tables that we'd like to pull into
>>> Spark during job runs. What is the path/URL format one uses to pull data
>>> from there? (This is in reference to using
>>> https://github.com/mengxr/redshift-input-format.)
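[Editor's note: a rough sketch of the UNLOAD-then-load workflow described in the quoted messages above. The table name, bucket, prefix, and credentials are placeholders, and the RedshiftInputFormat class and key/value types are assumptions based on the project README at the time; verify them against https://github.com/databricks/spark-redshift before use.]

    // Step 1 (run in Redshift, not Spark): dump the table to S3 with ESCAPE.
    //
    //   UNLOAD ('SELECT * FROM my_table')
    //   TO 's3://my-bucket/my_table_dump/part_'
    //   CREDENTIALS 'aws_access_key_id=<KEY>;aws_secret_access_key=<SECRET>'
    //   ESCAPE;

    // Step 2 (Spark): read the dump back; every column arrives as a string.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import com.databricks.spark.redshift.RedshiftInputFormat  // class name assumed, see README

    val sc = new SparkContext(new SparkConf().setAppName("redshift-load"))

    val records = sc.newAPIHadoopFile(
      "s3n://<KEY>:<SECRET>@my-bucket/my_table_dump/",  // the path format Michael describes
      classOf[RedshiftInputFormat],
      classOf[java.lang.Long],        // key/value types assumed from the README
      classOf[Array[String]])

    // Manual type casting, as noted above: all fields come back as strings.
    val firstColumnAsLong = records.values.map(fields => fields(0).toLong)
    println(firstColumnAsLong.count())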