Mailing-List: contact user-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@crunch.apache.org
Received-SPF: pass (athena.apache.org: domain of jwills@cloudera.com
 designates 209.85.217.178 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <2493CAF423E7104B9F5C6D863AEAD3C00DB3915C@CERNMSGLS5MB3A.cerner.net>
References: 
 <2493CAF423E7104B9F5C6D863AEAD3C00DB390FD@CERNMSGLS5MB3A.cerner.net>
 <2493CAF423E7104B9F5C6D863AEAD3C00DB3915C@CERNMSGLS5MB3A.cerner.net>
From: Josh Wills <jwills@cloudera.com>
Date: Mon, 9 Sep 2013 10:44:08 -0700
Message-ID: 
 <CAH29n6MLOpDKr3bvg_mgFO3M17RDLW_A-OAPc+Dc+XFhGCGZzQ@mail.gmail.com>
Subject: Re: Writing MapFile through Crunch, issue reading through Hadoop
To: user@crunch.apache.org
Content-Type: multipart/alternative; boundary=089e0160a432a1d7a004e5f6f0d5

--089e0160a432a1d7a004e5f6f0d5
Content-Type: text/plain; charset=ISO-8859-1

Tough to assign blame here: writing a _SUCCESS marker file is usually a good
thing, and most Hadoop file formats either filter out files whose names
start with "_" or ".", or let you supply an instance of PathFilter that
ignores hidden files.
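
For what it's worth, here is a minimal sketch of that naming rule in plain
Java, with no Hadoop dependency so it runs on its own. The class and method
names are made up for illustration. In a real job the same check would
probably live in the accept(Path) method of an org.apache.hadoop.fs.PathFilter,
applied to path.getName().

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Crunch or Hadoop.
class HiddenFileFilter {
    // Hadoop convention: entries whose names start with "_" or "." are
    // treated as hidden (for example _SUCCESS or .crc checksum files).
    static final Predicate<String> VISIBLE =
        name -> !name.startsWith("_") && !name.startsWith(".");

    // Keep only the entries a reader should actually open.
    static List<String> visibleEntries(List<String> names) {
        return names.stream().filter(VISIBLE).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = List.of(
            "_SUCCESS", "part-r-00000", "part-r-00001", ".part-r-00000.crc");
        // prints [part-r-00000, part-r-00001]
        System.out.println(visibleEntries(listing));
    }
}
```

With a filter like this applied to the output directory listing, the
_SUCCESS marker would be skipped before any MapFile reader is opened.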

One way around this would be to add an option to Targets that would disable
writing the _SUCCESS flag, which would be part of a more general change to
allow per-Source and per-Target configuration options. For example, you
could specify that some outputs of an MR job were compressed using gzip,
and others were compressed using Snappy, instead of having a single
compression strategy for everything.


On Mon, Sep 9, 2013 at 10:28 AM, Hansen,Chuck <Chuck.Hansen@cerner.com> wrote:

>   With Crunch versions prior to 0.7.x, no _SUCCESS file appears to be
> written upon completion; starting with 0.7.x there is one.  This file
> (and any others not intended to be read through [1]) appears to cause
> issues with [1].  This means that writing a MapFile with Crunch and
> reading it back with [1] works prior to 0.7.x, but starting with 0.7.x,
> [1] throws an exception.
>
>  Is this a bug with Crunch and/or Hadoop?
>
>  [1] org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.getReaders
>
> Hadoop CDH versions used:
>
>     <hadoopCoreVersion>2.0.0-mr1-cdh4.2.1</hadoopCoreVersion>
>
>     <hadoop_commonAndHDFSVersion>2.0.0-cdh4.2.1</hadoop_commonAndHDFSVersion>
>
>  --
>  *Chuck Hansen*
> Software Engineer, Record Dev
> chuck.hansen@cerner.com | 816-201-9629
> Cerner Corporation | www.cerner.com
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

--089e0160a432a1d7a004e5f6f0d5--