From: Josh Wills <jwills@cloudera.com>
Date: Mon, 9 Sep 2013 11:19:53 -0700
Subject: Re: Writing MapFile through Crunch, issue reading through Hadoop
To: user@crunch.apache.org
Reply-To: user@crunch.apache.org
In-Reply-To: <2493CAF423E7104B9F5C6D863AEAD3C00DB391B2@CERNMSGLS5MB3A.cerner.net>

Looking at the MapFileOutputFormat API, not that I can tell.

On Mon, Sep 9, 2013 at 11:18 AM, Hansen,Chuck <Chuck.Hansen@cerner.com> wrote:

> Thanks for the quick reply Josh. Is there a way I could use a
> PathFilter when creating the MapFile.Reader[] array?
>
>     MapFile.Reader[] readers = MapFileOutputFormat.getReaders(new Path(MAPFILE_LOCATION), conf);
>
> --
> Chuck Hansen
> Software Engineer, Record Dev
> chuck.hansen@cerner.com | 816-201-9629
> Cerner Corporation | www.cerner.com
>
> From: Josh Wills <jwills@cloudera.com>
> Reply-To: "user@crunch.apache.org" <user@crunch.apache.org>
> Date: Monday, September 9, 2013 12:44 PM
> To: "user@crunch.apache.org" <user@crunch.apache.org>
> Subject: Re: Writing MapFile through Crunch, issue reading through Hadoop
>
> Tough to assign blame here-- writing a _SUCCESS bit is usually a good
> thing, and most Hadoop file formats are smart about filtering out files
> that start with "_" or ".", or allowing you to specify an instance of
> PathFilter that can be used to ignore hidden files.
>
> One way around this would be to add an option to Targets that would
> disable writing the _SUCCESS flag, which would be part of a more general
> change to allow per-Source and per-Target configuration options. For
> example, you could specify that some outputs of an MR job were compressed
> using gzip, and others were compressed using Snappy, instead of having a
> single compression strategy for everything.
>
> On Mon, Sep 9, 2013 at 10:28 AM, Hansen,Chuck <Chuck.Hansen@cerner.com> wrote:
>
>> With Crunch versions prior to 0.7.x, there does not appear to be a
>> _SUCCESS file written upon completion; starting with 0.7.x there is.
>> This file (and any others not intended to be read through [1]) appears
>> to cause issues with [1]. This means writing a MapFile with Crunch and
>> reading it back with [1] works prior to 0.7.x, but starting with 0.7.x,
>> [1] will throw an exception.
>>
>> Is this a bug with Crunch and/or Hadoop?
>>
>> [1] org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.getReaders
>>
>> Hadoop CDH versions used:
>>
>>     <hadoopCoreVersion>2.0.0-mr1-cdh4.2.1</hadoopCoreVersion>
>>     <hadoop_commonAndHDFSVersion>2.0.0-cdh4.2.1</hadoop_commonAndHDFSVersion>

--
Director of Data Science
Cloudera
Twitter: @josh_wills
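
A possible workaround for the question in this thread, sketched under assumptions: since getReaders does not appear to accept a PathFilter, the caller could list the output directory, skip entries whose names start with "_" or "." (such as _SUCCESS), and open a reader only on what remains. The example below shows just the filtering step, using java.nio on a local directory so it runs without Hadoop. The names HiddenFileFilter, isHidden, and listVisible are hypothetical and are not part of Crunch or Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Crunch or Hadoop.
final class HiddenFileFilter {

    // True for names that Hadoop tooling conventionally treats as hidden,
    // e.g. "_SUCCESS", "_logs", ".part-00000.crc".
    static boolean isHidden(String name) {
        return name.startsWith("_") || name.startsWith(".");
    }

    // Lists the non-hidden children of outputDir, sorted by path.
    static List<Path> listVisible(Path outputDir) throws IOException {
        List<Path> visible = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(
                outputDir, p -> !isHidden(p.getFileName().toString()))) {
            for (Path child : children) {
                visible.add(child);
            }
        }
        Collections.sort(visible);
        return visible;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a job output directory: two MapFile directories plus _SUCCESS.
        Path out = Files.createTempDirectory("mapfile-output");
        Files.createDirectory(out.resolve("part-r-00000"));
        Files.createDirectory(out.resolve("part-r-00001"));
        Files.createFile(out.resolve("_SUCCESS"));

        // Only the part-r-* directories should be listed.
        for (Path p : listVisible(out)) {
            System.out.println(p.getFileName());
        }
    }
}
```

On HDFS, the same predicate could be wrapped in an org.apache.hadoop.fs.PathFilter and passed to FileSystem.listStatus(Path, PathFilter), with a MapFile.Reader then opened on each directory that survives the filter. That variant is an untested assumption here, not something confirmed against CDH 4.2.1.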