From: Josh Wills
Date: Wed, 13 Feb 2013 08:18:32 -0800
Subject: Re: multiple input files as pipeline source?
To: crunch

Yep, I would love that.

On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech wrote:

> Actually, while we're on the subject of small files and
> CombineFileInputFormat...
>
> I believe Hive has a feature whereby CombineFileInputFormat is used
> internally if it's required to read many small files to make the
> resulting mapreduce jobs more efficient. Would it be worth looking
> into whether Crunch could support this, too?
>
> On 13 February 2013 15:27, Dave Beech wrote:
> > thanks!
> >
> > On 13 February 2013 15:22, Victor Iacoban wrote:
> >> https://gist.github.com/viacoban/4945325
> >>
> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech wrote:
> >>
> >>> A gist would be great - thanks very much
> >>>
> >>> Dave
> >>>
> >>> On 13 February 2013 14:52, Victor Iacoban wrote:
> >>> > Dave,
> >>> >
> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
> >>> >
> >>> > --victor
> >>> >
> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech wrote:
> >>> >
> >>> >> Hi Victor,
> >>> >> Any chance you could share your implementation of a Source that reads
> >>> >> from multiple paths?
> >>> >> I've wanted this for a while but haven't found
> >>> >> time to go ahead and write one myself!
> >>> >> Thanks,
> >>> >> Dave
> >>> >>
> >>> >> On 12 February 2013 23:07, Victor Iacoban wrote:
> >>> >> > Thanks J
> >>> >> >
> >>> >> > I could not extend the FileSourceImpl since it works with only one
> >>> >> > input path, but I implemented the Source interface directly and it
> >>> >> > appears to do the job, thx for the pointer
> >>> >> >
> >>> >> > -- victor
> >>> >> >
> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills wrote:
> >>> >> >
> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can
> >>> >> >> also write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
> >>> >> >> one you're going to be using a lot, or if there is custom
> >>> >> >> configuration information required to use the InputFormat.
> >>> >> >>
> >>> >> >> J
> >>> >> >>
> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban
> >>> >> >> <victor.iacoban@gmail.com> wrote:
> >>> >> >>
> >>> >> >> > That's exactly what I have in the code not using Crunch API:
> >>> >> >> > public class MultiSequenceFileInputFormat extends
> >>> >> >> > CombineFileInputFormat {
> >>> >> >> > ...
> >>> >> >> > }
> >>> >> >> >
> >>> >> >> > Are you saying there is a way to use my custom input format with
> >>> >> >> > Crunch?
> >>> >> >> >
> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com>
> >>> >> >> > wrote:
> >>> >> >> >
> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
> >>> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, ala
> >>> >> >> > >
> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >>> >> >> > >
> >>> >> >> > > J
> >>> >> >> > >
> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban
> >>> >> >> > > <victor.iacoban@gmail.com> wrote:
> >>> >> >> > >
> >>> >> >> > > > Thanks Josh,
> >>> >> >> > > > Is there any performance penalty in unions, assuming that I
> >>> >> >> > > > have several hundred input files?
> >>> >> >> > > >
> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills
> >>> >> >> > > > <josh.wills@gmail.com> wrote:
> >>> >> >> > > >
> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >>> >> >> > > > >
> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >>> >> >> > > > > PTable<K, V> second = ...;
> >>> >> >> > > > > PTable<K, V> union = first.union(second);
> >>> >> >> > > > >
> >>> >> >> > > > > etc.
> >>> >> >> > > > >
> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban
> >>> >> >> > > > > <victor.iacoban@gmail.com> wrote:
> >>> >> >> > > > >
> >>> >> >> > > > > > Is there any support in Crunch to use multiple sequence
> >>> >> >> > > > > > files as pipeline source?
> >>> >> >> > > > > > something similar to standard MultipleInputs
> >>> >> >> > > > > >
> >>> >> >> > > > > > Thanks,
> >>> >> >> > > > > > victor

--
Director of Data Science
Cloudera
Twitter: @josh_wills
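
Editorial sketch (not part of the original message). The thread suggests a CombineFileInputFormat when there are several hundred tiny inputs; the general idea is that one map task reads a group of files rather than one task per file. The plain-Java sketch below illustrates only that grouping step under simplifying assumptions. It is not Hadoop's implementation (which also takes block locality into account), it is not Crunch API, and the names `SmallFilePacker`, `InputFile`, and `pack`, as well as the file sizes, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: greedy grouping of small files into
// size-bounded "splits". Not Hadoop or Crunch code.
final class SmallFilePacker {

    /** Hypothetical stand-in for an input file: a path plus its size in bytes. */
    public static final class InputFile {
        public final String path;
        public final long sizeBytes;

        public InputFile(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Packs files, in the order given, into groups whose total size stays at
     * or below maxSplitBytes. A single file larger than the limit ends up in
     * a group of its own.
     */
    public static List<List<InputFile>> pack(List<InputFile> files, long maxSplitBytes) {
        List<List<InputFile>> splits = new ArrayList<>();
        List<InputFile> current = new ArrayList<>();
        long currentBytes = 0;
        for (InputFile f : files) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + f.sizeBytes > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(f);
            currentBytes += f.sizeBytes;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up workload: 300 files of 1 MB each, 64 MB per split.
        List<InputFile> files = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            files.add(new InputFile("part-" + i + ".seq", 1_000_000L));
        }
        List<List<InputFile>> splits = pack(files, 64_000_000L);
        System.out.println(files.size() + " files -> " + splits.size() + " splits");
    }
}
```

With the made-up numbers in `main`, 300 one-megabyte files collapse into 5 groups, which is the kind of reduction in task count the thread is after. How a real job chooses the size limit and orders the files is outside what this sketch covers.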