Date: Wed, 13 Feb 2013 08:59:18 +0000
Subject: Re: multiple input files as pipeline source?
From: Dave Beech
To: crunch-dev@incubator.apache.org

Hi Victor,

Any chance you could share your implementation of a Source that reads from
multiple paths? I've wanted this for a while but haven't found time to go
ahead and write one myself!

Thanks,
Dave

On 12 February 2013 23:07, Victor Iacoban wrote:

> Thanks J
>
> I could not extend the FileSourceImpl since it works with only one input
> path, but I implemented the Source interface directly and it appears to do
> the job, thx for the pointer
>
> -- victor
>
>
> On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills wrote:
>
>> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
>> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
>> going to be using a lot, or if there is custom configuration information
>> required to use the InputFormat.
>>
>> J
>>
>>
>> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban wrote:
>>
>> > That's exactly what I have in the code not using Crunch API:
>> > public class MultiSequenceFileInputFormat extends
>> > CombineFileInputFormat {
>> > ...
>> > }
>> >
>> > Are you saying there is way to use my custom input format with Crunch?
>> >
>> >
>> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills wrote:
>> >
>> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
>> > > can be worthwhile to have a CombineFileInputFormat, ala
>> > >
>> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> > >
>> > > J
>> > >
>> > >
>> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> > >
>> > > > Thanks Josh,
>> > > > Is there any performance penalty in unions, assuming that I have several
>> > > > hundreds of input files?
>> > > >
>> > > >
>> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills wrote:
>> > > >
>> > > > > Yeah, of course-- that's how stuff like joins work.
>> > > > >
>> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> > > > > PTable<K, V> second = ...;
>> > > > > PTable<K, V> union = first.union(second);
>> > > > >
>> > > > > etc.
>> > > > >
>> > > > >
>> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> > > > >
>> > > > > > Is there any support in crunch to use multiple sequence files as pipeline
>> > > > > > source?
>> > > > > > something similar to standard MultipleInputs
>> > > > > >
>> > > > > > Thanks,
>> > > > > > victor
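
For several hundred inputs, the union approach suggested in the thread amounts to reading each path and folding the results together with repeated union calls. The following is only a minimal sketch of that fold, written against a hypothetical stand-in type so that it runs without Crunch on the classpath. Records, unionAll and the read function are illustrative names, not Crunch API; in a real pipeline the read function would presumably wrap pipeline.read(...) and union would be the one on PCollection or PTable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class UnionSketch {

    /** Hypothetical stand-in for a Crunch collection; only union() is modelled. */
    static final class Records<T> {
        private final List<T> values;

        private Records(List<T> values) {
            this.values = values;
        }

        @SafeVarargs
        static <T> Records<T> of(T... items) {
            return new Records<>(new ArrayList<>(Arrays.asList(items)));
        }

        /** Returns a new collection: this collection's records followed by other's. */
        Records<T> union(Records<T> other) {
            List<T> merged = new ArrayList<>(values);
            merged.addAll(other.values);
            return new Records<>(merged);
        }

        List<T> values() {
            return Collections.unmodifiableList(values);
        }
    }

    /**
     * Reads every path with the supplied (assumed) read function and folds the
     * results into a single collection by repeated union.
     */
    static <T> Records<T> unionAll(List<String> paths, Function<String, Records<T>> read) {
        if (paths.isEmpty()) {
            throw new IllegalArgumentException("at least one input path is required");
        }
        Records<T> result = read.apply(paths.get(0));
        for (String path : paths.subList(1, paths.size())) {
            result = result.union(read.apply(path));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("part-0", "part-1", "part-2");
        // The lambda fakes a reader; each "file" yields two records.
        Records<String> all = unionAll(paths, p -> Records.of(p + "/a", p + "/b"));
        System.out.println(all.values());
    }
}
```

The sketch says nothing about the cost of the union itself. As noted in the thread, when the inputs are many tiny files a CombineFileInputFormat may be the better fit.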