Subject: Re: multiple input files as pipeline source?
From: Dave Beech
To: crunch-dev@incubator.apache.org
Date: Wed, 13 Feb 2013 15:27:55 +0000

thanks!

On 13 February 2013 15:22, Victor Iacoban wrote:
> https://gist.github.com/viacoban/4945325
>
>
> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech wrote:
>
>> A gist would be great - thanks very much
>>
>> Dave
>>
>> On 13 February 2013 14:52, Victor Iacoban wrote:
>> > Dave,
>> >
>> > How do you want this, copy pasted code into a gist or a reusable jar?
>> >
>> > --victor
>> >
>> >
>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech wrote:
>> >
>> >> Hi Victor,
>> >> Any chance you could share your implementation of a Source that reads
>> >> from multiple paths? I've wanted this for a while but haven't found
>> >> time to go ahead and write one myself!
>> >> Thanks,
>> >> Dave
>> >>
>> >> On 12 February 2013 23:07, Victor Iacoban wrote:
>> >> > Thanks J
>> >> >
>> >> > I could not extend the FileSourceImpl since it works with only one
>> >> > input path, but I implemented the Source interface directly and it
>> >> > appears to do the job, thx for the pointer
>> >> >
>> >> > -- victor
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills wrote:
>> >> >
>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From.
>> >> >> You can also write a custom extension of o.a.c.io.impl.FileSourceImpl
>> >> >> if it's one you're going to be using a lot, or if there is custom
>> >> >> configuration information required to use the InputFormat.
>> >> >>
>> >> >> J
>> >> >>
>> >> >>
>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>
>> >> >> > That's exactly what I have in the code not using Crunch API:
>> >> >> > public class MultiSequenceFileInputFormat extends
>> >> >> > CombineFileInputFormat {
>> >> >> > ...
>> >> >> > }
>> >> >> >
>> >> >> > Are you saying there is a way to use my custom input format with
>> >> >> > Crunch?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills wrote:
>> >> >> >
>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
>> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, ala
>> >> >> > >
>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >> >> > >
>> >> >> > > J
>> >> >> > >
>> >> >> > >
>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >> > >
>> >> >> > > > Thanks Josh,
>> >> >> > > > Is there any performance penalty in unions, assuming that I
>> >> >> > > > have several hundreds of input files?
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >> > > >
>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >> >> > > > >
>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> >> >> > > > > PTable<K, V> second = ...;
>> >> >> > > > > PTable<K, V> union = first.union(second);
>> >> >> > > > >
>> >> >> > > > > etc.
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >> > > > >
>> >> >> > > > > > Is there any support in crunch to use multiple sequence
>> >> >> > > > > > files as pipeline source?
>> >> >> > > > > > something similar to standard MultipleInputs
>> >> >> > > > > >
>> >> >> > > > > > Thanks,
>> >> >> > > > > > victor
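
Appended illustration (not part of the original messages): the thread suggests combining many tiny input files into fewer splits, in the spirit of CombineFileInputFormat. The sketch below shows that idea in plain Java with no Hadoop or Crunch dependency. The class name SmallFilePacker, the method pack, and the 64 MiB cap are hypothetical choices made for this example. This is a simple greedy grouping by size only; it is not Hadoop's actual split logic, which also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Hadoop or Crunch class.
public class SmallFilePacker {

    /**
     * Greedily groups file sizes (in bytes), in the order given, into
     * splits whose total size stays at or below maxSplitBytes. A single
     * file larger than the cap still gets a split of its own.
     */
    public static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it over the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = new long[300];
        Arrays.fill(sizes, 1L << 20);                      // 300 files of 1 MiB each
        List<List<Long>> splits = pack(sizes, 64L << 20);  // assumed 64 MiB cap per split
        // One task per file would mean 300 tasks; combined splits need far fewer.
        System.out.println(sizes.length + " files -> " + splits.size() + " splits");
    }
}
```

With one split per file, several hundred tiny inputs would mean several hundred map tasks; grouping them by size is what keeps the task count down.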