Date: Wed, 4 Jun 2014 18:02:33 +0200
Subject: Re: Simulating Avro reads in MemPipeline
From: David Whiting
To: dev@crunch.apache.org

Let me see if I can transform the thing I've got already into a more
suitable shape.

On 4 June 2014 17:10, Josh Wills wrote:

> Tracking this here: https://issues.apache.org/jira/browse/CRUNCH-412
>
> David, do you have a patch that would be easy to adapt here?
>
>
> On Wed, Jun 4, 2014 at 8:05 AM, Gabriel Reid wrote:
>
> > On Wed, Jun 4, 2014 at 4:53 PM, Josh Wills wrote:
> > > As long as we're on the subject, the fact that we don't
> > > serialize/deserialize DoFns before we run MemPipelines has burned me
> > > a few times when I would make a change and forget about the
> > > serialization implications until I tried to run the job. One of the
> > > things I like about the local Spark implementation is that it does
> > > this serialization check for you. So if we were to have some mode
> > > that could be enabled to allow us to simulate DoFn serialization and
> > > object re-use situations locally, even at the cost of extra runtime
> > > overhead, I would be happy.
> >
> > Yeah, that sounds like a good plan -- it could probably be done via an
> > overload of MemPipeline.getInstance() that would take a boolean flag
> > to replicate MR or not.
> >
> > >
> > > On Wed, Jun 4, 2014 at 4:27 AM, Gabriel Reid wrote:
> > >
> > >> Hi David,
> > >>
> > >> Yeah, the whole object reuse situation in MapReduce is quite a drag.
> > >> By the way, this isn't specific to Avro -- it's just as big of an
> > >> issue with Writables.
> > >>
> > >> I'm kind of torn on the idea of building this into MemPipeline. On the
> > >> one hand, the closer the behavior of different pipeline
> > >> implementations is, the better, and I think MemPipeline is currently
> > >> indeed largely used for fast testing of MR workflows.
> > >>
> > >> On the other hand, the MemPipeline can also be seen as an optimized
> > >> implementation for running a pipeline faster if everything fits in
> > >> memory, so adding in extra serialization/deserialization just to be
> > >> similar to the MRPipeline would be unfortunate. Additionally,
> > >> there's also the SparkPipeline implementation to consider -- I assume
> > >> that Spark doesn't have this same object reuse situation, although I'm
> > >> not actually sure.
> > >>
> > >> I think that my general feeling is that I'd rather not add object
> > >> reuse into MemPipeline, but I do think it would be great if we could
> > >> bring the behavior of the two implementations in line. Any other
> > >> thoughts on how we could do this?
> > >>
> > >> - Gabriel
> > >>
> > >>
> > >> On Wed, Jun 4, 2014 at 11:51 AM, David Whiting <davidwhiting@gmail.com>
> > >> wrote:
> > >> > We are facing some problems where errors that happen on the cluster
> > >> > are not found by MemPipeline tests, due to the way Avro reuses
> > >> > objects for reading: if you do a parallelDo and a PGroupedTable,
> > >> > then the Iterable of values you get is actually the same object
> > >> > returned to you with different contents. If you then try to store
> > >> > certain instances of this to be emitted later, you will get
> > >> > unexpected results unless you take a copy.
> > >> >
> > >> > To try and catch these problems on the local side, we've written a
> > >> > wrapper for Iterable which uses the SpecificDatumWriter and
> > >> > SpecificDatumReader to write to and from ByteBuffers for each
> > >> > iteration, offering the last record for reuse. This allows us to
> > >> > run the offending MapFn or DoFn in isolation with a wrapped
> > >> > Iterable as input to identify and fix the problem.
> > >> >
> > >> > This, however, is a little awkward and it would be much neater if
> > >> > this were driven from the Crunch side. At the point in the
> > >> > MapShuffler where the Iterable is wrapped in a SingleUseIterable,
> > >> > it could also be wrapped in one of these
> > >> > AvroReadSimulatingIterables, meaning that people could find these
> > >> > problems before even running them live for the first time. However,
> > >> > this is a problem that is specific to Avro so it obviously makes no
> > >> > sense to hack it right into HFunction.
> > >> >
> > >> > Anyone have any thoughts on how this could be integrated nicely, or
> > >> > indeed if it should be integrated at all?
> > >> >
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
>
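
For readers following the thread, here is a minimal sketch of the object
reuse problem and of a reuse-simulating Iterable wrapper like the one
described in the original message. It is not the patch discussed above.
To stay dependency-free it copies a field into a shared instance instead
of round-tripping through SpecificDatumWriter and SpecificDatumReader,
and every name in it (ReuseSimulation, Rec, ReusingIterable,
simulateReuse, collectWithoutCopy, collectWithCopy) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

public class ReuseSimulation {

    /** Hypothetical mutable record standing in for an Avro-generated class. */
    static final class Rec {
        String name;
        Rec(String name) { this.name = name; }
    }

    /**
     * Wraps an Iterable so that each iterator hands back ONE shared instance,
     * overwritten on every next() with the contents of the next source element.
     * This imitates the reuse behaviour described in the thread; a real version
     * would serialize and deserialize the record instead of copying fields.
     */
    static final class ReusingIterable<T> implements Iterable<T> {
        private final Iterable<T> source;
        private final Supplier<T> factory;       // creates the shared instance
        private final BiConsumer<T, T> copyInto; // (from, to) field copy

        ReusingIterable(Iterable<T> source, Supplier<T> factory, BiConsumer<T, T> copyInto) {
            this.source = source;
            this.factory = factory;
            this.copyInto = copyInto;
        }

        @Override
        public Iterator<T> iterator() {
            final Iterator<T> it = source.iterator();
            final T shared = factory.get();
            return new Iterator<T>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public T next() {
                    copyInto.accept(it.next(), shared);
                    return shared; // same reference every time
                }
            };
        }
    }

    /** Convenience factory for the Rec type used in this sketch. */
    static Iterable<Rec> simulateReuse(List<Rec> input) {
        return new ReusingIterable<>(input, () -> new Rec(null), (from, to) -> to.name = from.name);
    }

    /** Buggy pattern: holds on to the references handed out by the iterator. */
    static List<String> collectWithoutCopy(Iterable<Rec> values) {
        List<Rec> held = new ArrayList<>();
        for (Rec r : values) {
            held.add(r);
        }
        List<String> names = new ArrayList<>();
        for (Rec r : held) {
            names.add(r.name);
        }
        return names;
    }

    /** Fixed pattern: takes a copy of each value before storing it. */
    static List<String> collectWithCopy(Iterable<Rec> values) {
        List<Rec> held = new ArrayList<>();
        for (Rec r : values) {
            held.add(new Rec(r.name));
        }
        List<String> names = new ArrayList<>();
        for (Rec r : held) {
            names.add(r.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(new Rec("a"), new Rec("b"), new Rec("c"));
        Iterable<Rec> reused = simulateReuse(input);
        System.out.println(collectWithoutCopy(reused)); // prints [c, c, c]
        System.out.println(collectWithCopy(reused));    // prints [a, b, c]
    }
}
```

Feeding a wrapped Iterable like this to the function under test makes the
aliasing visible locally: code that stores values without copying sees only
the last record's contents.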