Date: Wed, 4 Jun 2014 13:27:33 +0200
Subject: Re: Simulating Avro reads in MemPipeline
From: Gabriel Reid <gabriel.reid@gmail.com>
To: dev@crunch.apache.org

Hi David,

Yeah, the whole object-reuse situation in MapReduce is quite a drag. By the way, this isn't specific to Avro -- it's just as big an issue with Writables.

I'm kind of torn on the idea of building this into MemPipeline. On the one hand, the closer the behavior of the different pipeline implementations is, the better, and I think MemPipeline is currently indeed largely used for fast testing of MR workflows. On the other hand, MemPipeline can also be seen as an optimized implementation for running a pipeline faster when everything fits in memory, so adding extra serialization/deserialization just to mimic MRPipeline would be unfortunate. Additionally, there's also the SparkPipeline implementation to consider -- I assume that Spark doesn't have this same object-reuse behavior, although I'm not actually sure.

My general feeling is that I'd rather not add object reuse into MemPipeline, but I do think it would be great if we could bring the behavior of the two implementations in line. Any other thoughts on how we could do this?

- Gabriel

On Wed, Jun 4, 2014 at 11:51 AM, David Whiting wrote:
> We are facing some problems where errors that do happen on the cluster are
> not caught by MemPipeline tests, due to the way Avro reuses objects for
> reading: if you do a parallelDo on a PGroupedTable, each value in the
> Iterable you get is actually the same object, returned to you with
> different contents each time.
> If you then try to store particular instances of this to be emitted
> later, you will get unexpected results unless you take a copy.
>
> To catch these problems locally, we've written a wrapper for Iterable
> which uses the SpecificDatumWriter and SpecificDatumReader to write to
> and read from ByteBuffers on each iteration, offering the last record
> for reuse. This allows us to run the offending MapFn or DoFn in
> isolation with a wrapped Iterable as input, to identify and fix the
> problem.
>
> This, however, is a little awkward, and it would be much neater if this
> were driven from the Crunch side. At the point in the MapShuffler where
> the Iterable is wrapped in a SingleUseIterable, it could also be wrapped
> in one of these AvroReadSimulatingIterables, meaning that people could
> find these problems before ever running them live for the first time.
> However, this is a problem that is specific to Avro, so it obviously
> makes no sense to hack it right into HFunction.
>
> Anyone have any thoughts on how this could be integrated nicely, or
> indeed if it should be integrated at all?
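For anyone following along, a minimal sketch of the kind of wrapper being discussed: an Iterable whose iterator hands back the same instance on every next() call, overwritten with the next element's contents, so a DoFn that stashes references without copying fails in an in-memory test just as it would on the cluster. In the real Avro version the copy step would round-trip through SpecificDatumWriter/SpecificDatumReader; here a caller-supplied copier keeps the sketch dependency-free. The class and parameter names below are hypothetical, not actual Crunch API.

```java
import java.util.Iterator;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Reproduces MapReduce-style object reuse in memory: next() always
// returns the SAME instance, refilled with the next element's contents.
public final class ReuseSimulatingIterable<T> implements Iterable<T> {

  private final Iterable<T> delegate;
  private final Supplier<T> factory;        // creates the single reused instance
  private final BiConsumer<T, T> copyInto;  // (source, target): overwrite target

  public ReuseSimulatingIterable(Iterable<T> delegate,
                                 Supplier<T> factory,
                                 BiConsumer<T, T> copyInto) {
    this.delegate = delegate;
    this.factory = factory;
    this.copyInto = copyInto;
  }

  @Override
  public Iterator<T> iterator() {
    final Iterator<T> it = delegate.iterator();
    final T reused = factory.get();
    return new Iterator<T>() {
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public T next() {
        copyInto.accept(it.next(), reused);  // overwrite the shared instance
        return reused;                        // same object on every call
      }
      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }
}
```

Stashing elements from such an Iterable leaves you with several references to one object holding only the last value, which is exactly the surprise this wrapper is meant to surface before code first runs on a cluster.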