Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm
Reply-To: dev@asterixdb.incubator.apache.org
MIME-Version: 1.0
References: <5723E463.4060704@gmail.com>
Date: Sat, 30 Apr 2016 15:25:58 -0700
Subject: Re: Questions of building record in AsterixDB
From: Xikui Wang
To: dev@asterixdb.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a11c00b40578aaf0531bb3ebd

--001a11c00b40578aaf0531bb3ebd
Content-Type: text/plain; charset=UTF-8

Hi Abdullah,

Actually, I also have the concern that adding a null check for general
cases will bring extra overhead. Thus I plan to add the checking
procedure after the parser but before addTuple, i.e., in
FeedRecordDataFlowController. But based on what I have seen so far, it
seems RecordType is transparent to FeedRecordDataFlowController. So I am
still investigating that...

I saw the null check in the ADM parser. That's actually a viable way to
handle it within the parser scope. But I am looking for a slightly
different solution. From my perspective, the ADM parser assumes the input
ADM should conform to the dataset definition, so it's reasonable for it
to throw an exception. For Tweetparser, if I see a null value on a
non-null attribute, I will discard the whole tweet directly, and may not
even log it (as there are too many tweets with nulls). That's the reason
why I want to put that in FeedRecordDataFlowController, since I didn't
see a good way to prevent a record insert in the parser other than
throwing an exception.

Not sure whether my opinion makes sense or not. Feel free to comment. :)

Best,
Xikui

On Sat, Apr 30, 2016 at 1:52 PM, abdullah alamoudi wrote:

> Adding a few points here:
>
> My feeling is SerializerDeserializer offers another level of abstraction
> but with output I can write value directly without construct AType object.
> I am wondering if there are any preferences over these two?
>
> - Using the SerializerDeserializer option, you will only create a single
>   object regardless of the number of parsed records, so I wouldn't worry
>   about it. Code maintainability takes precedence here IMO.
> - In addition to records and lists, UTF8StringSerializerDeserializer can
>   be stateful for the same reason (avoid creating lots of unneeded
>   objects). In fact, our parsers use the stateful
>   UTF8StringSerializerDeserializer since I noticed that using the
>   stateless one creates lots of byte[] and triggers GC over and over.
> - Right now, we parse missing values as null. Should that change?
> - There is definitely a check for nulls on non-nullable values at least
>   in the ADM parser. There might be a bug, however, that makes it accept
>   explicit null values, and that should be fixed.
>
> I am for NOT using the cast record solution because of the overhead it
> will add, but that is just me :)
> ~Abdullah.
>
> On Sat, Apr 30, 2016 at 6:48 AM, Xikui Wang wrote:
>
>> Thank you Yingyi. I will try to figure out a solution from that
>> direction.
>>
>> Best,
>> Xikui
>>
>> On Fri, Apr 29, 2016 at 3:48 PM, Yingyi Bu wrote:
>>
>>> Yeah, I think so:-)
>>>
>>> Best,
>>> Yingyi
>>>
>>> On Fri, Apr 29, 2016 at 3:46 PM, Mike Carey wrote:
>>>
>>>> This indeed might be cleaner?
>>>>
>>>> On 4/29/16 3:28 PM, Yingyi Bu wrote:
>>>>
>>>>>> I'm guessing that you can do similar things to CastRecordDescriptor
>>>>>> if you want to handle general cases in that region.
>>>>>
>>>>> Or, you can inject a cast-record function in the loading pipeline
>>>>> so that you can defer the runtime-type-check/cast to that function
>>>>> instead of doing that in the parser.
>>>>>
>>>>> On Fri, Apr 29, 2016 at 3:25 PM, Yingyi Bu wrote:
>>>>>
>>>>>> My answer is inlined.
>>>>>>
>>>>>>> My feeling is SerializerDeserializer offers another level of
>>>>>>> abstraction but with output I can write value directly without
>>>>>>> construct AType object.
>>>>>>> I am wondering if there are any preferences over these two?
>>>>>>
>>>>>> I agree with you. However, a SerializerDeserializer has to be
>>>>>> stateless, hence it cannot be used at runtime for complex type
>>>>>> objects such as records and lists, because it will create a lot
>>>>>> of Java objects.
>>>>>>
>>>>>>> in other words, parser has to guarantee that the processed
>>>>>>> records has to match the dataset definition(non-optional
>>>>>>> attribute cannot have null value). I tried to assign null value
>>>>>>> to non-null attributes. It will be inserted successfully but
>>>>>>> read records will have problem.
>>>>>>
>>>>>> That sounds right to me. Please file a JIRA issue and assign to
>>>>>> you (if you're working on that).
>>>>>> I'm guessing that you can do similar things to CastRecordDescriptor
>>>>>> if you want to handle general cases in that region.
>>>>>>
>>>>>>> 3. Set to null or skip
>>>>>>> For optional(nullable) attributes, if I want to insert a record
>>>>>>> with null value on that attribute. Should I assign null value or
>>>>>>> should I just skip it? (Probably this is related to the missing
>>>>>>> attribute that Yingyi mentioned today?)
>>>>>>
>>>>>> Assign null value.
>>>>>> Missing means the field doesn't exist in a record at all.
>>>>>>
>>>>>> Best,
>>>>>> Yingyi
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 2:06 PM, Xikui Wang wrote:
>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> I came across several questions while I was constructing records
>>>>>>> in AsterixDB. Hope someone can help me clear the confusion. :)
>>>>>>>
>>>>>>> 1. Write directly to data output or use SerializerDeserializer
>>>>>>> I am working with AbstractDataParser now. I see people using
>>>>>>> different ways to append attributes to data output. Either use:
>>>>>>> output.Write(typetag.serialize());
>>>>>>> output.WriteInt(0);
>>>>>>> to write into data output directly, or
>>>>>>> use AInt8SerializerDeserializer.serialize(int8Serde) to serialize
>>>>>>> a AINT8 instance to output.
>>>>>>> *SerializerDeserializer uses writeByte to write output.
>>>>>>>
>>>>>>> My feeling is SerializerDeserializer offers another level of
>>>>>>> abstraction but with output I can write value directly without
>>>>>>> construct AType object.
>>>>>>> I am wondering if there are any preferences over these two?
>>>>>>>
>>>>>>> 2. RecordType validation after parser but before add to frame?
>>>>>>> My observation is after parser finish writing the output and pass
>>>>>>> it to next level, there is no such validation that checks whether
>>>>>>> non-optional field is null or not. In other words, parser has to
>>>>>>> guarantee that the processed records has to match the dataset
>>>>>>> definition(non-optional attribute cannot have null value). I
>>>>>>> tried to assign null value to non-null attributes. It will be
>>>>>>> inserted successfully but read records will have problem.
>>>>>>>
>>>>>>> 3. Set to null or skip
>>>>>>> For optional(nullable) attributes, if I want to insert a record
>>>>>>> with null value on that attribute. Should I assign null value or
>>>>>>> should I just skip it? (Probably this is related to the missing
>>>>>>> attribute that Yingyi mentioned today?)
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xikui

--001a11c00b40578aaf0531bb3ebd--