asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Questions of building record in AsterixDB
Date Mon, 02 May 2016 14:05:07 GMT
Agreed.  Let's maximize correctness and minimize code, and thereby avoid additional room
for problems, the need for corresponding tests, and dual maintenance of principles.  We
can then shoot to reduce object generation while still keeping two stages of types (initial and
cast), and any improvements to casting efficiency, within the casting code itself, will benefit
all data-adding code paths.  Then we can measure performance.  In the early days of Hyracks
and AsterixDB, Vinayak was a stickler for avoiding premature optimizations that had
software engineering and/or complexity impacts ... Let's try to keep that philosophy alive!

Sent from my iPad

On May 1, 2016, at 2:38 PM, Yingyi Bu <buyingyi@gmail.com> wrote:

>>> - Using The SerializerDeserializer option, you will only create a single
>>> object regardless of the number of parsed records, so I wouldn't worry
>>> about it. Code maintainability takes precedence here IMO.
> 
> For scalar types, you're right.
> But for complex types, there are objects created in each
> serialize/deserialize call:
> 
> https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/ARecordSerializerDeserializer.java#L176
> https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/AOrderedListSerializerDeserializer.java#L106
> https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/AUnorderedListSerializerDeserializer.java#L107
> 
>>> - Right now, we parse missing values as null. Should that change?
> I'll take care of that in my change:
> -- if missing is in the closed part, it becomes null;
> -- if missing is in the open part,  it does not exist in the record.
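
The two missing-value rules above can be sketched in plain Java. This is only an illustration under stated assumptions: `MissingFieldPolicy`, `normalize`, and the use of a `Map` to stand in for a record are hypothetical names, not AsterixDB classes and not the actual parser change.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not AsterixDB's parser or type classes.
final class MissingFieldPolicy {

    /**
     * Builds the record to store from the fields actually present in the input.
     * Declared (closed) fields absent from the input are stored as explicit null;
     * undeclared (open) fields appear only if the input contained them.
     */
    static Map<String, Object> normalize(List<String> closedFields, Map<String, Object> parsed) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (String name : closedFields) {
            // get() returns null for an absent key, so a missing closed field becomes null
            record.put(name, parsed.get(name));
        }
        for (Map.Entry<String, Object> e : parsed.entrySet()) {
            if (!closedFields.contains(e.getKey())) {
                // open fields are copied only when present; a missing one is simply not there
                record.put(e.getKey(), e.getValue());
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("id", 1);
        parsed.put("lang", "en"); // open field, present in the input
        // "text" is declared but missing, so it is stored as null
        System.out.println(normalize(List.of("id", "text"), parsed));
    }
}
```
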
> 
>>> I am for NOT using the cast record solution for the overhead it will add.
>>> but that is just me :)
> Yes, there would be some overhead if we parse every record as a type-less
> open record first.
> But I think correctness is more important. I'm not sure whether the current
> AdmParser can handle all general type-castable cases.
> 
> Best,
> Yingyi
> 
> 
>> On Sat, Apr 30, 2016 at 6:10 PM, Xikui Wang <xikuiw@uci.edu> wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for pointing that out. I think I misunderstood the working mechanism
>> and misused the terms of 'dataset' and 'datatype'. Sorry about that.
>> 
>> Best,
>> Xikui
>> 
>>> On Sat, Apr 30, 2016 at 4:30 PM, Mike Carey <dtabass@gmail.com> wrote:
>>> 
>>> One nit:  This has nothing to do with any dataset definition, on the parser
>>> side of things - it's the type parameter on the create feed DDL statement
>>> that should be the parser's guide.  (In general the optional function on
>>> the feed may change the type by the time the data reaches a dataset.)
>>>> On Apr 30, 2016 3:26 PM, "Xikui Wang" <xikuiw@uci.edu> wrote:
>>>> 
>>>> Hi Abdullah,
>>>> 
>>>> Actually I also have the concern that adding a null check for general cases
>>>> will bring extra overhead. Thus I plan to add the checking procedure after the
>>>> parser, but before addTuple, i.e. in FeedRecordDataFlowController. But based on
>>>> what I have seen so far, it seems RecordType is transparent to
>>>> FeedRecordDataFlowController. So I am still investigating that...
>>>> 
>>>> I saw the null check in the ADM parser. That's actually a viable way to handle
>>>> that within the parser scope. But I am looking for a slightly different
>>>> solution. From my perspective, the ADM parser assumes the input ADM should
>>>> conform to the dataset definition. Thus it's reasonable for it to throw an
>>>> exception. For Tweetparser, if I see a null value on a non-null attribute, I
>>>> will discard the whole tweet directly, and may not even log it (as there are
>>>> too many tweets with nulls). That's the reason why I want to put that in
>>>> FeedRecordDataFlowController, since I didn't see a good way to prevent a
>>>> record insert in the parser except by throwing an exception.
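
A minimal Java sketch of the "check after parsing, drop quietly instead of throwing" idea described above. Everything here is an assumption made for illustration: `NonNullFilter`, its methods, and the `Map`-based record are hypothetical and do not reflect the real FeedRecordDataFlowController or its addTuple call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; it does not use or mirror the FeedRecordDataFlowController API.
final class NonNullFilter {
    private final Set<String> requiredFields; // assumed to be derived from the feed's declared type
    private long dropped;

    NonNullFilter(Set<String> requiredFields) {
        this.requiredFields = requiredFields;
    }

    /** Returns true when every required field is present with a non-null value. */
    boolean accepts(Map<String, Object> record) {
        for (String name : requiredFields) {
            if (record.get(name) == null) {
                return false;
            }
        }
        return true;
    }

    /** Keeps accepted records; rejected ones are counted and dropped without an exception. */
    List<Map<String, Object>> filter(List<Map<String, Object>> parsed) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> record : parsed) {
            if (accepts(record)) {
                kept.add(record); // stands in for handing the record to the next stage
            } else {
                dropped++;
            }
        }
        return kept;
    }

    long droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        Map<String, Object> good = new HashMap<>();
        good.put("id", 1L);
        good.put("text", "hello");
        Map<String, Object> bad = new HashMap<>();
        bad.put("id", 2L);
        bad.put("text", null); // null on a required field
        NonNullFilter f = new NonNullFilter(Set.of("id", "text"));
        System.out.println(f.filter(List.of(good, bad)).size() + " kept, " + f.droppedCount() + " dropped");
    }
}
```

Counting the drops rather than logging each one matches the stated wish to avoid flooding the log when many tweets carry nulls.
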
>>>> 
>>>> Not sure my opinion makes sense or not. Feel free to comment. :)
>>>> 
>>>> Best,
>>>> Xikui
>>>> 
>>>> On Sat, Apr 30, 2016 at 1:52 PM, abdullah alamoudi <bamousaa@gmail.com> wrote:
>>>> 
>>>>> Adding a few points here:
>>>>> 
>>>>> My feeling is SerializerDeserializer offers another level of abstraction,
>>>>> but with output I can write a value directly without constructing an AType
>>>>> object. I am wondering if there is any preference between these two?
>>>>> 
>>>>> - Using the SerializerDeserializer option, you will only create a single
>>>>> object regardless of the number of parsed records, so I wouldn't worry
>>>>> about it. Code maintainability takes precedence here IMO.
>>>>> - In addition to records and lists, UTF8StringSerializerDeserializer can be
>>>>> stateful for the same reason (to avoid creating lots of unneeded objects). In
>>>>> fact, our parsers use the stateful UTF8StringSerializerDeserializer since I
>>>>> noticed that using the stateless one creates lots of byte[] and triggers GC
>>>>> over and over.
>>>>> - Right now, we parse missing values as null. Should that change?
>>>>> - There is definitely a check for nulls on non-nullable values, at least in
>>>>> the ADM parser. There might be a bug, however, that makes it accept explicit
>>>>> null values, and that should be fixed.
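
The stateless-versus-stateful point in the list above can be illustrated with a generic Java sketch. It is not AsterixDB's UTF8StringSerializerDeserializer and does not reproduce its wire format; `ReusableUtf8Writer` and the 4-byte length prefix are assumptions chosen for the example. A stateless version would call `s.getBytes(StandardCharsets.UTF_8)` and allocate a fresh byte[] per string, whereas this one keeps an encoder and a scratch buffer between calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical class; it only illustrates buffer reuse, not AsterixDB's serializer.
final class ReusableUtf8Writer {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private ByteBuffer scratch = ByteBuffer.allocate(64); // grown on demand, then reused

    /** Writes a 4-byte length followed by the UTF-8 bytes of s (format assumed for the example). */
    void write(String s, DataOutputStream out) throws IOException {
        int maxBytes = (int) Math.ceil(s.length() * (double) encoder.maxBytesPerChar());
        if (scratch.capacity() < maxBytes) {
            scratch = ByteBuffer.allocate(maxBytes); // only reallocates for a longer string
        }
        scratch.clear();
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), scratch, true);
        if (result.isError()) {
            result.throwException(); // malformed input such as a lone surrogate
        }
        encoder.flush(scratch);
        out.writeInt(scratch.position());
        out.write(scratch.array(), 0, scratch.position());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        ReusableUtf8Writer writer = new ReusableUtf8Writer();
        for (String s : new String[] {"hi", "AsterixDB"}) {
            writer.write(s, out); // same scratch buffer serves both calls
        }
        out.flush();
        System.out.println(bytes.size() + " bytes written");
    }
}
```

`CharBuffer.wrap` still creates a small wrapper object per call, so this reduces rather than eliminates allocation; the saving is the per-string byte[].
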
>>>>> 
>>>>> I am for NOT using the cast record solution for the overhead it will add.
>>>>> but that is just me :)
>>>>> ~Abdullah.
>>>>> 
>>>>> 
>>>>>> On Sat, Apr 30, 2016 at 6:48 AM, Xikui Wang <xikuiw@uci.edu> wrote:
>>>>>> 
>>>>>> Thank you Yingyi. I will try to figure out a solution from that direction.
>>>>>> 
>>>>>> Best,
>>>>>> Xikui
>>>>>> 
>>>>>> On Fri, Apr 29, 2016 at 3:48 PM, Yingyi Bu <buyingyi@gmail.com> wrote:
>>>>>> 
>>>>>>> Yeah, I think so:-)
>>>>>>> 
>>>>>>> Best,
>>>>>>> Yingyi
>>>>>>> 
>>>>>>> On Fri, Apr 29, 2016 at 3:46 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>>>>> 
>>>>>>>> This indeed might be cleaner?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 4/29/16 3:28 PM, Yingyi Bu wrote:
>>>>>>>>> 
>>>>>>>>>>> I'm guessing that you can do similar things to CastRecordDescriptor
>>>>>>>>>>> if you want to handle general cases in that region.
>>>>>>>>> 
>>>>>>>>> Or, you can inject a cast-record function in the loading pipeline
>>>>>>>>> so that you can defer the runtime-type-check/cast to that function
>>>>>>>>> instead of doing that in the parser.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Apr 29, 2016 at 3:25 PM, Yingyi Bu <buyingyi@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> My answer is inlined.
>>>>>>>>>> 
>>>>>>>>>>> My feeling is SerializerDeserializer offers another level of abstraction,
>>>>>>>>>>> but with output I can write a value directly without constructing an AType
>>>>>>>>>>> object. I am wondering if there is any preference between these two?
>>>>>>>>>> 
>>>>>>>>>> I agree with you. However, a SerializerDeserializer has to be stateless,
>>>>>>>>>> hence it cannot be used at runtime for complex type objects such as
>>>>>>>>>> records and lists, because it will create a lot of Java objects.
>>>>>>>>>> 
>>>>>>>>>>> in other words, the parser has to guarantee that the processed records
>>>>>>>>>>> match the dataset definition (a non-optional attribute cannot have a null
>>>>>>>>>>> value). I tried to assign a null value to non-null attributes. The record
>>>>>>>>>>> is inserted successfully, but reading the records will have problems.
>>>>>>>>>> 
>>>>>>>>>> That sounds right to me.  Please file a JIRA issue and assign it to
>>>>>>>>>> yourself (if you're working on that).
>>>>>>>>>> I'm guessing that you can do similar things to CastRecordDescriptor
>>>>>>>>>> if you want to handle general cases in that region.
>>>>>>>>>> 
>>>>>>>>>>> 3. Set to null or skip
>>>>>>>>>>> For optional (nullable) attributes, if I want to insert a record with a
>>>>>>>>>>> null value on that attribute, should I assign a null value or should I
>>>>>>>>>>> just skip it? (Probably this is related to the missing attribute that
>>>>>>>>>>> Yingyi mentioned today?)
>>>>>>>>>> 
>>>>>>>>>> Assign null value.
>>>>>>>>>> Missing means the field doesn't exist in a record at all.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Yingyi
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Apr 29, 2016 at 2:06 PM, Xikui Wang <xikuiw@uci.edu> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi devs,
>>>>>>>>>>> 
>>>>>>>>>>> I came across several questions while I was constructing records in
>>>>>>>>>>> AsterixDB.  Hope someone can help me clear up the confusion. :)
>>>>>>>>>>> 
>>>>>>>>>>> 1. Write directly to data output or use SerializerDeserializer
>>>>>>>>>>> I am working with AbstractDataParser now. I see people using different
>>>>>>>>>>> ways to append attributes to the data output. Either use:
>>>>>>>>>>> output.write(typetag.serialize());
>>>>>>>>>>> output.writeInt(0);
>>>>>>>>>>> to write into the data output directly, or
>>>>>>>>>>> use AInt8SerializerDeserializer.serialize(int8Serde) to serialize an
>>>>>>>>>>> AINT8 instance to the output. *SerializerDeserializer uses writeByte to
>>>>>>>>>>> write output.
>>>>>>>>>>> 
>>>>>>>>>>> My feeling is SerializerDeserializer offers another level of abstraction,
>>>>>>>>>>> but with output I can write a value directly without constructing an AType
>>>>>>>>>>> object. I am wondering if there is any preference between these two?
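
The two styles in question 1 can be contrasted with a self-contained Java sketch built on `java.io.DataOutput`. The names `Int8WriteDemo`, `Int8Value`, `Int8Serializer`, and the tag value `1` are hypothetical stand-ins chosen for the example; they are not AsterixDB's classes, and whether the real serializer writes a type tag is not assumed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-ins; the tag value and class names are illustrative only.
final class Int8WriteDemo {
    static final byte TAG_INT8 = 1; // assumed tag value, not AsterixDB's

    /** Option A: write the tag and payload straight to the output, no value object. */
    static void writeDirect(byte value, DataOutput out) throws IOException {
        out.writeByte(TAG_INT8);
        out.writeByte(value);
    }

    /** Option B: wrap the value in an object and hand it to a serializer. */
    static final class Int8Value {
        final byte value;

        Int8Value(byte value) {
            this.value = value;
        }
    }

    static final class Int8Serializer {
        void serialize(Int8Value v, DataOutput out) throws IOException {
            out.writeByte(TAG_INT8);
            out.writeByte(v.value);
        }
    }

    /** Runs one of the two options and returns the bytes it produced. */
    static byte[] capture(boolean direct, byte value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        if (direct) {
            writeDirect(value, out);
        } else {
            new Int8Serializer().serialize(new Int8Value(value), out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Both options emit the same bytes; they differ in abstraction and allocation.
        System.out.println(Arrays.equals(capture(true, (byte) 7), capture(false, (byte) 7)));
    }
}
```

In this sketch the serializer path keeps the byte layout in one place at the cost of one extra object per value, which is the trade-off the question is asking about.
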
>>>>>>>>>>> 
>>>>>>>>>>> 2. RecordType validation after parser but before add to frame?
>>>>>>>>>>> My observation is that after the parser finishes writing the output and
>>>>>>>>>>> passes it to the next level, there is no validation that checks whether a
>>>>>>>>>>> non-optional field is null or not. In other words, the parser has to
>>>>>>>>>>> guarantee that the processed records match the dataset definition (a
>>>>>>>>>>> non-optional attribute cannot have a null value). I tried to assign a null
>>>>>>>>>>> value to non-null attributes. The record is inserted successfully, but
>>>>>>>>>>> reading the records will have problems.
>>>>>>>>>>> 
>>>>>>>>>>> 3. Set to null or skip
>>>>>>>>>>> For optional (nullable) attributes, if I want to insert a record with a
>>>>>>>>>>> null value on that attribute, should I assign a null value or should I
>>>>>>>>>>> just skip it? (Probably this is related to the missing attribute that
>>>>>>>>>>> Yingyi mentioned today?)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Xikui
