drill-dev mailing list archives

From Stefán Baxter <ste...@activitystream.com>
Subject Re: Avro deserialization bug - 1.3-SNAPSHOT
Date Fri, 13 Nov 2015 22:58:27 GMT
So,

Could someone point me to the appropriate place in the Drill code to start
investigating this? (We would love to contribute, but getting up to speed is
a bit much.)

I realize that there are many good things happening and that v. 1.3 is
around the corner but it seems that I incorrectly assumed that data
corruption issues would get a higher priority or that I would, at the very
least, get someone to confirm such a bug.

We are now impeded by this after having moved all our logging from JSON to
Avro to avoid the schema-related problems we have been running into with
the JSON reader (a null is interpreted as a double, and the read fails when
a string eventually comes along).

- Stefan


On Wed, Nov 11, 2015 at 10:14 PM, Stefán Baxter <stefan@activitystream.com>
wrote:

> Hi,
>
> Can someone please verify that this is in fact a bug so I can rule out our
> own mistakes?
>
> We have recently moved all our logging to Avro to compensate for schema
> differences in JSON that were causing various problems, and our latest
> release is now impeded by this.
> Alternatively, can someone please point me in the right direction if I were
> to try to fix this myself?
>
> Regards,
>   -Stefán
>
> On Tue, Nov 10, 2015 at 2:41 PM, Stefán Baxter <stefan@activitystream.com>
> wrote:
>
>> Thank you Kamesh.
>>
>> I have created https://issues.apache.org/jira/browse/DRILL-4056 with the
>> description.
>> I will send you a confidential test file to your private email.
>>
>> Regards,
>>  -Stefan
>>
>> On Tue, Nov 10, 2015 at 2:30 PM, Kamesh <kamesh.hadoop@gmail.com> wrote:
>>
>>> Hi Stefán,
>>>  Could you please raise a Jira with a sample schema and sample input to
>>> reproduce it? I will look into this.
>>>
>>> On Tue, Nov 10, 2015 at 7:55 PM, Stefán Baxter <stefan@activitystream.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I have an Avro file that supports the following data/schema:
>>> >
>>> > {"field":"some", "classification":{"variant":"Gæst"}}
>>> >
>>> > When I select 10 rows from this file I get:
>>> >
>>> > +---------------------+
>>> > |       EXPR$0        |
>>> > +---------------------+
>>> > | Gæst                |
>>> > | Voksen              |
>>> > | Voksen              |
>>> > | Invitation KIF KBH  |
>>> > | Invitation KIF KBH  |
>>> > | Ordinarie pris KBH  |
>>> > | Ordinarie pris KBH  |
>>> > | Biljetter 200 krBH  |
>>> > | Biljetter 200 krBH  |
>>> > | Biljetter 200 krBH  |
>>> > +---------------------+
>>> >
>>> > The bug is that the field values are incorrectly de-serialized and the
>>> > value from the previous row is retained if the subsequent row is
>>> > shorter.
>>> >
>>> > The sql query:
>>> >
>>> > "select s.classification.variant variant from dfs.<some> as s limit 10;"
>>> >
>>> >
>>> > That way "Ordinarie pris" becomes "Ordinarie pris KBH" because the
>>> > previous row had the value "Invitation KIF KBH".
>>> >
>>> > Regards,
>>> >   -Stefán
>>> >
>>>
>>>
>>>
>>> --
>>> Kamesh.
>>>
>>
>>
>
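
The symptom described in the thread (a shorter value keeping the tail of the
previous, longer one) is what one would expect if a reader reused a byte
buffer across rows but reported a stale length. The Java sketch below only
illustrates that general pattern. It is not taken from Drill's Avro reader;
the class and method names are hypothetical, and the true value of the
"Biljetter" rows is assumed to be "Biljetter 200 kr".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a stale-length bug on a reused buffer.
// This is NOT Drill's actual code; all names here are made up.
class StaleBufferSketch {
    private final byte[] buffer = new byte[64]; // reused for every row
    private int highWaterMark = 0;              // longest value seen so far

    // Buggy variant: copies the new value into the shared buffer but
    // decodes using the longest length seen so far, so bytes left over
    // from an earlier, longer value leak into the result.
    String readBuggy(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(bytes, 0, buffer, 0, bytes.length);
        highWaterMark = Math.max(highWaterMark, bytes.length);
        return new String(buffer, 0, highWaterMark, StandardCharsets.UTF_8);
    }

    // Fixed variant: decodes exactly the number of bytes in the current value.
    String readFixed(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(bytes, 0, buffer, 0, bytes.length);
        return new String(buffer, 0, bytes.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed true values behind the rows shown in the thread.
        String[] rows = {"Invitation KIF KBH", "Ordinarie pris", "Biljetter 200 kr"};

        StaleBufferSketch buggy = new StaleBufferSketch();
        StaleBufferSketch fixed = new StaleBufferSketch();
        for (String row : rows) {
            System.out.println("buggy: [" + buggy.readBuggy(row) + "]"
                    + "  fixed: [" + fixed.readFixed(row) + "]");
        }
    }
}
```

With these three inputs the buggy variant yields "Ordinarie pris KBH" and
"Biljetter 200 krBH", matching the corrupted rows in the query output, while
the fixed variant returns each value unchanged. That agreement is suggestive
only; where the actual defect sits in Drill would still have to be confirmed
against DRILL-4056.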
