drill-issues mailing list archives

From "Jack Crawford (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2616) strings loaded incorrectly from parquet files
Date Thu, 02 Apr 2015 03:34:53 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392076#comment-14392076 ]

Jack Crawford commented on DRILL-2616:
--------------------------------------

When I query through Drill, certain strings from some rows are repeated far more
often than they appear in the original data. The first 5 rows returned by the example
query below show this in the 'indicator' column. Further into the select * output, the
id column shows it as well: Drill returns only about 3 unique ids, but the actual data
source has many more.

query:
select * from dfs.`indicators.parquet` limit 5;

+------------------------------------------------+---------------------+-----------+------------------------+
| id                                             | timeNanos           | indicator | value                  |
+------------------------------------------------+---------------------+-----------+------------------------+
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear  | -0.0                   |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear  | -4.0612379933691045E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear  | -0.0                   |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear  | -2.6080420511220836E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555461205550000 | distNear  | -0.0                   |
+------------------------------------------------+---------------------+-----------+------------------------+

expected output (verified by loading in spark):
                                            id            timeNanos  indicator     value
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555457827764000   distNear -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555457827764000  smartDiff -0.000406
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555458137319000   distNear -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555458137319000  smartDiff -0.000261
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555461205550000   distNear -0.000000
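
The mismatch between the two outputs can be made explicit with a small check. The
following Python sketch is not part of the original report. It copies the id and
indicator values from the two listings above and reports, per row, which string
columns differ. The names drill_rows, spark_rows, and mismatched_columns are
illustrative only, and the value column is left out because the two tools print
floating-point numbers in different formats.

```python
# String columns compared; numeric columns are skipped on purpose.
STRING_COLUMNS = ("id", "indicator")

DRILL_ID = "generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7"
SPARK_ID = "generated-4458776b-4e22-415e-8fd9-29b687f40dce"

# (id, indicator) pairs copied from the Drill output in this message.
drill_rows = [(DRILL_ID, "distNear")] * 5

# (id, indicator) pairs copied from the Spark output in this message.
spark_rows = [
    (SPARK_ID, "distNear"),
    (SPARK_ID, "smartDiff"),
    (SPARK_ID, "distNear"),
    (SPARK_ID, "smartDiff"),
    (SPARK_ID, "distNear"),
]


def mismatched_columns(actual, expected):
    """Return, for each row pair, the names of the string columns that differ."""
    return [
        [name for name, a, e in zip(STRING_COLUMNS, row_a, row_e) if a != e]
        for row_a, row_e in zip(actual, expected)
    ]


if __name__ == "__main__":
    for i, cols in enumerate(mismatched_columns(drill_rows, spark_rows)):
        print(f"row {i}: differs in {', '.join(cols) or 'nothing'}")
    # Distinct indicator values seen by each tool in these 5 rows.
    print(len({r[1] for r in drill_rows}), "vs", len({r[1] for r in spark_rows}))
```

For these five rows the id differs in every row, and the indicator differs in the
rows where Spark shows smartDiff.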

> strings loaded incorrectly from parquet files
> ---------------------------------------------
>
>                 Key: DRILL-2616
>                 URL: https://issues.apache.org/jira/browse/DRILL-2616
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jack Crawford
>            Assignee: Jason Altekruse
>            Priority: Critical
>              Labels: parquet
>
> When loading string columns from parquet data sources, some rows have their string values replaced with the value from other rows.
> Example parquet for which the problem occurs:
> https://drive.google.com/file/d/0B2JGBdceNMxdeFlJcW1FUElOdXc/view?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
