drill-issues mailing list archives

From "Steven Phillips (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file
Date Mon, 04 May 2015 23:33:06 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527558#comment-14527558 ]

Steven Phillips commented on DRILL-2849:
----------------------------------------

The problem is that we don't really have a good concept of null in text files. In the original
text file, some rows have 7 entries, and some have 8. So when selecting columns[7], we return
null for the rows where it is missing.

The CSV writer, on the other hand, writes the string "null" when it sees a null value. But
since there is no concept of null in text files as far as the reader is concerned, this is
simply returned as the string "null" when the new file is read back.
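
The round trip described above can be sketched in a few lines of Python. This is only a toy
model of the behavior, not Drill's actual reader or writer: the helper names (read_column,
write_rows) and the sample rows are made up for illustration.

```python
import csv
import io

def read_column(fields, index):
    """Return the field at `index`, or None when the row is too short."""
    return fields[index] if index < len(fields) else None

def write_rows(rows, width):
    """Write rows as CSV, emitting the literal text "null" for missing values."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for fields in rows:
        values = [read_column(fields, i) for i in range(width)]
        writer.writerow(["null" if v is None else v for v in values])
    return out.getvalue()

# Hypothetical data: the second row has no third column.
original = [["a", "b", "c"], ["d", "e"]]

# Simulate CTAS (write) followed by a query over the new file (read).
rewritten_text = write_rows(original, width=3)
reread = [row for row in csv.reader(io.StringIO(rewritten_text))]

# Count rows where the third column is a real null, before and after.
nulls_before = sum(read_column(r, 2) is None for r in original)
nulls_after = sum(read_column(r, 2) is None for r in reread)
```

In this model the original data has one row where the third column is null, while the
rewritten data has none: that row now carries the four-character string "null" instead, so a
filter such as "is null" no longer matches it.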

Also significant is the fact that the original CSV file doesn't appear to be valid. I
see lines like this scattered throughout:

1348755809001,/user/ovguide,1363510469000,/user/turtlewax_bot,/m/0jkdvpx,/common/topic/description,"Guests:
Columnist Thomas Friedman, writer David Frum, singer Natalie Mames, Gen. Wesley Clark and
Sen. Barbara Boxer.

Note how there is no closing quotation mark for the string beginning with "Guests". This
causes unexpected behavior because the reader currently treats the quotation mark as part
of the string, and thus faithfully writes it to the new file when doing CTAS. The null value
that comes after it is then assumed by the reader to be part of the same string.
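
To make the effect of the unclosed quotation mark concrete, here is a small Python sketch that
splits one line in two ways: treating the quotation mark as an ordinary character, and treating
it as the start of a quoted field. Both functions and the abbreviated sample line are
hypothetical and only illustrate the two interpretations; neither is Drill's parser.

```python
def split_naive(line):
    """Split on every comma; quotation marks are kept as ordinary characters."""
    return line.split(",")

def split_quote_aware(line):
    """Split on commas outside double quotes.

    An opening quote with no closing quote swallows the rest of the line,
    including any commas and values that follow it.
    """
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

# Made-up line shaped like the invalid rows: a quote opens but never closes,
# and a written-out "null" follows it.
line = '1,/user/x,"Guests: A, B and C.,null'

naive = split_naive(line)
aware = split_quote_aware(line)
```

With the naive split, the quotation mark stays inside the third field and the trailing "null"
is a field of its own. With the quote-aware split, everything after the opening quotation mark,
including ",null", ends up in a single field, which matches the behavior described above.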

> Difference in query results over CSV file created by CTAS, compared to results over original
CSV file 
> ------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2849
>                 URL: https://issues.apache.org/jira/browse/DRILL-2849
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 0.9.0
>         Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: value vectors
should report valid value count | 16.04.2015 @ 13:53:34 EDT
>            Reporter: Khurram Faraaz
>            Assignee: Steven Phillips
>            Priority: Critical
>             Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another CSV data
file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV file that
CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-00000-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as double)),min(cast(columns[0]
as double)),avg(cast(columns[0] as double)), columns[7] from `deletions/deletions-00000-of-00020.csv`
group by columns[7];
> 88 rows selected (6.893 seconds)
> =================================================
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was deletions/deletions-00000-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as double)),min(cast(columns[0]
as double)),avg(cast(columns[0] as double)), columns[7] from `csvToCSV_00000_of_00020/0_0_0.csv`
group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==================================================
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as double)),min(cast(columns[0]
as double)),avg(cast(columns[0] as double)), columns[7] from `deletions-00000-of-00020.csv.gz`
group by columns[7];
> 88 rows selected (10.526 seconds)
> ==================================================
> {code}
> In the cases below, the count, max and avg results are different when the query is executed
over the CSV file that was created by CTAS. (This may explain why we see the difference in
results in the above queries.)
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] as double)),min(cast(columns[0]
as double)),avg(cast(columns[0] as double)), columns[7] from `deletions/deletions-00000-of-00020.csv`
where columns[7] is null group by columns[7];
> +------------+------------+------------+------------+------------+
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> +------------+------------+------------+------------+------------+
> | 252        | 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | null       |
> +------------+------------+------------+------------+------------+
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] as double)),min(cast(columns[0]
as double)),avg(cast(columns[0] as double)), columns[7] from `deletions-00000-of-00020.csv.gz`
where columns[7] is null group by columns[7];
> +------------+------------+------------+------------+------------+
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> +------------+------------+------------+------------+------------+
> | 252        | 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | null       |
> +------------+------------+------------+------------+------------+
> 1 row selected (8.899 seconds)
> {code}
> Notice that the count, max and avg results are different (from those above) when the query
is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] as double)),min(cast(columns[0]
as double)),avg(cast(columns[0] as double)), columns[7] from `csvToCSV_00000_of_00020/0_0_0.csv`
where columns[7] is null group by columns[7];
> +------------+------------+------------+------------+------------+
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> +------------+------------+------------+------------+------------+
> | 245        | 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | null       |
> +------------+------------+------------+------------+------------+
> 1 row selected (5.736 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
