drill-issues mailing list archives

From "Peter McTaggart (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file
Date Tue, 01 Dec 2015 22:43:10 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034772#comment-15034772 ]

Peter McTaggart commented on DRILL-4145:
----------------------------------------

I have posted my storage plugin config for S3 in the 'environment' field.

Basically, I added "csv" to the extensions list in the "csvh" format section. That section has
extractHeader set to true, so the first line is parsed out as the header. (I also tried setting
extractHeader in the "csv" format section, but it didn't seem to work and I didn't pursue it further.)

{noformat}
"csvh": {
  "type": "text",
  "extensions": [ "csvh", "csv" ],
  "extractHeader": true,
  "delimiter": ","
}
{noformat}

I am using the official 1.3.0 release from an Apache mirror site.

{noformat}
0: jdbc:drill:> select * from sys.version;
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+---------------------+----------------------------+
| version  |                 commit_id                 |                   commit_message                    |        commit_time         |     build_email     |         build_time         |
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+---------------------+----------------------------+
| 1.3.0    | cc127ff4ac6272d2cb1b602890c0b7c503ea2062  | [maven-release-plugin] prepare release drill-1.3.0  | 17.11.2015 @ 22:05:19 PST  | jacques@apache.org  | 17.11.2015 @ 22:09:19 PST  |
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+---------------------+----------------------------+
1 row selected (0.975 seconds)
{noformat}

Note: I have 6 files that contain the same type of data and are roughly the same size. (I think
the only difference, apart from the data values, is that the columns may be in a different order
in each file.) Three of these files work fine and three have this problem, which is weird.
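
One way to check the suspected column-order difference is to compare the header rows of a good
file and a bad file. The following is only a minimal sketch, assuming comma-delimited files whose
first line is the header. The helper names header_columns and compare_headers are hypothetical,
and the sample rows in the demo are made up rather than taken from the attached files.

```python
import csv
import io


def header_columns(text_stream, delimiter=","):
    """Return the column names from the first line of a CSV text stream."""
    reader = csv.reader(text_stream, delimiter=delimiter)
    return next(reader, [])


def compare_headers(cols_a, cols_b):
    """Summarise how two header rows differ (same names, same order, extras)."""
    return {
        "same_columns": sorted(cols_a) == sorted(cols_b),
        "same_order": cols_a == cols_b,
        "only_in_a": [c for c in cols_a if c not in cols_b],
        "only_in_b": [c for c in cols_b if c not in cols_a],
    }


if __name__ == "__main__":
    # Illustrative in-memory data only; for real files, pass
    # open(path, newline="") to header_columns instead.
    good = header_columns(io.StringIO("FIELD_1,FIELD_2,FIELD_3\n1,2,3\n"))
    bad = header_columns(io.StringIO("FIELD_2,FIELD_1,FIELD_3\n2,1,3\n"))
    print(compare_headers(good, bad))
```

If same_columns is true and same_order is false for a good/bad pair, the files differ only in
column order, which would match the guess above.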

For two of the files that cause this problem (I haven't tried the third yet), I have narrowed it
down to a size of 4096 lines: both files work at 4096 lines, and both fail when the number of
lines is increased to 4097 or more.
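
To reproduce the 4096/4097 boundary, truncated copies of a failing file can be generated and
queried in turn. This is a minimal sketch, not part of the original report: the helper names
first_lines and write_truncated are hypothetical, and it counts physical lines (header included),
so a quoted field containing a newline would be counted as more than one line.

```python
import io
from itertools import islice


def first_lines(lines, count):
    """Return an iterator over at most `count` lines from an iterable of lines."""
    return islice(lines, count)


def write_truncated(src_path, dst_path, count):
    """Copy the first `count` lines of src_path to dst_path; return lines written."""
    written = 0
    # newline="" keeps the original line endings untouched in both files.
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        for line in first_lines(src, count):
            dst.write(line)
            written += 1
    return written


if __name__ == "__main__":
    # In-memory demonstration with made-up rows: one header plus 5000 data lines.
    sample = io.StringIO("header\n" + "row\n" * 5000)
    print(sum(1 for _ in first_lines(sample, 4097)))
```

Calling write_truncated on a bad file once with count=4096 and once with count=4097 gives a pair
of files that should sit on either side of the failure described above.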



> IndexOutOfBoundsException raised during select * query on S3 csv file
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4145
>                 URL: https://issues.apache.org/jira/browse/DRILL-4145
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>         Environment: Drill 1.3.0 on a 3 node distributed-mode cluster on AWS.
> Data files on S3.
> S3 storage plugin configuration:
> {
>   "type": "file",
>   "enabled": true,
>   "connection": "s3a://<bucket-name-was-here>",
>   "workspaces": {
>     "root": {
>       "location": "/",
>       "writable": false,
>       "defaultInputFormat": null
>     },
>     "views": {
>       "location": "/processed",
>       "writable": true,
>       "defaultInputFormat": null
>     },
>     "tmp": {
>       "location": "/tmp",
>       "writable": true,
>       "defaultInputFormat": null
>     }
>   },
>   "formats": {
>     "psv": {
>       "type": "text",
>       "extensions": [
>         "tbl"
>       ],
>       "delimiter": "|"
>     },
>     "csv": {
>       "type": "text",
>       "extensions": [
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     },
>     "tsv": {
>       "type": "text",
>       "extensions": [
>         "tsv"
>       ],
>       "delimiter": "\t"
>     },
>     "parquet": {
>       "type": "parquet"
>     },
>     "json": {
>       "type": "json"
>     },
>     "avro": {
>       "type": "avro"
>     },
>     "sequencefile": {
>       "type": "sequencefile",
>       "extensions": [
>         "seq"
>       ]
>     },
>     "csvh": {
>       "type": "text",
>       "extensions": [
>         "csvh",
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     }
>   }
> }
>            Reporter: Peter McTaggart
>         Attachments: apps1-bad.csv, apps1.csv
>
>
> When trying to query (via sqlline or WebUI) a .csv file I am getting an IndexOutOfBoundsException:
> {noformat}
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1-bad.csv` limit 1;
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
> Fragment 0:0
> [Error Id: be9856d2-0b80-4b9c-94a4-a1ca38ec5db0 on ip-XXXXX.compute.internal:31010] (state=,code=0)
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1.csv` limit 1;
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | FIELD_1  |       FIELD_2        | FIELD_3  | FIELD_4  | FIELD_5  |  FIELD_6   | FIELD_7  |  FIELD_8   | FIELD_9  |   FIELD_10   | FIELD_11  |       FIELD_12       | FIELD_13  | FIELD_14  | FIELD_15  | FIELD_16  | FIELD_17  | FIELD_18  | FIELD_19  |       FIELD_20       | FIELD_21  | FIELD_22  | FIELD_23  | FIELD_24  | FIELD_25  | FIELD_26  | FIELD_27  | FIELD_28  | FIELD_29  | FIELD_30  | FIELD_31  | FIELD_32  | FIELD_33  | FIELD_34  | FIELD_35  |
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | 489517   | 27/10/2015 02:05:27  | 261      | 1130232  | 0        | 925630488  | 0        | 925630488  | -1       | 19531580547  | 00000000  | 27/10/2015 02:00:00  |           | 30        | 300       | 0         | 0         | 00000000  | 00000000  | 27/10/2015 02:05:27  | 0         | 1         | 0         | 35.0      |           |           |           | 505       | 872.0     |           | aBc       |           |           |           |           |
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> 1 row selected (1.094 seconds)
> 0: jdbc:drill:>
> {noformat}
> Good file: apps1.csv, and 
> Bad file: apps1-bad.csv  attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
