Mailing-List: contact dev-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Fri, 28 Apr 2017 10:12:04 +0000 (UTC)
From: "Paul Wilson (JIRA)" <jira@apache.org>
To: dev@drill.apache.org
Message-ID: <JIRA.13067650.1493374273000.77451.1493374324119@Atlassian.JIRA>
In-Reply-To: <JIRA.13067650.1493374273000@Atlassian.JIRA>
References: <JIRA.13067650.1493374273000@Atlassian.JIRA> <JIRA.13067650.1493374273673@jira-lw-us.apache.org>
Subject: [jira] [Created] (DRILL-5451) Query on csv file w/ header fails
 with an exception when non existing column is requested if file is over
 4096 lines long
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 28 Apr 2017 10:12:09 -0000

Paul Wilson created DRILL-5451:
----------------------------------

             Summary: Query on csv file w/ header fails with an exception when non existing column is requested if file is over 4096 lines long
                 Key: DRILL-5451
                 URL: https://issues.apache.org/jira/browse/DRILL-5451
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Text & CSV
    Affects Versions: 1.10.0
         Environment: Tested on CentOS 7 and Ubuntu
            Reporter: Paul Wilson


When querying a text (csv) file with extractHeader set to true, selecting a nonexistent column works as expected (returns an "empty" value) when the file has 4096 lines or fewer (1 header plus up to 4095 data lines), but results in an IndexOutOfBoundsException when the file has 4097 lines or more.

With Storage config:
{code:javascript}
"csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
{code}

In the following, 4096_lines.csvh is identical to 4097_lines.csvh with the last line removed.
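
For anyone who wants to reproduce this, the two files can be regenerated with a short script. This is only a sketch: the header names and row text are reconstructed from the query output in this report, and the helper names (write_csvh, make_test_files) and the temporary output directory are made up for illustration. Copy the generated files to wherever the dfs workspace points.

```python
import os
import tempfile


def write_csvh(path, total_lines):
    """Write a .csvh file with one header line plus (total_lines - 1) data lines."""
    with open(path, "w", newline="") as f:
        f.write("line_no,line_description\n")
        # Line 1 is the header, so data rows are numbered from 2.
        for n in range(2, total_lines + 1):
            f.write(f"{n},this is line number {n}\n")


def make_test_files(directory):
    """Create 4096_lines.csvh and 4097_lines.csvh; return {total_lines: path}."""
    paths = {}
    for total in (4096, 4097):
        path = os.path.join(directory, f"{total}_lines.csvh")
        write_csvh(path, total)
        paths[total] = path
    return paths


if __name__ == "__main__":
    # Hypothetical output location; adjust to match your dfs workspace.
    out_dir = tempfile.mkdtemp(prefix="drill5451_")
    for total, path in make_test_files(out_dir).items():
        print(total, path)
```

The only difference between the two files is the final data row, which is what moves the file from 4095 to 4096 data rows.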

Results:
{noformat}
0: jdbc:drill:zk=local> select * from dfs.`/test/4097_lines.csvh` LIMIT 2;
+----------+------------------------+
| line_no  |    line_description    |
+----------+------------------------+
| 2        | this is line number 2  |
| 3        | this is line number 3  |
+----------+------------------------+
2 rows selected (2.455 seconds)
0: jdbc:drill:zk=local> select line_no, non_existent_field from dfs.`/test/4096_lines.csvh` LIMIT 2;
+----------+---------------------+
| line_no  | non_existent_field  |
+----------+---------------------+
| 2        |                     |
| 3        |                     |
+----------+---------------------+
2 rows selected (2.248 seconds)
0: jdbc:drill:zk=local> select line_no, non_existent_field from dfs.`/test/4097_lines.csvh` LIMIT 2;
Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))

Fragment 0:0

[Error Id: eb0974a8-026d-4048-9f10-ffb821a0d300 on localhost:31010]

  (java.lang.IndexOutOfBoundsException) index: 16384, length: 4 (expected: range(0, 16384))
    io.netty.buffer.DrillBuf.checkIndexD():123
    io.netty.buffer.DrillBuf.chk():147
    io.netty.buffer.DrillBuf.getInt():520
    org.apache.drill.exec.vector.UInt4Vector$Accessor.get():358
    org.apache.drill.exec.vector.VarCharVector$Mutator.setValueCount():659
    org.apache.drill.exec.physical.impl.ScanBatch.next():234
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:zk=local> 
{noformat}
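
The numbers in the error message appear to line up with the 4096-line threshold. The arithmetic below is a sketch of that reading, not a confirmed root cause. It assumes the failing buffer is the 4-byte offset vector read by UInt4Vector (as the stack trace suggests) and that a variable-width vector holding N values reads offset entry N when its value count is set. The function names are made up for illustration.

```python
OFFSET_WIDTH_BYTES = 4      # "length: 4" in the error; UInt4 entries are 4 bytes wide
BUFFER_BYTES = 16384        # upper bound of "range(0, 16384)" in the error
FAILING_BYTE_INDEX = 16384  # "index: 16384" in the error


def entries_in_buffer(buffer_bytes, width=OFFSET_WIDTH_BYTES):
    """Number of fixed-width offset entries that fit in the buffer."""
    return buffer_bytes // width


def last_offset_byte_index(total_lines, width=OFFSET_WIDTH_BYTES):
    """Byte index of offset entry N for a file with N data rows.

    Assumption: one header line, and the vector reads entry N for N values.
    """
    data_rows = total_lines - 1
    return data_rows * width


def fits(total_lines):
    """True if the 4-byte read of entry N stays inside the buffer."""
    return last_offset_byte_index(total_lines) + OFFSET_WIDTH_BYTES <= BUFFER_BYTES


if __name__ == "__main__":
    print(entries_in_buffer(BUFFER_BYTES))   # 4096
    print(last_offset_byte_index(4096))      # 16380
    print(last_offset_byte_index(4097))      # 16384
    print(fits(4096), fits(4097))            # True False
```

Under these assumptions the buffer holds exactly 4096 offset entries, and a file with 4096 data rows (4097 lines) is the first one that needs to read a 4097th entry, at byte index 16384.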

This seems similar to the issue fixed in [DRILL-4108|https://issues.apache.org/jira/browse/DRILL-4108] but it now only manifests for longer files.

I also see a similar result (i.e. it works for <= 4096 lines, IndexOutOfBoundsException for > 4096 lines) for a {{SELECT count(*) ...}} query on these files.


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)