drill-issues mailing list archives

From "Alexander Reshetov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-2677) Query does not go beyond 4096 lines in small JSON files
Date Fri, 03 Apr 2015 20:45:53 GMT

     [ https://issues.apache.org/jira/browse/DRILL-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Reshetov updated DRILL-2677:
--------------------------------------
    Attachment: dataset_4096_and_1.json
                dataset_4095_and_1.json

The first two files, which illustrate the root cause of the issue.
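
For readers without the attachments at hand, files of a similar shape could be generated with a short script. This is only a sketch under stated assumptions: the real attachment contents are not shown in this message, so the filler record and the helper names (build_lines, write_dataset) are hypothetical. It also assumes that the repeated records omit HostUpdateTypeNW.Transfers entirely, which the single-row result reported in the description suggests but does not prove. The pck, timestamp, TransferingID and TransferingKind values of the final record are copied from the query output quoted in the description.

```python
import json

# Hypothetical filler record: the repeated line in the real attachment is not shown.
FILLER = {"pck": 1, "timestamp": 1419807470000000}

# Final record; values copied from the query output in the report,
# trimmed to two keys of the Transfers entry for brevity.
FINAL = {
    "pck": 3547,
    "timestamp": 1419807470286356,
    "HostUpdateTypeNW": {
        "Transfers": [{"TransferingID": "88888", "TransferingKind": "8"}]
    },
}


def build_lines(repeat_count):
    """Return repeat_count identical filler lines plus one line with a Transfers array."""
    lines = [json.dumps(FILLER)] * repeat_count
    lines.append(json.dumps(FINAL))
    return lines


def write_dataset(path, repeat_count):
    """Write one JSON record per line to path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_lines(repeat_count)) + "\n")
```

Calling write_dataset("dataset_4095_and_1.json", 4095) and write_dataset("dataset_4096_and_1.json", 4096) would produce two files that differ only in the number of leading filler lines.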

> Query does not go beyond 4096 lines in small JSON files
> -------------------------------------------------------
>
>                 Key: DRILL-2677
>                 URL: https://issues.apache.org/jira/browse/DRILL-2677
>             Project: Apache Drill
>          Issue Type: Bug
>         Environment: drill 0.8 official build
>            Reporter: Alexander Reshetov
>         Attachments: dataset_4095_and_1.json, dataset_4096_and_1.json
>
>
> Hello,
> I'm trying to execute the following query:
> {code}
> select * from (select source.pck, source.`timestamp`, flatten(source.HostUpdateTypeNW.Transfers) as entry from dfs.`/mnt/data/dataset_4095_and_1.json` as source) as parsed;
> {code}
> It works as expected and returns this result:
> {code}
> +------------+------------+------------+
> |    pck     | timestamp  |   entry    |
> +------------+------------+------------+
> | 3547       | 1419807470286356 | {"TransferingPurpose":"8","TransferingImpact":"88","TransferingKind":"8","TransferingTime":"888888888","PackageOrigSenderID":"8","TransferingID":"88888","TransitCN":"888","PackageChkPnt":"8888","PackageFullSize":"8","TransferingSessionID":"8","SubpackagesCounter":"8"} |
> +------------+------------+------------+
> 1 row selected (0.188 seconds)
> {code}
> This file contains 4095 identical lines of one JSON record, followed by one different JSON line at the end (see attached file dataset_4095_and_1.json).
> The problem is that when the first line repeats more than 4095 times, the query throws an exception. Here is the query for a file with 4096 lines of the first type followed by 1 line of another type (see attached file dataset_4096_and_1.json).
> {code}
> select * from (select source.pck, source.`timestamp`, flatten(source.HostUpdateTypeNW.Transfers) as entry from dfs.`/mnt/data/dataset_4096_and_1.json` as source) as parsed;
> Exception in thread "2ae108ff-b7ea-8f07-054e-84875815d856:frag:0:0" java.lang.RuntimeException: Error closing fragment context.
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:224)
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:187)
> 	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: org.apache.drill.exec.vector.NullableIntVector cannot be cast to org.apache.drill.exec.vector.RepeatedVector
> 	at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.getFlattenFieldTransferPair(FlattenRecordBatch.java:274)
> 	at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.setupNewSchema(FlattenRecordBatch.java:296)
> 	at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:78)
> 	at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext(FlattenRecordBatch.java:122)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
> 	at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
> 	at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
> 	at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
> 	at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
> 	at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:68)
> 	at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:96)
> 	at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:58)
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:163)
> 	... 4 more
> Query failed: RemoteRpcException: Failure while running fragment., org.apache.drill.exec.vector.NullableIntVector cannot be cast to org.apache.drill.exec.vector.RepeatedVector [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> {code}
> This suggests that Drill stops analyzing the schema after exactly 4096 lines, which is why my query fails.
> I assume this behavior also causes another issue, the one that led me to investigate this one. It shows up on large files: perhaps Drill splits the file into smaller chunks, and one of them contains a similar sequence of lines (4096 of the same type from Drill's point of view), at which point the query stops with a different exception. The large file is attached as dataset_sample.json.gz.
> Here is the view (dataset_sample.view.drill) that I use to query the large file:
> {code}
> {
>   "name" : "dataset_sample",
>   "sql" : "SELECT `Message`.`timestamp`, `flatten`(`Message`.`HostUpdateTypeCR`['Transfers']) AS `entries`\nFROM `dfs`.`/mnt/data/dataset_sample.json.gz` AS `Message`",
>   "fields" : [ {
>     "name" : "timestamp",
>     "type" : "ANY"
>   }, {
>     "name" : "transfers",
>     "type" : "ANY"
>   } ],
>   "workspaceSchemaPath" : [ "dfs", "mnt" ]
> }
> {code}
> And here is the query I'm trying to execute:
> {code}
> 0: jdbc:drill:zk=local> create table dataset_tbl as
> . . . . . . . . . . . > select dataset_sample.transfers.TransferingID as id, dataset_sample.transfers.TransferingKind as type from dataset_sample;
> Query failed: Query stopped., index: 9502, length: 1 (expected: range(0, 1024)) [ c5eac3ee-0266-4645-b6b5-2a1b58df4821 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> 0: jdbc:drill:zk=local> Exception in thread "WorkManager-19" java.lang.IllegalStateException
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:133)
> 	at org.apache.drill.common.DeferredException.addException(DeferredException.java:47)
> 	at org.apache.drill.common.DeferredException.addThrowable(DeferredException.java:61)
> 	at org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:133)
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:181)
> 	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> Please let me know if I should split this issue into two separate issues or if you need any additional info.
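
The reporter's hypothesis, that the schema is decided per group of 4096 lines, can be illustrated with a toy model. This is not Drill's reader code and makes no claim about how Drill is implemented: the batch size of 4096 is simply the threshold observed in the report, the function names are hypothetical, the field is flattened to a top-level "Transfers" key for brevity, and the "nullable_int" fallback only mirrors the NullableIntVector named in the stack trace.

```python
BATCH_SIZE = 4096  # assumption: taken from the threshold observed in the report


def infer_batch_types(records, field, batch_size=BATCH_SIZE):
    """Yield one inferred type label per batch for the given field.

    A batch in which the field never appears falls back to "nullable_int";
    a batch in which every occurrence is a list is labelled "repeated".
    """
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        seen = {type(r[field]).__name__ for r in batch if field in r}
        if not seen:
            yield "nullable_int"
        elif seen == {"list"}:
            yield "repeated"
        else:
            yield "other"


def schema_changes(records, field, batch_size=BATCH_SIZE):
    """Return True when different batches disagree on the field's type."""
    return len(set(infer_batch_types(records, field, batch_size))) > 1
```

Under this model, 4095 filler records plus one record with a "Transfers" list fit in a single batch and yield one consistent type, while 4096 filler records push the last record into a second batch, so the two batches disagree. That matches the symptom described above, but it remains a guess about the cause.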



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
