hive-dev mailing list archives

From "Puneet Gupta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled
Date Sat, 15 Feb 2014 08:53:19 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902353#comment-13902353 ]

Puneet Gupta commented on HIVE-5922:
------------------------------------

I got a similar exception (on seeking to row 9,103,258):

{code}
java.io.IOException: Seek outside of data in compressed stream Stream for column 65 kind DATA position: 1572882 length: 2116178 range: 1 offset: 1048588 limit: 1048588 range 0 = 0 to 0;  range 1 = 524294 to 1048588;  range 2 = 1835029 to 262147 uncompressed: 1048588 to 1048588 to 1572882
	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:277)
	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:153)
	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:197)
	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:161)
	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:54)
	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.skip(RunLengthIntegerReaderV2.java:318)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.skipRows(RecordReaderImpl.java:427)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.skipRows(RecordReaderImpl.java:1181)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:2183)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.seekToRow(RecordReaderImpl.java:2284)
{code}

Note that the seek target (1572882) is exactly range 1's offset (524294) plus its length (1048588); that is, the desired position equals offsets[i] + bytes[i].remaining(), which is precisely the boundary condition in this issue's title.

Some observations
----
1. I used Snappy compression.

2. There are 75 columns in the file (mostly numeric: int, long, byte, short, plus a few strings).
The exception always happens for column 65, which is an int. If I remove this column from the
include-column list, seek works fine.

3. The issue happens only when I seek to a row using RecordReader.seekToRow(long). In
this flow the RecordReader is created using Reader.rows(long, long, boolean[], SearchArgument,
String[]). The SearchArgument uses an IN construct with 200 long values, which are the
row numbers I want to retrieve (SearchArgument.FACTORY.newBuilder().startOr().in(colName,
<200 long values>).end().build()); see the sketch after this list. The exception happens on a seek
to row 9103258 (the file has about 13 million rows). I tried a SearchArgument with just one IN
value of 9103258... BINGO... I got the same exception. The problem can be reproduced for any
row seek between 9103258 and 9103279; rows after that range seem to work fine.

4. I see no exceptions if the RecordReader is created using Reader.rows(null) and the entire
file is iterated using RecordReader.hasNext() and RecordReader.next().

5. I see no exceptions if the RecordReader is created using Reader.rows(long, long, boolean[],
SearchArgument, String[]) with a null SearchArgument. The required data (about 200 rows) is
then retrieved using RecordReader.seekToRow(long) and RecordReader.next().

6. The obvious workaround is not to use predicate pushdown. In my case, since I know the row
numbers to seek to, the performance penalty is not very drastic:
	Read/seekToRow of 167 rows in 3609 ms: existing usage with predicate pushdown in ORC
	Read/seekToRow of 167 rows in 4626 ms: workaround without predicate/SearchArgument pushdown
	   >>>> Difference of 1017 ms, roughly 6 ms per row slower
(around 80% of the values are fetched from different strides)
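
For reference, here is a minimal, untested sketch of the failing path (observation 3) and the
workaround (observation 5), using the Reader/RecordReader calls named above. The file path, the
include-array layout, and the column name "col65" are placeholders invented for illustration;
they are not taken from the actual file.

{code}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;

public class SeekToRowRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);                  // hypothetical ORC file path
    FileSystem fs = path.getFileSystem(conf);
    Reader reader = OrcFile.createReader(fs, path);

    // Include everything: index 0 is the root struct, 1..75 the columns
    // (assumed flat schema).
    boolean[] include = new boolean[76];
    Arrays.fill(include, true);

    // Column-name array for SARG evaluation; only the filter column needs a
    // name here. "col65" is a made-up name for the int column at id 65.
    String[] colNames = new String[76];
    colNames[65] = "col65";

    // Failing path (observation 3): predicate pushdown via SearchArgument.
    // A single IN value is enough to reproduce the failure.
    SearchArgument sarg = SearchArgument.FACTORY.newBuilder()
        .startOr()
        .in("col65", 9103258L)
        .end()
        .build();
    RecordReader withSarg = reader.rows(0, Long.MAX_VALUE, include, sarg, colNames);
    try {
      withSarg.seekToRow(9103258L);                 // throws "Seek outside of data ..."
    } catch (java.io.IOException expected) {
      System.out.println("reproduced: " + expected.getMessage());
    }

    // Workaround (observation 5): same overload, null SearchArgument.
    RecordReader noSarg = reader.rows(0, Long.MAX_VALUE, include, null, null);
    noSarg.seekToRow(9103258L);                     // works
    Object row = noSarg.next(null);

    withSarg.close();
    noSarg.close();
  }
}
{code}

The only difference between the two readers is the non-null SearchArgument; that is what enables
row-group skipping and triggers the bad seek.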


> In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5922
>                 URL: https://issues.apache.org/jira/browse/HIVE-5922
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Yin Huai
>
> Two stack traces ...
> {code}
> java.io.IOException: IO error in map input file hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/000004_0
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.io.IOException: java.io.IOException: Seek outside of data in compressed stream Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
> 	... 9 more
> Caused by: java.io.IOException: Seek outside of data in compressed stream Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
> 	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
> 	... 13 more
> {code}
> {code}
> java.io.IOException: IO error in map input file hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/000095_0
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.io.IOException: java.lang.IllegalStateException: Can't read header at compressed stream Stream for column 9 kind DATA position: 20447466 length: 20958101 range: 6 offset: 1835029 limit: 1835029 range 0 = 0 to 524294;  range 1 = 1835029 to 2097176;  range 2 = 5242940 to 1835029;  range 3 = 8650851 to 1835029;  range 4 = 11796615 to 2097176;  range 5 = 15204526 to 2097176;  range 6 = 18612437 to 1835029 uncompressed: 262144 to 262144
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
> 	... 9 more
> Caused by: java.lang.IllegalStateException: Can't read header at compressed stream Stream for column 9 kind DATA position: 20447466 length: 20958101 range: 6 offset: 1835029 limit: 1835029 range 0 = 0 to 524294;  range 1 = 1835029 to 2097176;  range 2 = 5242940 to 1835029;  range 3 = 8650851 to 1835029;  range 4 = 11796615 to 2097176;  range 5 = 15204526 to 2097176;  range 6 = 18612437 to 1835029 uncompressed: 262144 to 262144
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:195)
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
> 	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
> 	... 13 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
