orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: Google Protobuf Version
Date Tue, 26 Sep 2017 22:02:21 GMT
The extra characters after some instances of ORC appear because the bytes
that follow happen to be printable characters, and the strings command is a
generic tool that knows nothing about the ORC format. Of course, the bytes
0x4f, 0x52, 0x43 ("ORC") could also occur by accident in the file, but that
is relatively unlikely.
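
As a rough illustration of why the strings output shows things like "ORC&"
and "ORC_R", here is a minimal Python sketch that searches raw bytes for the
three magic bytes and then extends each hit forward while the bytes stay
printable. The function names and the sample bytes are made up for this
example, and the logic is a simplification of what strings does (the real
tool also looks backward for the start of a printable run).

```python
MAGIC = b"ORC"  # the bytes 0x4f, 0x52, 0x43


def find_magic_offsets(data: bytes) -> list:
    """Return every byte offset at which the three magic bytes occur."""
    offsets = []
    pos = data.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(MAGIC, pos + 1)
    return offsets


def printable_run(data: bytes, start: int) -> str:
    """Extend a hit forward while the bytes are printable ASCII.

    This mimics, in simplified form, why a generic tool reports extra
    characters after "ORC": it has no idea where the magic ends.
    """
    end = start
    while end < len(data) and 0x20 <= data[end] <= 0x7E:
        end += 1
    return data[start:end].decode("ascii")


# Synthetic bytes for illustration only; this is NOT a valid ORC file.
sample = b"ORC&\x00\x01payload\x00ORC_R\x07\x00tail\x08ORC\x17"

for offset in find_magic_offsets(sample):
    print(offset, printable_run(sample, offset))
```

Only the first three characters of each reported string matter here; anything
after them is whatever data happened to follow the magic.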

Your output implies that you used Writer.writeIntermediateFooter to put
intermediate footers into the file. Since there is a large gap between the
last offset and the length of the file, I would guess that your application
didn't close the writer, so the final footer was never written at the end of
the file. Try passing 33162188 (the last offset + 4) as
ReaderOptions.maxLength(). You should then get a valid reader and be able to
read the data before that footer (ignoring the last ~6 MB of data in the
file).
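
The arithmetic behind that number can be sketched in Python as below, using
the offset and file size reported in this thread. The helper names are
hypothetical. The copy_prefix function is an alternative to
ReaderOptions.maxLength(), not the same mechanism: it assumes that a copy of
the file cut at the end of the last intermediate footer will look to a reader
like a file that ends there, which I have not verified against this
particular file.

```python
def candidate_length(last_magic_offset: int) -> int:
    """Length of the file if it ended at this postscript.

    "ORC" is 3 bytes and is followed by 1 byte holding the postscript
    length, so the end of the tail is the magic offset + 4.
    """
    return last_magic_offset + 4


def copy_prefix(src_path, dst_path, length, chunk_size=1 << 20):
    """Copy the first `length` bytes of src_path to dst_path.

    Returns the number of bytes actually copied, which is smaller than
    `length` only if the source file is shorter than that.
    """
    remaining = length
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while remaining > 0:
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break
            dst.write(chunk)
            remaining -= len(chunk)
    return length - remaining


# Values taken from the output quoted in this thread.
last_offset = 33162184
file_size = 39845888

max_length = candidate_length(last_offset)
unreadable_tail = file_size - max_length
print(max_length, unreadable_tail)
```

The difference between the file size and the computed length is the data
written after the last intermediate footer, which is the part that cannot be
recovered this way.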

.. Owen


On Tue, Sep 26, 2017 at 12:09 PM, Yonatan Augarten <yoni@intango.com> wrote:

> Thank you for the detailed explanation!
>
> Interesting. I'm getting the following (very strange) output (including
> the spaces before the 0):
>
>>       0 ORC&
>> 10288812     ORC
>> 14991902 ORC
>> 33162184 ORC_R
>>
>
> The file size is 39845888 bytes.
>
> On Tue, Sep 26, 2017 at 11:49 AM, Owen O'Malley <owen.omalley@gmail.com>
> wrote:
>
>> Ok, it was reading the postscript (via OrcProto$PostScript.parseFrom),
>> which is the very first thing it does.
>>
>> The first thing to try is to see if you have a proper postscript
>> somewhere in the file. If you are on Mac or Linux, try:
>>
>> % strings -n 3 -t d example/decimal.orc | grep ORC
>>
>> Replacing example/decimal.orc with your ORC file. You'll get an output
>> like:
>>
>> 0 ORC
>> 16333 ORC
>>
>> which are the offsets where "ORC" is located. The ORC format puts it once
>> at the front of the file (so that the "file" command can detect the format)
>> and once at the end of the postscript. (There is always one byte after the
>> last ORC, which is the length of the postscript, so the total length of the
>> file should be the final offset + 4.)
>>
>> .. Owen
>>
>> On Tue, Sep 26, 2017 at 1:36 AM, Yonatan Augarten <yoni@intango.com>
>> wrote:
>>
>>> No, the file is invalid. The problem is that our code sometimes
>>> generates invalid ORC files.
>>> The code is always called from a single thread, and it performs a series
>>> of "addRowBatch" actions on a writer.
>>> The file is then closed and loaded to a hive table.
>>> This works 99% of the time, but in some cases the resulting file is
>>> somehow corrupt.
>>> See below the stack trace of an attempt to run orcfiledump on this file.
>>>
>>> Thanks for your help,
>>> Yoni.
>>>
>>> Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
>>>     at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)
>>>     at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:498)
>>>     at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16466)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16424)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16562)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16557)
>>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:16910)
>>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
>>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:311)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>>>     at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:96)
>>>     at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:81)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>>     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>
>>>
>>>
>>> On Tue, Sep 26, 2017 at 12:11 AM, Owen O'Malley <owen.omalley@gmail.com>
>>> wrote:
>>>
>>>> On Mon, Sep 25, 2017 at 12:47 PM, Yonatan Augarten <yoni@intango.com>
>>>> wrote:
>>>>
>>>>> Would you say that it's likely that this error (*Protocol message
>>>>> contained an invalid tag (zero)*) is caused by the wrong version?
>>>>>
>>>>
>>>>  No, it is likely something else. However, I haven't seen that error
>>>> coming out of the ORC reader before. Can you give me the whole stack trace?
>>>> Are you sure that it is a valid ORC file?
>>>>
>>>> Thanks,
>>>>    Owen
>>>>
>>>
>>>
>>
>
