hive-dev mailing list archives

From "Pankit Thapar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8137) Empty ORC file handling
Date Wed, 17 Sep 2014 17:26:33 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137591#comment-14137591 ]

Pankit Thapar commented on HIVE-8137:
-------------------------------------

I think Tez works in this case because, in the Tez code path, Hive creates files for empty
tables.
I don't know whether that would be the right approach for OrcInputFormat.
Another way to avoid creating a split would be to list the file statuses in CombineHiveInputFormat.getSplits(),
filter out zero-length files, and pass the filtered list on to Hadoop. The drawback is that we
add an O(n) overhead for fetching the file statuses, where n is the number of input files.
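
A minimal sketch of the filtering idea above, assuming the file listing has already been
fetched. FileInfo and dropEmpty are hypothetical names used only for illustration; they are
not Hive or Hadoop APIs. A real change would presumably operate on Hadoop FileStatus objects
inside getSplits() instead.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroLengthFilter {

    // Hypothetical stand-in for a file status entry: a path plus its length in bytes.
    public static final class FileInfo {
        public final String path;
        public final long length;

        public FileInfo(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Keeps only the entries whose length is greater than zero.
    // This is a single pass over the listing, i.e. the O(n) cost discussed above.
    public static List<FileInfo> dropEmpty(List<FileInfo> listing) {
        List<FileInfo> kept = new ArrayList<>();
        for (FileInfo f : listing) {
            if (f.length > 0) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileInfo> listing = new ArrayList<>();
        listing.add(new FileInfo("/warehouse/t/000000_0", 0L));    // empty file, dropped
        listing.add(new FileInfo("/warehouse/t/000001_0", 1024L)); // kept
        for (FileInfo f : dropEmpty(listing)) {
            System.out.println(f.path + " " + f.length);
        }
    }
}
```

Only the surviving entries would then be handed on for split generation, so no split is ever
created for a zero-length file.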


> Empty ORC file handling
> -----------------------
>
>                 Key: HIVE-8137
>                 URL: https://issues.apache.org/jira/browse/HIVE-8137
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.13.1
>            Reporter: Pankit Thapar
>             Fix For: 0.14.0
>
>
> Hive 0.13 does not handle reading a zero-size ORC file properly. An ORC file is supposed
> to have a PostScript, which the ReaderImpl class reads in order to initialize the footer.
> If the file is empty (zero size), ReaderImpl runs into an IndexOutOfBoundsException
> because it tries to read the PostScript in its constructor.
> Code snippet:
> //get length of PostScript
> int psLen = buffer.get(readSize - 1) & 0xff;
> In the above code, readSize for an empty file is zero.
> The ensureOrcFooter() method already performs some sanity checks on the footer,
> so we can either move the above snippet into ensureOrcFooter() and throw a "Malformed
> ORC file" exception, or create a dummy Reader that does not initialize the footer and
> whose hasNext() returns false on the first call.
> Basically, I would like to know the correct way to handle an empty ORC
> file in a mapred job.
> Should we ignore it without throwing an exception, or throw an exception saying that the
> ORC file is malformed?
> Please let me know your thoughts on this.
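
To illustrate the failing read in the quoted description: with readSize equal to zero, the
snippet calls buffer.get(-1), which throws. The sketch below guards that read. It is not the
ReaderImpl code; readPsLen and the -1 sentinel are assumptions made for illustration, and the
caller would still have to choose between skipping the file and reporting it as malformed.

```java
import java.nio.ByteBuffer;

public class PostScriptLength {

    // Returns the PostScript length stored in the last byte that was read,
    // or -1 when nothing was read (for example, a zero-length file).
    public static int readPsLen(ByteBuffer buffer, int readSize) {
        if (readSize <= 0) {
            // Hypothetical sentinel: the caller decides whether to skip the file
            // or to raise a "Malformed ORC file" error.
            return -1;
        }
        // Same expression as in the quoted snippet; & 0xff makes the byte unsigned.
        return buffer.get(readSize - 1) & 0xff;
    }

    public static void main(String[] args) {
        ByteBuffer empty = ByteBuffer.allocate(0);
        System.out.println(readPsLen(empty, 0)); // prints -1

        ByteBuffer tail = ByteBuffer.wrap(new byte[] {1, 2, (byte) 200});
        System.out.println(readPsLen(tail, 3));  // prints 200
    }
}
```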



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
