hive-dev mailing list archives

From "Pankit Thapar (JIRA)" <>
Subject [jira] [Updated] (HIVE-8137) Empty ORC file handling
Date Sat, 04 Oct 2014 22:38:34 GMT


Pankit Thapar updated HIVE-8137:
    Status: Patch Available  (was: Open)

Current logic
CombineHiveInputFormat.getSplits() calls CombineFileInputFormatShim, which is a subclass
of CombineFileInputFormat (in Hadoop).
CombineFileInputFormatShim calls CombineFileInputFormat.getSplits(), which creates splits
without checking file sizes. As a result, we
get CombineFileSplits that contain empty files.

Issue with the current logic
Empty files are not valid ORC files, since the format requires certain structures,
such as the postscript, to be present in the file.
This causes an ArrayIndexOutOfBoundsException in the ORC reader, because it tries to access
a postscript that is not present in the empty file.

Fix
1. Override listStatus() of FileInputFormat in CombineFileInputFormatShim, so that when
CombineFileInputFormat.getSplits() calls listStatus(),
it ends up calling CombineFileInputFormatShim.listStatus(), which has the logic for skipping
empty files when creating splits.

2. Also, avoid creating empty file splits in OrcInputFormat.FileGenerator.

Added two unit tests to cover the two fixes.
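
The filtering idea behind both fixes can be sketched as follows. This is only an illustration under stated assumptions, not the code in the attached patch: FileEntry is a hypothetical stand-in for Hadoop's FileStatus, and skipEmptyFiles is a hypothetical helper name. In the real change, the equivalent check would live in the overridden listStatus() and in OrcInputFormat.FileGenerator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's FileStatus: only a path and a length.
final class FileEntry {
    final String path;
    final long length;

    FileEntry(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

final class EmptyFileFilterSketch {
    // Returns only the entries whose length is greater than zero, so that
    // no split is ever created for an empty file.
    static List<FileEntry> skipEmptyFiles(List<FileEntry> listed) {
        List<FileEntry> nonEmpty = new ArrayList<>();
        for (FileEntry entry : listed) {
            if (entry.length > 0) {
                nonEmpty.add(entry);
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        List<FileEntry> listed = new ArrayList<>();
        listed.add(new FileEntry("part-00000.orc", 0L));   // empty, skipped
        listed.add(new FileEntry("part-00001.orc", 512L)); // kept
        for (FileEntry entry : skipEmptyFiles(listed)) {
            System.out.println(entry.path);
        }
    }
}
```

Filtering at listing time means the split-generation code never sees zero-length files, so the ORC reader is never asked to open one.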

> Empty ORC file handling
> -----------------------
>                 Key: HIVE-8137
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.13.1
>            Reporter: Pankit Thapar
>             Fix For: 0.14.0
>         Attachments: HIVE-8137.patch
> Hive 13 does not handle reading a zero-size ORC file properly. An ORC file is supposed
> to have a postscript,
> which the ReaderImpl class tries to read and use to initialize the footer. But if
> the file is empty
> (zero size), it runs into an IndexOutOfBoundsException because ReaderImpl
> tries to read the postscript in its constructor.
> Code snippet:
> //get length of PostScript
> int psLen = buffer.get(readSize - 1) & 0xff;
> In the above code, readSize for an empty file is zero, so the index passed to get() is -1.
> I see that the ensureOrcFooter() method performs some sanity checks on the footer,
> so either we can move the above code snippet into ensureOrcFooter() and throw a "Malformed
> ORC file" exception, or we can create a dummy Reader that does not initialize the footer and
> whose hasNext() returns false on the first call.
> Basically, I would like to know the correct way to handle an empty ORC
> file in a mapred job.
> Should we ignore it and not throw an exception, or should we throw an exception stating that the
> ORC file is malformed?
> Please let me know your thoughts on this.
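
The failure described in the quoted issue, and the first option it raises (rejecting the file as malformed), can be sketched as follows. This is a minimal illustration, not ReaderImpl's actual code: the class and method names are hypothetical, and only the single quoted line that reads the postscript length is taken from the issue.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

final class PostScriptLengthSketch {
    // The unguarded read quoted in the issue. With readSize == 0 the index
    // is -1, so ByteBuffer.get(int) throws IndexOutOfBoundsException.
    static int readPsLenUnchecked(ByteBuffer buffer, int readSize) {
        return buffer.get(readSize - 1) & 0xff;
    }

    // Hypothetical guarded variant: fail with a descriptive error instead of
    // an index exception when there is nothing to read.
    static int readPsLenChecked(ByteBuffer buffer, int readSize) throws IOException {
        if (readSize <= 0) {
            throw new IOException("Malformed ORC file: file is empty, no PostScript found");
        }
        // The last byte of the file holds the PostScript length (unsigned).
        return buffer.get(readSize - 1) & 0xff;
    }
}
```

The second option in the issue, a dummy Reader that reports no rows, would instead treat the empty file as valid input with zero records. Which of the two is preferable is the question the issue leaves open.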
