hive-dev mailing list archives

From "Vitaliy Fuks (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2395) Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat
Date Wed, 24 Aug 2011 00:47:29 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089907#comment-13089907 ]

Vitaliy Fuks commented on HIVE-2395:
------------------------------------

Right, of course, and that's the workaround we went with. With smaller-than-block files it
really doesn't matter.

After I filed this ticket I did a quick and dirty hack on DeprecatedLzoTextInputFormat to
ignore .lzo.index files. However, I actually found that it doesn't read larger-than-block-size
.lzo files correctly at all - it either crashes with things like ArrayIndexOutOfBoundsException
in LzoDecompressor.setInput() or just outright ignores all data beyond the block size. This
would happen even if .lzo.index files were absent.
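
For illustration, the name check behind a hack like this might look like the sketch below. It is
plain Java with no Hadoop dependency; LzoIndexFilter and its methods are hypothetical names, not
part of hadoop-lzo or Hive, and the actual change made to DeprecatedLzoTextInputFormat is not
shown in this message. In Hadoop itself the same test would presumably sit behind a PathFilter
or inside the input format's file listing, which is an assumption rather than something taken
from the ticket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of hadoop-lzo or Hive.
final class LzoIndexFilter {
    // Suffix taken from the file listing in the ticket description.
    static final String INDEX_SUFFIX = ".lzo.index";

    private LzoIndexFilter() {}

    /** Returns true if the path should be read as table data, false for index files. */
    static boolean isDataFile(String path) {
        return !path.endsWith(INDEX_SUFFIX);
    }

    /** Returns the given paths with index files removed, preserving order. */
    static List<String> dataFiles(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (isDataFile(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "/tables/ourdata_2011-08-19.lzo",
            "/tables/ourdata_2011-08-19.lzo.index",
            "/tables/ourdata_2011-08-18.lzo",
            "/tables/ourdata_2011-08-18.lzo.index");
        // Only the .lzo data files are printed; the index files are skipped.
        for (String p : dataFiles(listing)) {
            System.out.println(p);
        }
    }
}
```

Note that a filter like this only keeps index files from being handed to a record reader. It does
nothing about the incorrect reads of larger-than-block-size files described above.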

So then I recreated the tables without INPUTFORMAT "DeprecatedLzoTextInputFormat", and they now
return correct data. Hive still attempts to read .lzo.index files as data, so we are going
un-indexed as a workaround (with the loss of splitting as the obvious side effect). At
this point, I'm not sure why we were using DeprecatedLzoTextInputFormat in the first place,
other than that Google "told" us to. Maybe the Hive codebase has moved beyond needing it?

I will try code changes from https://github.com/kevinweil/hadoop-lzo/pull/28 when I have some
free time.

PS. Our hadoop-lzo.jar is from the January 2011 release (v0.4.8) built by Gerrit.

> Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2395
>                 URL: https://issues.apache.org/jira/browse/HIVE-2395
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.7.1
>         Environment: Cloudera 3u1 with https://github.com/kevinweil/hadoop-lzo or https://github.com/kevinweil/elephant-bird
>            Reporter: Vitaliy Fuks
>
> We have a {{/tables/}} directory containing .lzo files with our data, compressed using
> lzop.
> We {{CREATE EXTERNAL TABLE}} on top of this directory, using {{STORED AS INPUTFORMAT
> "com.hadoop.mapred.DeprecatedLzoTextInputFormat"}}.
> .lzo files require that an LzoIndexer is run on them. When this is done, a .lzo.index file
> is created for every .lzo file, so we end up with:
> {noformat}
> /tables/ourdata_2011-08-19.lzo
> /tables/ourdata_2011-08-19.lzo.index
> /tables/ourdata_2011-08-18.lzo
> /tables/ourdata_2011-08-18.lzo.index
> ..etc
> {noformat}
> The issue is that org.apache.hadoop.hive.ql.io.CombineHiveRecordReader is attempting
> to getRecordReader() for .lzo.index files. This throws a pretty confusing exception:
> {noformat}
> Caused by: java.io.IOException: No LZO codec found, cannot run.
>         at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:53)
>         at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:128)
>         at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
> {noformat}
> More precisely, it dies on the second invocation of getRecordReader(). Here is some
> System.out.println() output:
> {noformat}
> DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo:0+616479
> DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo.index:0+64
> {noformat}
> DeprecatedLzoTextInputFormat contains the following code, which causes the ultimate exception
> and death of the query, as it obviously doesn't have a codec to read .lzo.index files.
> {noformat}
>     final CompressionCodec codec = codecFactory.getCodec(file);
>     if (codec == null) {
>       throw new IOException("No LZO codec found, cannot run.");
>     }
> {noformat}
> So I understand that the way things are right now is that Hive considers all files within
> a directory to be part of a table. There is an open patch HIVE-951 which would allow a quick
> workaround for this problem.
> Does it make sense to add some hooks so that CombineHiveRecordReader or its parents are
> more aware of what files should be considered instead of blindly trying to read everything?
> Any suggestions for a quick workaround to make it skip .index files?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
