hadoop-common-user mailing list archives

From Mike Sukmanowsky <mike.sukmanow...@gmail.com>
Subject Custom InputFormat for Multiline Input File Hive/Hadoop
Date Sat, 08 Oct 2011 00:47:29 GMT
Hi all,

Sending this to core-user@hadoop.apache.org and dev@hive.apache.org.

I'm trying to process Omniture's data log files with Hadoop/Hive. The file
format is tab delimited and pretty simple for the most part, but it does
allow multiple newlines and tabs within a field, each escaped by a
backslash (\\n and \\t). As a result I've opted to create my own InputFormat
that joins lines split by escaped newlines back into a single record and
converts the escaped tabs to spaces, so that Hive doesn't treat them as
field separators when it splits each row on tabs.
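
To make the intended behaviour concrete, here is a minimal sketch of just the
record-joining step in plain Java, with no Hadoop dependencies. It is not the
code from the gists linked in this message. The class and method names are
hypothetical, and it assumes that an escaped newline or tab is a backslash
immediately followed by the literal newline or tab character, and that
backslashes themselves are never escaped. Replacing an escaped newline with a
space is also an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class EscapedRecordSketch {

    // Reads one logical record, or returns null at end of input.
    static String readRecord(BufferedReader in) throws IOException {
        String line = in.readLine();
        if (line == null) {
            return null;
        }
        StringBuilder record = new StringBuilder();
        // Assumption: a physical line ending in a backslash continues on the
        // next physical line. Drop the backslash and join with a space.
        while (line != null && line.endsWith("\\")) {
            record.append(line, 0, line.length() - 1).append(' ');
            line = in.readLine();
        }
        if (line != null) {
            record.append(line);
        }
        // Assumption: an escaped tab is a backslash followed by a literal tab.
        // Replace the pair with a space so it is not seen as a field separator.
        return record.toString().replace("\\\t", " ");
    }

    // Convenience wrapper that reads every logical record from a string.
    static List<String> readAll(String raw) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(raw))) {
            String record;
            while ((record = readRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Two logical records; the first contains an escaped tab and an
        // escaped newline.
        String raw = "a\tb\\\tc\td\\\ne\nx\ty\n";
        for (String record : readAll(raw)) {
            System.out.println(record.split("\t").length + " fields: " + record);
        }
    }
}
```

In an old-API (org.apache.hadoop.mapred) RecordReader, logic along these lines
would presumably sit inside next(key, value), reading physical lines from the
underlying line reader until a complete record has been assembled.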

I've found a fairly good reference for doing this using the newer
InputFormat API at http://blog.rguha.net/?p=293 but unfortunately my version
of Hive (0.7.0) still uses the old InputFormat API.

I haven't been able to find many tutorials on writing a custom InputFormat
using the older API, so I'm looking for some guidance as to what may be
wrong with the following two classes:

https://gist.github.com/3141e9d27d4e07f5f9ed
https://gist.github.com/79fdab227950a0776616

With these classes, SELECT statements in Hive currently return nothing, and
my other variations returned nothing but NULL values.

This issue is also available on StackOverflow at
http://stackoverflow.com/questions/7692994/custom-inputformat-with-hive.

If there's a resource someone can point me to that'd also be great.

Many thanks in advance,
Mike
