hive-user mailing list archives

From Ali Safdar Kureishy <safdar.kurei...@gmail.com>
Subject Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.
Date Sat, 05 May 2012 20:05:53 GMT
Hi,

I have attached a SequenceFile with the following record format:
<url:Text> <data:CrawlDatum>

(CrawlDatum is a custom Java type that contains several fields, which would
be flattened into several columns by the SerDe.)

In other words, what I would like to do is expose this URL+CrawlDatum
data via a Hive external table, with the following columns:
|| url || status || fetchtime || fetchinterval || modifiedtime || retries
|| score || metadata ||

So, I was hoping that after defining a custom SerDe, I would just have to
define the Hive table as follows:

CREATE EXTERNAL TABLE crawldb
(url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS SEQUENCEFILE
LOCATION '/user/training/deepcrawl/crawldb/current';

For example, a sample record should look like the following through the Hive
table:
|| http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 ||
1 || 0.98 || {x=1,y=2,p=3,q=4} ||

I would like this to be possible without having to duplicate/flatten the
data through a separate transformation. Initially, I thought my custom
SerDe could just have the following definition for deserialize():

        @Override
        public Object deserialize(Writable obj) throws SerDeException {
            ...
        }
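
For concreteness, here is roughly what I imagined that body doing. I am
writing the CrawlDatum accessors from memory, so please treat the method
names as approximate, and note that the url column has to be left null
because the key never reaches the SerDe:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;

// Inside my NutchCrawlDBSequenceFileSerDe -- only deserialize() shown.
@Override
public Object deserialize(Writable obj) throws SerDeException {
    CrawlDatum datum = (CrawlDatum) obj;   // Hive only hands the SerDe the VALUE
    List<Object> row = new ArrayList<Object>(8);
    row.add(null);                         // url: lives in the KEY, which we never see!
    row.add(CrawlDatum.getStatusName(datum.getStatus()));
    row.add(datum.getFetchTime());
    row.add((long) datum.getFetchInterval());
    row.add(datum.getModifiedTime());
    row.add((int) datum.getRetriesSinceFetch());
    row.add(datum.getScore());
    row.add(datum.getMetaData());          // MapWritable of metadata key/values
    return row;
}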

But the problem is that the input argument obj above is only the VALUE
portion of a SequenceFile record. There seems to be a limitation in the way
Hive reads SequenceFiles: for each row, the KEY is ignored and only the
VALUE is passed to the SerDe. This can be seen in the
org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method, which
drops the KEY while iterating over a RecordReader (note the key/value
handling in the Hive code below):

  /**
   * Get the next row. The fetch context is modified appropriately.
   *
   **/
  public InspectableObject getNextRow() throws IOException {
    try {
      while (true) {
        if (currRecReader == null) {
          currRecReader = getRecordReader();
          if (currRecReader == null) {
            return null;
          }
        }

        boolean ret = currRecReader.next(key, value);
        if (ret) {
          if (this.currPart == null) {
            Object obj = serde.deserialize(value);
            return new InspectableObject(obj, serde.getObjectInspector());
          } else {
            rowWithPart[0] = serde.deserialize(value);
            return new InspectableObject(rowWithPart, rowObjectInspector);
          }
          }
        } else {
          currRecReader.close();
          currRecReader = null;
        }
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

As you can see, the "key" variable is read but never returned. The problem
is that in the Nutch crawldb SequenceFile the KEY is the URL, and I need it
to be displayed in the Hive table along with the fields of CrawlDatum. But
when writing the custom SerDe, I only see the CrawlDatum that comes after
the key on each record... which is not sufficient.

One hack could be to write a CustomSequenceFileRecordReader.java that
returns the offset in the sequence file as the KEY and an aggregation of
the (Key+Value) pair as the VALUE. For that, I would probably have to hack
the following code from SequenceFileRecordReader, which could get very
messy (a sketch of what I mean follows after the snippet):
  protected synchronized boolean next(K key)
    throws IOException {
    if (!more) return false;
    long pos = in.getPosition();
    boolean remaining = (in.next(key) != null);
    if (pos >= end && in.syncSeen()) {
      more = false;
    } else {
      more = remaining;
    }
    return more;
  }
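
To make the hack concrete, here is a rough, untested sketch of the kind of
wrapper I have in mind. All class and variable names are mine, and I am
packing the (url, datum) pair into a single BytesWritable that the SerDe
would then have to unpack (the split handling is also simplified compared
to the real SequenceFileRecordReader):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.nutch.crawl.CrawlDatum;

public class CustomSequenceFileRecordReader
    implements RecordReader<LongWritable, BytesWritable> {

  private final SequenceFile.Reader in;
  private final long start;
  private final long end;
  private final Text url = new Text();                // the real KEY in the crawldb
  private final CrawlDatum datum = new CrawlDatum();  // the real VALUE

  public CustomSequenceFileRecordReader(Configuration conf, FileSplit split)
      throws IOException {
    Path path = split.getPath();
    in = new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
    start = split.getStart();
    end = start + split.getLength();
    if (start > in.getPosition()) {
      in.sync(start);                                 // skip to the next sync mark in this split
    }
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    long pos = in.getPosition();
    if (pos >= end || !in.next(url, datum)) {
      return false;
    }
    key.set(pos);                                     // file offset plays the role of the KEY
    // Serialize url + datum back-to-back so the SerDe sees both.
    DataOutputBuffer buf = new DataOutputBuffer();
    url.write(buf);
    datum.write(buf);
    value.set(buf.getData(), 0, buf.getLength());
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }

  public BytesWritable createValue() { return new BytesWritable(); }

  public long getPos() throws IOException { return in.getPosition(); }

  public float getProgress() throws IOException {
    return end == start ? 0.0f
        : Math.min(1.0f, (getPos() - start) / (float) (end - start));
  }

  public void close() throws IOException { in.close(); }
}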

This would require me to write a CustomSequenceFileRecordReader, a
CustomSequenceFileInputFormat, and a custom SerDe, and probably make
several other changes as well. Is it possible to get away with just a
custom SerDe and some pre-existing reader that includes the key when
invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
have this limitation when accessing SequenceFiles? I would imagine that the
key of a SequenceFile record is just as important as the value... so why is
it dropped by the FetchOperator::getNextRow() method?

If this is the unfortunate reality of reading SequenceFiles in Hive, is
there another Hive storage format I should use that works around this
limitation? Something like "create external table ..... STORED AS
CUSTOM_SEQUENCEFILE"? Or, if I write my own
CustomHiveSequenceFileInputFormat, how do I register it with Hive and use
it in the Hive "STORED AS" definition?
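
From what I can tell from the DDL documentation, Hive does accept explicit
input/output format classes in place of a storage keyword, so I was
picturing something like the following (the jar path and the input format
class name are placeholders for things I would still have to write):

ADD JAR /path/to/nutch-crawldb-serde.jar;

CREATE EXTERNAL TABLE crawldb
(url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
 modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS
  INPUTFORMAT 'CustomHiveSequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current';

Would that be the right way to wire it up?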

Any help or pointers would be greatly appreciated. I hope I'm mistaken
about the limitation above, and if not, hopefully there is an easy way to
resolve this through a custom SerDe alone.

Warm regards,
Safdar
