hadoop-common-user mailing list archives

From "Stu Hood" <stuh...@webmail.us>
Subject InputFormat for Two Types
Date Sun, 30 Sep 2007 22:33:50 GMT

I need to write a MapReduce program that begins with 2 jobs:
 1. Convert raw log data to SequenceFiles
 2. Read from SequenceFiles, and cherry-pick completed events
  (otherwise, keep them as SequenceFiles to be checked again later)
But I should be able to combine those 2 jobs into 1 job.

I just need to figure out how to write an InputFormat that uses 2 types of RecordReaders,
depending on the input file type. Specifically, the inputs would be either raw log data (TextInputFormat),
or partially processed log data (SequenceFileInputFormat).

I think I need to extend SequenceFileInputFormat to look for an identifying extension on the
files. Then I would be able to return either a LineRecordReader or a SequenceFileRecordReader,
and some logic in Map could process the line into a record.
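
A minimal sketch of the extension check described above, written in plain Java with no Hadoop dependency so it can be run on its own. The class name `ReaderDispatch`, the `Kind` enum, and the `.seq` extension are all assumptions made for illustration; the message does not say which extension would mark partially processed files. In a real InputFormat, the result of this check would decide which RecordReader to construct for a split. Note that a LineRecordReader produces byte-offset keys and line values, while a SequenceFileRecordReader produces whatever key and value types the file stores, so the Map logic would have to accept both shapes.

```java
import java.util.Locale;

/** Sketch: choose a reader kind per input file from its name alone. */
public class ReaderDispatch {

    /** The two kinds of input described in the message. */
    public enum Kind { RAW_TEXT, SEQUENCE }

    // Hypothetical marker extension for partially processed files (an assumption).
    static final String SEQUENCE_EXTENSION = ".seq";

    /** Returns SEQUENCE if the file name ends with the marker extension, else RAW_TEXT. */
    public static Kind kindFor(String path) {
        // Only look at the final path component so directory names cannot match.
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.toLowerCase(Locale.ROOT).endsWith(SEQUENCE_EXTENSION)
                ? Kind.SEQUENCE
                : Kind.RAW_TEXT;
    }

    public static void main(String[] args) {
        // Example paths are made up for the demonstration.
        for (String p : new String[] {"logs/2007-09-30.log", "pending/part-00000.seq"}) {
            System.out.println(p + " -> " + kindFor(p));
        }
    }
}
```

Keeping the decision in one small function means the naming convention can change without touching the reader construction or the Map logic.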

Am I headed in the right direction? Or should I stick with running 2 jobs instead of trying
to squash these steps into 1?


Stu Hood

