hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: DBInputFormat / DBWritable question
Date Mon, 09 Aug 2010 18:09:18 GMT
Thanks much for the info, and the additional tips.

Unfortunately we're doing a lot of transformation of the DB data as we
bring it into Hadoop, so I don't think Sqoop's an option.

Thanks again,


On 08/06/2010 12:50 AM, Aaron Kimball wrote:
> The InputFormat instantiates a RecordReader (DBRecordReader) in the same
> process as the Mapper. The DBWritable instances are instantiated inside the
> RecordReader and fed directly to your mapper.
> If your mapper then processes the data and sends it directly to the
> OutputFormat (e.g., through TextOutputFormat which just calls
> key/val.toString())  then you do not need to implement the Writable
> interface.
> If you intend to serialize your data to SequenceFiles (through
> SequenceFileOutputFormat, or otherwise) or as intermediate data (to be
> consumed by a reducer) then you need to implement Writable.
> For that matter, if you don't intend to use DBOutputFormat with this data,
> then you don't even need to provide a body for the "void
> write(PreparedStatement)" method; just stub it.
> A couple other tips:
> * Consider using DataDrivenDBInputFormat. It's considerably
> higher-throughput.
> * If you're using CDH (Cloudera's Distribution for Hadoop), rather than
> write your own DBWritable, use Sqoop's code generation capability (sqoop
> codegen --connect ... --table ...) to create your Java class for you.
> * Related, if all you're doing is importing a copy of the data to HDFS,
> Sqoop can handle that for you pretty easily :)
> See github.com/cloudera/sqoop and archive.cloudera.com/cdh/3/sqoop for more
> info.
> Cheers,
> - Aaron
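[Editor's note: the map-only pattern Aaron describes can be sketched roughly as
below. The UserRecord class and its (id, name) columns are hypothetical, and
DBWritable's shape is reproduced inline purely so the sketch compiles without
Hadoop on the classpath; in a real job you would implement
org.apache.hadoop.mapreduce.lib.db.DBWritable instead.]

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// DBWritable's shape, reproduced here so the sketch is self-contained.
// A real job implements org.apache.hadoop.mapreduce.lib.db.DBWritable.
interface DBWritable {
    void readFields(ResultSet rs) throws SQLException;
    void write(PreparedStatement ps) throws SQLException;
}

// Hypothetical record for a map-only job reading (id, name) rows.
// Writable is deliberately NOT implemented: the record is created by the
// DBRecordReader and consumed by the mapper in the same JVM, so no
// DataInput/DataOutput serialization is ever needed.
public class UserRecord implements DBWritable {
    private long id;
    private String name;

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        // Column order matches the field list passed to the InputFormat.
        id = rs.getLong(1);
        name = rs.getString(2);
    }

    @Override
    public void write(PreparedStatement ps) throws SQLException {
        // Stubbed, per Aaron's tip: this record is only ever read from the
        // DB, never written back via DBOutputFormat.
        throw new UnsupportedOperationException("read-only record");
    }

    public long getId() { return id; }
    public String getName() { return name; }
}
```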
> On Wed, Aug 4, 2010 at 7:41 PM, Harsh J<qwertymaniac@gmail.com> wrote:
>> AFAIK you don't really need serialization if your job is a map-only
>> one; the OutputFormat/RecWriter (if any) should take care of it.
>> On Thu, Aug 5, 2010 at 7:07 AM, David Rosenstrauch<darose@darose.net>
>> wrote:
>>> I'm working on a M/R job which uses DBInputFormat, so I have to
>>> create my own DBWritable for this.  I'm a little bit confused about
>>> how to implement this though.
>>> In the sample code in the Javadoc for the DBWritable class, the
>>> MyWritable implements both DBWritable and Writable - thereby forcing
>>> the author of the MyWritable class to implement the methods to
>>> serialize/deserialize it to/from DataInput & DataOutput.  Without
>>> getting into too much detail, having to implement this serialization
>>> would add a good bit of complexity to my code.
>>> However, the DBWritable that I'm writing really doesn't need to
>>> exist beyond the Mapper.  I.e., it'll be input to the Mapper, but the
>>> Mapper won't emit it out to the sort/reduce steps.  And after doing
>>> some reading/digging through the code, it looks to me like the
>>> InputFormat and the Mapper always get run on the same host & JVM.  If
>>> that's in fact the case, then there'd be no need for me to make my
>>> DBWritable implement Writable also, and so I could avoid the whole
>>> serialization/deserialization issue.
>>> So my question is basically:  have I got this correct?  Do the
>>> InputFormat and the Mapper always run in the same VM?  (In which case
>>> I can do what I'm planning and code the DBWritable without the
>>> serialization headaches from the Writable class.)
>>> TIA,
>>> DR
>> --
>> Harsh J
>> www.harshj.com
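[Editor's note: a rough configuration sketch of the DataDrivenDBInputFormat
tip from the thread, wiring a map-only job against a hypothetical "users"
table and a hypothetical UserRecord class implementing DBWritable. The JDBC
driver, connection URL, credentials, and column names are placeholders, not
from the thread.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class ImportJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "db-import");

        // Placeholder JDBC driver, URL, and credentials.
        DBConfiguration.configureDB(job.getConfiguration(),
            "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb",
            "user", "pass");

        // Split on the numeric "id" column so several mappers can each
        // read a disjoint range of rows in parallel -- the source of
        // DataDrivenDBInputFormat's higher throughput.
        DataDrivenDBInputFormat.setInput(job, UserRecord.class,
            "users", /* conditions */ null, /* splitBy */ "id",
            "id", "name");

        job.setInputFormatClass(DataDrivenDBInputFormat.class);
        job.setNumReduceTasks(0);  // map-only: no Writable needed

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting zero reduce tasks is what makes the no-Writable shortcut safe: with
no shuffle, the records never cross a JVM boundary.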
