hadoop-common-user mailing list archives

From Jay Vyas <jayunit...@gmail.com>
Subject Re: Python + hdfs written thrift sequence files: lots of moving parts!
Date Tue, 25 Sep 2012 19:01:45 GMT
Thanks, Harsh. In any case, I'm really curious about how sequence
file headers are formatted, as the documentation in the SequenceFile
javadocs seems to be very generic.

To make my questions more concrete:

1) I notice that the FileSplit class has a getStart() method. It is
documented as returning the place to start "processing". Does that imply
that a FileSplit does, or does not, include the file header?

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html#getStart%28%29

2) Also, it's not clear to me how compression and serialization are
related. These are two intricately coupled aspects of HDFS file writing,
and I'm not sure what the idiom is for coordinating the compression of
records with their deserialization.
