hadoop-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: HDFS data and non-aligned splits
Date Thu, 23 May 2013 17:59:00 GMT
Related to this, I see in the elephant book (Hadoop: The Definitive Guide) under "Which compression format should I use":
"Use a container file format such as Sequence File..."
Does SequenceFile attempt to align compressed data on block boundaries?

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, May 23, 2013 11:53 AM
To: user@hadoop.apache.org
Subject: HDFS data and non-aligned splits

What happens when MR produces data splits that don't align with HDFS block boundaries?
I've read that MR tries to place splits near block boundaries to improve data locality,
but isn't there always some slop where a record straddles a block boundary, requiring
an extra HDFS connection just to fetch the part of the record that sits in the next block?
Does this affect performance? Are there file formats that try to enforce data alignment?
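
To make the straddling case concrete, here is a minimal sketch in plain Java. It assumes a simplified ownership rule: a newline-terminated record belongs to the split that contains its first byte, so a reader skips any partial record at its start and reads past its end to finish the last one. The class SplitDemo and method readSplit are hypothetical names for illustration only. This is not Hadoop's actual reader code, whose details differ, and a plain byte array stands in for an HDFS file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Returns the records whose first byte lies in [start, end).
    // Hypothetical helper: models the ownership rule only, not HDFS I/O.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // If we begin mid-record, skip ahead to the next record start;
        // the reader of the previous split is responsible for that record.
        while (pos > 0 && pos < data.length && data[pos - 1] != '\n') {
            pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            // This scan may run past `end`: that is the "extra read"
            // into the next block that the question asks about.
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.US_ASCII);
        int splitSize = 8; // stand-in for a block-sized split
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("[" + start + "," + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

In this model every record is read exactly once, and only the tail of the last record in each split comes from the following range, so the extra read is bounded by one record's length per split.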
