hadoop-hdfs-dev mailing list archives

From Hs <aswhol...@gmail.com>
Subject Is Hadoop SequenceFile binary safe?
Date Sat, 27 Apr 2013 11:30:50 GMT

I am learning Hadoop. While reading SequenceFile.java in the hadoop-1.0.4
source code, I found the sync(long position) method, which is used to find a
"sync marker" (a 16-byte MD5 hash generated at file creation time) when
splitting a SequenceFile into input splits in MapReduce.

/** Seek to the next sync mark past a given position.*/
public synchronized void sync(long position) throws IOException {
  if (position+SYNC_SIZE >= end) {
    seek(end);
    return;
  }

  try {
    seek(position+4);                         // skip escape
    in.readFully(syncCheck);
    int syncLen = sync.length;
    for (int i = 0; in.getPos() < end; i++) {
      int j = 0;
      for (; j < syncLen; j++) {
        if (sync[j] != syncCheck[(i+j)%syncLen])
          break;
      }
      if (j == syncLen) {
        in.seek(in.getPos() - SYNC_SIZE);     // position before sync
        return;
      }
      syncCheck[i%syncLen] = in.readByte();
    }
  } catch (ChecksumException e) {             // checksum failure
    handleChecksumException(e);
  }
}

As I understand it, this code simply scans the stream for a byte sequence
containing the same data as the sync marker.
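To make sure I read the loop right, here is a standalone sketch of the rolling 16-byte window comparison (my own simplification over an in-memory byte array, not Hadoop code; the names findSync and data are mine):

```java
import java.util.Arrays;

public class SyncScan {

    /** Return the offset in data where marker first occurs, or -1. */
    static int findSync(byte[] data, byte[] marker) {
        int syncLen = marker.length;
        if (data.length < syncLen) return -1;
        // Pre-fill the circular check buffer, as sync() does with readFully().
        byte[] syncCheck = Arrays.copyOfRange(data, 0, syncLen);
        int pos = syncLen;                          // next byte to read
        for (int i = 0; ; i++) {
            // Compare the window data[i..i+15] against the marker.
            int j = 0;
            for (; j < syncLen; j++) {
                if (marker[j] != syncCheck[(i + j) % syncLen]) break;
            }
            if (j == syncLen) return pos - syncLen; // marker starts here
            if (pos >= data.length) return -1;      // ran out of bytes
            syncCheck[i % syncLen] = data[pos++];   // slide window one byte
        }
    }

    public static void main(String[] args) {
        byte[] marker = "0123456789abcdef".getBytes(); // stand-in for the MD5 marker
        byte[] data = "some record bytes...0123456789abcdef<next>".getBytes();
        System.out.println(findSync(data, marker));    // prints 20
    }
}
```

The point of the modular indexing is that the scan advances one byte at a time without re-reading: the syncCheck buffer is a circular window over the last 16 bytes read.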

My doubt:
Consider a situation where the data in a SequenceFile happens to contain a
16-byte sequence identical to the sync marker. Won't the code above
mistakenly treat those 16 bytes as a sync marker, so that the SequenceFile
won't be parsed correctly?

I don't see any "escape" operation applied to the data or the sync marker.
So how can SequenceFile be binary safe? Am I missing something? Please
correct me if I am wrong.
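For context on the "skip escape" comment: as far as I can tell from the writer side, a sync block is laid out on disk as a record length of -1 (SYNC_ESCAPE) followed by the 16 marker bytes, while payload bytes are written as-is with no escaping. A minimal sketch of that layout (my own names apart from SYNC_ESCAPE; not Hadoop code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SyncBlockLayout {
    // -1 in the record-length position signals a sync block.
    static final int SYNC_ESCAPE = -1;

    /** Build the byte layout of one sync block: 4-byte escape + 16 marker bytes. */
    static byte[] syncBlock(byte[] marker) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(SYNC_ESCAPE);   // the 4 bytes that sync() skips with position+4
            out.write(marker);           // 16 marker bytes follow, unescaped
            return buf.toByteArray();
        } catch (IOException e) {        // cannot happen with an in-memory stream
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];    // stand-in for the per-file MD5 marker
        System.out.println(syncBlock(marker).length); // prints 20 (4 + 16)
    }
}
```

Note the escape only marks where the writer put a sync block; the scan in sync() above matches raw bytes without checking for it, which is exactly what my question is about.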
