hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms
Date Thu, 28 Apr 2011 23:28:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026772#comment-13026772 ]

Todd Lipcon commented on HDFS-1580:

bq. which is relevant for understanding the edit logs. I agree that version is a little overloaded
but that can be addressed in a different jira

Agreed that's a separate JIRA -- I just wanted to clarify that the version you're talking
about here is the "edits log serialization format version" rather than something about actual

bq. If namenode has called purgeTransactions it should never ask for older transaction ids

Fair enough.

bq. Apart from that, sinceTxnId doesn't assume any boundary

I think that will really complicate things like edits transfer in the 2NN. In the file-based
storage there's no clean way to seek to a particular transaction ID, meaning we'd have to
add this facility to EditLogInputStream, etc. That's a lot of complexity for little benefit
that I can see.

bq. The motivation for "mark" method was that BK has this limitation that open ledgers cannot
be read, "mark" will give a cue to a BK implementation that the current ledger should be made
available for reading

This seems like a somewhat serious flaw. If we anticipate using BK for HA, I was under the
impression that the "hot backup" would be following along on the edits as they're written
into BK. What you're saying here implies that the primary NN would have to be rolling its
logs every few seconds if you want the standby to be truly "hot".

bq. If an implementation doesn't have this limitation it can just ignore mark, that is why
I didn't call it roll

Another way of doing this is to say that, if an implementation _does_ have this limitation,
it can choose to "mark" whenever it likes. No?

bq. I assumed that a write also syncs, because in most operations we sync immediately after
writing the log, and in this design we are writing the entire transaction as a unit. 

In fact this is not at all how the current design works. Most operations write the edit to
the log while holding the FSN lock (to ensure serialized order between ops) and then drop
the FSN lock to sync. This allows group commit and is crucial for reasonable throughput.
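
To make the write-then-sync split concrete, here is a minimal Java sketch of group commit. It is an illustration only, not the actual edit log code: the class GroupCommitLog, its methods, and the in-memory "durable" list standing in for stable storage are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of group commit; these names are illustrative, not HDFS classes.
class GroupCommitLog {
    private final Object fsnLock = new Object();            // stands in for the FSN lock
    private final List<String> pending = new ArrayList<>(); // written but not yet synced
    private final List<String> durable = new ArrayList<>(); // stands in for stable storage
    private long lastWrittenTxId = 0; // guarded by fsnLock
    private long lastSyncedTxId = 0;  // guarded by this
    private int syncCount = 0;        // guarded by this

    // Step 1: buffer the edit under the lock so txids follow operation order. No I/O here.
    long write(String edit) {
        synchronized (fsnLock) {
            pending.add(edit);
            return ++lastWrittenTxId;
        }
    }

    // Step 2: called after the FSN lock has been released.
    synchronized void sync(long txId) {
        if (txId <= lastSyncedTxId) {
            return; // an earlier sync already covered this transaction
        }
        List<String> batch;
        long upTo;
        synchronized (fsnLock) { // held only long enough to swap out the buffer
            batch = new ArrayList<>(pending);
            pending.clear();
            upTo = lastWrittenTxId;
        }
        durable.addAll(batch); // a real log would do the expensive flush/fsync here
        lastSyncedTxId = upTo;
        syncCount++;
    }

    synchronized int syncCount() { return syncCount; }
    synchronized long lastSyncedTxId() { return lastSyncedTxId; }
    synchronized List<String> durableEdits() { return new ArrayList<>(durable); }
}
```

In this sketch, if three operations call write() before any of them reaches sync(), the first sync() flushes all three edits in one batch and the other two calls return without doing I/O. That batching is only possible because the write and the sync are separate steps.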

bq. Management of buffers and flush, should be the responsibility of the implementation.

But flush needs to be coordinated as a separate action from writing in order to achieve lock
release and group commit.

bq. readNext() reads the version and txnId and synchronizes the underlying inputstream to
the beginning of transaction record and then getTxn can directly return the underlying inputstream
for reading the transaction bytes

Yep, that makes sense.

bq. LogSegments gets rid of roll method but exposes the underlying units of storage to the
namenode which I don't think is required

It's not absolutely required in the theoretical sense, but in the sense that we'd like to
keep the code as simple as possible, I think it helps that goal. For example, edit log transfer
right now is based around the concept of discrete files which can be entirely fetched, with
an associated md5sum. If we have to support fetching arbitrary ranges of transactions, these
safety checks become more difficult to implement. And, we need to split the "file transfer"
code into two different code paths, one for files (fsimage) and another for edits (arbitrary
transaction ranges).

bq. Do we really want this property? Isn't it better that we don't expose any boundaries between
transactions to the namenode?

Yes, this property is very useful for operations. Refer to the discussion on HDFS-1073 about
this property. The fact that I can run "md5sum /data/{1..4}/dfs/name/current/*" and verify
that the files are all identical gives me great peace of mind.
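
The same check can be sketched programmatically. The Java below is a minimal illustration, assuming each storage directory holds a complete copy of the same segment file; the class EditsChecksum and its methods are hypothetical helpers for this example and are not part of HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: checks that every copy of a log segment has the same MD5.
class EditsChecksum {
    // MD5 of a byte array as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // MD5 of a whole file; reads it fully, which is fine for a sketch.
    static String md5Hex(Path file) {
        try {
            return md5Hex(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when all copies hash to the same value (or the list is empty).
    static boolean allIdentical(List<Path> copies) {
        Set<String> sums = new HashSet<>();
        for (Path copy : copies) {
            sums.add(md5Hex(copy));
        }
        return sums.size() <= 1;
    }
}
```

A whole-file comparison like this only works because each segment is a discrete file with fixed boundaries. If each storage implementation were free to split the transaction stream wherever it liked, there would be no common unit to hash.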

> Add interface for generic Write Ahead Logging mechanisms
> --------------------------------------------------------
>                 Key: HDFS-1580
>                 URL: https://issues.apache.org/jira/browse/HDFS-1580
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>         Attachments: EditlogInterface.1.pdf, HDFS-1580+1521.diff, HDFS-1580.diff, HDFS-1580.diff,
HDFS-1580.diff, generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.txt

