hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms
Date Thu, 28 Apr 2011 21:03:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026675#comment-13026675 ]

Todd Lipcon commented on HDFS-1580:
-----------------------------------

Hi Jitendra. Here are some thoughts on your latest document:

- While I appreciate that this work will probably make snapshots a little easier down the
road, it's far from the most difficult part of supporting snapshots, nor is it really the
goal we're trying to address. So I think it's premature to mention snapshots in the design.
- The concept of "layout version" I think has been overloaded way too much. We currently use
a single version number to indicate (a) the file and serialization format for image dumps,
(b) the file and serialization format for edit logs, and (c) the actual layout of files within
the {{current/}} directory. I would like to advocate splitting this out into IMAGE_FORMAT_VERSION,
EDITS_FORMAT_VERSION, and LAYOUT_VERSION. To be clear, this jira is mostly concerned with
what I would call EDITS_FORMAT_VERSION (e.g. the way in which we turn a mkdirs into bytes).
Do you agree with this interpretation?
- The idea of a {{purgeTransactions}} call makes sense -- after a checkpoint has been uploaded
for txid N, we don't need edits prior to N. However, there are some policies that make sense
to me like "keep edits for at least a week". Would you assume these retention policies would
be the responsibility of the edit log implementation? i.e. that, even if told to purge transactions
older than txid N, it might keep them around for some time, or take care of archiving them
to a NAS/HDFS?
- For the {{getInputStream}} call, is there any restriction on valid values of {{sinceTxId}},
such as that it be on some kind of boundary? e.g. that it must correspond to a "mark" call? See more
about this below regarding the idea of "log segments".
- I don't entirely understand the usage of the {{setVersion}} call. When would the version
of a log change mid-stream?
- I'm not entirely clear on "mark" either. The semantics described in the "Discussion" section
are what I would normally call {{sync}}, but in other parts of the document it's described
as a {{roll}} equivalent. If it's not sync, then we're missing sync altogether, and that implies
that each {{write}} call will have to sync on its own, thus breaking group commit. I think
we should maintain the existing buffering/syncing calls {{write}}, {{setReadyToFlush}}, and
{{flushAndSync}}.
- The {{EditLogInputStream}} interface is strange - it's called InputStream but doesn't follow
a normal InputStream API. It's something sort of like an Iterator, but also doesn't implement
that interface. Could we add a wrapper class {{EditTransaction}}, and make EditLogInputStream
an Iterable<EditTransaction>? EditTransaction would then take the {{getTxnId}} call.
- The API {{getTxn}} shouldn't return {{byte[]}} since that implies an extra buffer copy to
get a transaction into its own array. Instead it should be able to point into an existing
byte array. Alternatively, the input stream could continue to implement InputStream so we
can use the existing editlog loading code.
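
To make the last two points concrete, here is a rough sketch (not taken from any patch) of an
input stream that is an Iterable<EditTransaction> and hands out each transaction as a read-only
ByteBuffer view into the underlying array rather than a fresh byte[]. The record layout (8-byte
txid, 4-byte length, payload), the {{getPayload}} accessor, and the {{encode}} helper are
assumptions made up purely for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class EditIterationSketch {

    /** Hypothetical wrapper: a txid plus a read-only view of the serialized op. */
    static final class EditTransaction {
        private final long txId;
        private final ByteBuffer payload;

        EditTransaction(long txId, ByteBuffer payload) {
            this.txId = txId;
            this.payload = payload;
        }

        long getTxnId() { return txId; }

        /** Returns an independent cursor over the same bytes; nothing is copied. */
        ByteBuffer getPayload() { return payload.duplicate(); }
    }

    /** Assumed record layout: 8-byte txid, 4-byte payload length, payload bytes. */
    static final class EditLogInputStream implements Iterable<EditTransaction> {
        private final byte[] log;

        EditLogInputStream(byte[] log) { this.log = log; }

        @Override
        public Iterator<EditTransaction> iterator() {
            final ByteBuffer buf = ByteBuffer.wrap(log);
            return new Iterator<EditTransaction>() {
                @Override
                public boolean hasNext() { return buf.remaining() >= 12; }

                @Override
                public EditTransaction next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    long txId = buf.getLong();
                    int len = buf.getInt();
                    ByteBuffer view = buf.slice();      // shares the backing array
                    view.limit(len);                    // restrict the view to this record
                    buf.position(buf.position() + len); // skip to the next record
                    return new EditTransaction(txId, view.asReadOnlyBuffer());
                }
            };
        }
    }

    /** Demo helper: serializes ops with consecutive txids starting at firstTxId. */
    static byte[] encode(long firstTxId, String... ops) {
        int size = 0;
        for (String op : ops) size += 12 + op.getBytes(StandardCharsets.UTF_8).length;
        ByteBuffer out = ByteBuffer.allocate(size);
        long txId = firstTxId;
        for (String op : ops) {
            byte[] bytes = op.getBytes(StandardCharsets.UTF_8);
            out.putLong(txId++).putInt(bytes.length).put(bytes);
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] log = encode(5L, "mkdirs /a", "delete /b");
        for (EditTransaction txn : new EditLogInputStream(log)) {
            String op = StandardCharsets.UTF_8.decode(txn.getPayload()).toString();
            System.out.println(txn.getTxnId() + ": " + op);
        }
    }
}
```

The only copy in this sketch happens in the demo when the payload is decoded to a String; the
iterator itself just slices the backing array.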

As I've proposed over in some other JIRAs, I think we should do away with the {{roll}} call,
and instead make the concept of _log segments_ a first class citizen. In the file-based storage
case, a log segment is an individual file. In the BK case, it may be that a log segment is
a ledger (I don't know BK's API well).

Thus, rolling the logs becomes a sequence like:
{code}
    endCurrentLogSegment();
    long nextTxId = getLastWrittenTxId() + 1;
    LOG.info("Rolling edit logs. Next txid after roll will be " + nextTxId);
    startLogSegment(nextTxId);
{code}
where {{endCurrentLogSegment}} closes off the current segment across all journals, and {{startLogSegment}}
starts a new output stream across all journals.
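
As a minimal, self-contained sketch of how that sequence could fan out across journals (this is
not the code in the working tree): {{InMemoryJournal}}, {{EditLog}}, {{logEdit}}, and the
"edits_<first>-<last>" segment names below are hypothetical stand-ins for a file or BK ledger
backed implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LogSegmentSketch {

    /** Hypothetical journal that only records which segments were finalized. */
    static final class InMemoryJournal {
        final List<String> finalizedSegments = new ArrayList<>();
        private long segmentStart = -1; // -1 means no segment is open
        private long lastTxId = -1;

        void startLogSegment(long firstTxId) {
            if (segmentStart != -1) throw new IllegalStateException("segment already open");
            segmentStart = firstTxId;
            lastTxId = firstTxId - 1;
        }

        void write(long txId) {
            if (segmentStart == -1) throw new IllegalStateException("no open segment");
            if (txId != lastTxId + 1) throw new IllegalArgumentException("gap in txids");
            lastTxId = txId;
        }

        void endCurrentLogSegment() {
            if (segmentStart == -1) throw new IllegalStateException("no open segment");
            finalizedSegments.add("edits_" + segmentStart + "-" + lastTxId);
            segmentStart = -1;
        }
    }

    /** Applies every segment operation to all journals so they stay parallel. */
    static final class EditLog {
        private final List<InMemoryJournal> journals;
        private long lastWrittenTxId;

        EditLog(List<InMemoryJournal> journals, long lastWrittenTxId) {
            this.journals = journals;
            this.lastWrittenTxId = lastWrittenTxId;
        }

        long getLastWrittenTxId() { return lastWrittenTxId; }

        void startLogSegment(long firstTxId) {
            for (InMemoryJournal j : journals) j.startLogSegment(firstTxId);
        }

        void endCurrentLogSegment() {
            for (InMemoryJournal j : journals) j.endCurrentLogSegment();
        }

        long logEdit() {
            long txId = ++lastWrittenTxId;
            for (InMemoryJournal j : journals) j.write(txId);
            return txId;
        }

        /** The roll sequence described above: end, compute next txid, start. */
        void rollEditLog() {
            endCurrentLogSegment();
            long nextTxId = getLastWrittenTxId() + 1;
            startLogSegment(nextTxId);
        }
    }

    public static void main(String[] args) {
        InMemoryJournal a = new InMemoryJournal();
        InMemoryJournal b = new InMemoryJournal();
        EditLog log = new EditLog(Arrays.asList(a, b), 0);
        log.startLogSegment(1);
        for (int i = 0; i < 3; i++) log.logEdit();
        log.rollEditLog();
        for (int i = 0; i < 2; i++) log.logEdit();
        log.endCurrentLogSegment();
        System.out.println(a.finalizedSegments);
        System.out.println(b.finalizedSegments);
    }
}
```

Because each segment carries its explicit starting txid, both journals end up with the same
segment names without any extra coordination.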

The advantages I see of this approach are:
- elsewhere we have discussed that we want to keep the property that logs always roll together
across all parts of the system, and thus that the storage directories have parallel contents
with identical names and identical file contents. It's possible to achieve this with just
the roll API, but it becomes more obvious how to do it with the segment concept. As one example,
consider what happens when one journal fails (e.g. due to an NFS mount going down temporarily).
While it's down, we don't write txns to this journal. But, after some time we may notice that
the mount is available again. Rather than just calling {{roll}} here, it makes sense to be
explicit that we're starting a new segment, and be explicit about the starting txid of that
new segment.

- We generally want the property that, while saving a namespace or in safe mode, we don't
accept edits. Thus, it would be nice to have the edit log actually be closed during this operation.
Splitting {{roll}} into an {{endCurrent}} and {{startNext}} allows us to add the namespace
dump between the two and make sure that no edits could possibly be written while saving.
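
A small sketch of that second point, under assumptions: the {{EditLog}} class, {{logEdit}}, and
the {{saveNamespace}} helper here are hypothetical, and restarting the segment in a finally
block (even if the dump fails) is just one possible policy, not something settled in this jira.

```java
final class SaveNamespaceSketch {

    /** Hypothetical edit log that refuses writes while no segment is open. */
    static final class EditLog {
        private boolean segmentOpen = false;
        private long lastWrittenTxId = 0;

        void startLogSegment(long firstTxId) {
            if (segmentOpen) throw new IllegalStateException("segment already open");
            if (firstTxId != lastWrittenTxId + 1) {
                throw new IllegalArgumentException("segment must start at next txid");
            }
            segmentOpen = true;
        }

        void endCurrentLogSegment() {
            if (!segmentOpen) throw new IllegalStateException("no open segment");
            segmentOpen = false;
        }

        long logEdit() {
            if (!segmentOpen) throw new IllegalStateException("edit log is closed");
            return ++lastWrittenTxId;
        }

        long getLastWrittenTxId() { return lastWrittenTxId; }
    }

    /**
     * Dumps the namespace between ending one segment and starting the next,
     * so no edit can be logged while the image is being written.
     * Returns the txid that the saved image corresponds to.
     */
    static long saveNamespace(EditLog log, Runnable dumpImage) {
        log.endCurrentLogSegment();
        long checkpointTxId = log.getLastWrittenTxId();
        try {
            dumpImage.run();
        } finally {
            // Assumed policy: always reopen the log, even if the dump failed.
            log.startLogSegment(checkpointTxId + 1);
        }
        return checkpointTxId;
    }

    public static void main(String[] args) {
        EditLog log = new EditLog();
        log.startLogSegment(1);
        log.logEdit();
        log.logEdit();
        long checkpointTxId = saveNamespace(log, () -> {
            try {
                log.logEdit(); // simulates a stray edit during the dump
            } catch (IllegalStateException e) {
                System.out.println("rejected during save: " + e.getMessage());
            }
        });
        System.out.println("image saved at txid " + checkpointTxId);
        System.out.println("next edit gets txid " + log.logEdit());
    }
}
```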

What do you think about these suggestions? You can see a working tree with the "log segment"
concept at https://github.com/toddlipcon/hadoop-hdfs/tree/hdfs-1073-march/src/java/org/apache/hadoop/hdfs/server/namenode/

> Add interface for generic Write Ahead Logging mechanisms
> --------------------------------------------------------
>
>                 Key: HDFS-1580
>                 URL: https://issues.apache.org/jira/browse/HDFS-1580
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: EditlogInterface.1.pdf, HDFS-1580+1521.diff, HDFS-1580.diff, HDFS-1580.diff, HDFS-1580.diff, generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.txt
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
