accumulo-dev mailing list archives

From "Keith Turner (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-578) consider using hdfs for the walog
Date Mon, 21 May 2012 21:25:41 GMT


Keith Turner commented on ACCUMULO-578:

It seems like these changes will work.  Consider the following sequence of events:

 # TserverA creates Walog1
 # User deletes lock for TserverA
 # GC does not see TServerA in zookeeper
 # GC does not see any references to Walog1 in !METADATA
 # TserverA writes that TabletX is using Walog1 to !METADATA
 # TserverA writes mutations for TabletX to Walog1
 # TserverA sends confirmation of write to client
 # TserverA notices its lock went away and kills itself

It must be possible to interleave the following two GC events in any order after event 4 above
and not have data loss.

 # GC recovers lease on Walog1
 # GC looks for Walog1 in !METADATA

If recovering the lease will make writes by other processes fail after that point in time,
I think it's ok.  With this assumption, all interleavings should result in either an error
writing to the walog or the GC seeing the walog in !METADATA.  An error writing to the walog
would prevent the confirmation from being sent to the client.  If writes can still occur after
lease recovery, then there is an interleaving that results in data loss.
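The interleaving argument above can be checked exhaustively with a small model (a sketch only; the event names are illustrative, and it assumes the GC recovers the lease before re-checking !METADATA, in the order listed):

```python
from itertools import combinations

# Tserver events 5-7 above, in program order.
TSERVER = ["write_ref", "write_mutations", "confirm"]
# GC events, in the order listed: recover the lease, then re-check !METADATA.
GC = ["recover_lease", "check_metadata"]

def interleavings():
    """Yield every interleaving that preserves each side's internal order."""
    for gc_slots in combinations(range(5), 2):
        t, g = iter(TSERVER), iter(GC)
        yield [next(g) if i in gc_slots else next(t) for i in range(5)]

def data_loss(seq, lease_fences_writes):
    """True if the client gets a confirmation but the GC deletes the walog."""
    ref_written = recovered = deleted = confirmed = False
    for ev in seq:
        if ev == "recover_lease":
            recovered = True
        elif ev == "check_metadata" and not ref_written:
            deleted = True                  # no reference seen -> GC removes Walog1
        elif ev == "write_ref":
            ref_written = True              # a !METADATA write, unaffected by the lease
        elif ev == "write_mutations":
            if recovered and lease_fences_writes:
                return False                # write error -> no confirmation, client retries
        elif ev == "confirm":
            confirmed = True
    return confirmed and deleted

# If lease recovery fences out further writes, no interleaving loses data.
assert not any(data_loss(s, True) for s in interleavings())
# If writes can still succeed after recovery, some interleaving does.
assert any(data_loss(s, False) for s in interleavings())
```

Dropping the lease-fencing assumption is exactly what exposes the losing interleaving: the GC's metadata check can land before event 5 while the mutation write in event 6 still succeeds.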

> consider using hdfs for the walog
> ---------------------------------
>                 Key: ACCUMULO-578
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: logger, tserver
>    Affects Versions: 1.5.0-SNAPSHOT
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>         Attachments: HDFS_WAL_states.pdf, comparison.png
> Using HDFS for walogs would fix:
>  * ACCUMULO-84: any node can read the replicated files
>  * ACCUMULO-558: wouldn't need to monitor loggers
>  * ACCUMULO-544: log references wouldn't include hostnames
>  * ACCUMULO-423: wouldn't need to monitor loggers
>  * ACCUMULO-258: hdfs has load balancing already
> To implement it, we would need the ability to distribute log sorts.
> Continuing to use loggers helps us avoid:
>  * the hdfs pipeline strategy
>  * lacking fine-grained insight when a single node makes dfs slow
>  * additional namenode pressure
>  * losing flexibility: for example, we can add fadvise() calls to the logger before HDFS supports it

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:

