accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-1053) continuous ingest detected data loss
Date Fri, 15 Feb 2013 15:41:12 GMT


Eric Newton commented on ACCUMULO-1053:

I let CI run all night while randomly killing servers, and it had a failure due to data loss.

The loss occurred during the recovery of the root tablet. In this case, a small WAL held a
few mutations recording the compaction of the table_info METADATA tablet. When the file was
recovered, the last few mutations were missing from the flushed WAL.

Examining the NN logs, I see the file's block being allocated, followed by two fsyncs and,
seven minutes later, a lease recovery.

After that, the sort file is created.

However, the commitBlockSynchronization on the file finishes some 90 seconds *after* the log
sort is complete.

There's something I'm not understanding about how the HDFS file recovery is supposed to work.
Time to go back into the HBase code to see what I'm missing.
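
The race described above (the log sort reading the WAL before commitBlockSynchronization has finalized its last block) is the one HBase guards against by looping on DistributedFileSystem.recoverLease() until it reports the file closed, and only then reading the log. Here's a minimal sketch of that wait loop; recoverLease() returning true-when-closed is the real HDFS client contract, but the `Dfs` interface and `FakeDfs` stand-in are mine, added so the loop runs without a cluster:

```java
public class LeaseRecoveryWait {

    /** Minimal stand-in for the one HDFS client call the loop needs. */
    interface Dfs {
        // Mirrors DistributedFileSystem.recoverLease(Path): triggers lease
        // recovery and returns true once the file is closed.
        boolean recoverLease(String path);
    }

    /** Fake namenode: lease recovery completes on the third attempt. */
    static class FakeDfs implements Dfs {
        private int calls = 0;
        public boolean recoverLease(String path) {
            return ++calls >= 3;
        }
    }

    /**
     * Block until the lease is recovered (and thus the last block length is
     * finalized), so the log sort never reads a still-open WAL. Returns the
     * number of attempts it took.
     */
    static int waitForLeaseRecovery(Dfs dfs, String path, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (dfs.recoverLease(path)) {
                return attempt;  // file is closed; safe to read it now
            }
            Thread.sleep(10);    // real code backs off for seconds per retry
        }
        throw new IllegalStateException("lease not recovered: " + path);
    }

    public static void main(String[] args) throws InterruptedException {
        int attempts = waitForLeaseRecovery(new FakeDfs(), "/wal/entry", 10);
        System.out.println("lease recovered after " + attempts + " attempts");
    }
}
```

The point of the loop is ordering, not retries per se: nothing downstream (the sort) starts until the namenode has confirmed the file closed, which is exactly the guarantee the timeline above shows was missing.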
> continuous ingest detected data loss
> ------------------------------------
>                 Key: ACCUMULO-1053
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test, tserver
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>            Priority: Critical
>             Fix For: 1.5.0
> Now that we're logging directly to HDFS, we added datanodes to the agitator. That is, we
> are now killing datanodes during ingest, and now we are losing data.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
