hadoop-hdfs-issues mailing list archives

From "Wellington Chevreuil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12618) fsck -includeSnapshots reports wrong amount of total blocks
Date Thu, 04 Jan 2018 16:47:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311620#comment-16311620 ]

Wellington Chevreuil commented on HDFS-12618:

bq. validate() then catch AssertionError should be changed, for the reasons Daryn mentioned,
plus the fact that assertion could be disabled at run time. See https://docs.oracle.com/javase/8/docs/technotes/guides/language/assert.html#enable-disable
The latest submitted patch already refactors this code to remove both the use of validate and
the reliance on AssertionError handling.
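
To illustrate why catching AssertionError is fragile, here is a minimal sketch. The class {{ExplicitValidation}} and its {{Lookup}} type are hypothetical stand-ins, not code from the patch or from NamenodeFsck. The assert-based variant only detects the problem when the JVM runs with -ea, while the explicit check behaves the same either way.

```java
public class ExplicitValidation {

    /** Hypothetical stand-in for a lookup result; not an HDFS class. */
    static final class Lookup {
        final String path;
        final boolean found;

        Lookup(String path, boolean found) {
            this.path = path;
            this.found = found;
        }
    }

    // Fragile: when assertions are disabled (the JVM default), the assert
    // statement is skipped, the catch block never runs, and this returns true.
    static boolean isValidViaAssert(Lookup l) {
        try {
            assert l.found : "inode not found: " + l.path;
            return true;
        } catch (AssertionError e) {
            return false;
        }
    }

    // Robust: an ordinary conditional does not depend on -ea / -da.
    static boolean isValidExplicit(Lookup l) {
        return l != null && l.found;
    }

    public static void main(String[] args) {
        Lookup missing = new Lookup("/snap-test/file1", false);
        // The first result depends on whether the JVM was started with -ea.
        System.out.println("assert-based:   " + isValidViaAssert(missing));
        System.out.println("explicit check: " + isValidExplicit(missing));
    }
}
```

Running this once with {{java -ea}} and once without shows the assert-based result changing while the explicit check stays the same.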

bq. I'm not sure the current getLastINode()==null check is enough for {{INodeReference}}s. What
if the block changed in the middle of the snapshots? For example, say file 1 has block 1&2.
Then the following happened: snapshot s1, truncate so file has block 1-only, snapshot s2,
append so file has block 1&3, snapshot s3. Would we be able to tell the difference when
{{fsck -includeSnapshots}} now?
I added a unit test for this scenario; it will appear in the next patch to be submitted, along
with additional observations. The current condition seems to cover this situation, as the test is passing.
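
As a rough model of the expected result for this scenario, the sketch below counts each block ID once across all snapshot views of the file. It is only an illustration under assumed names ({{SnapshotBlockCount}}, plain long block IDs); the patch itself relies on the getLastINode()==null condition rather than on a set of IDs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SnapshotBlockCount {

    // Counts each block ID once, however many views reference it.
    static int countUniqueBlocks(List<long[]> views) {
        Set<Long> seen = new HashSet<>();
        for (long[] blocks : views) {
            for (long id : blocks) {
                seen.add(id);
            }
        }
        return seen.size();
    }

    // Naive count: every occurrence in every view is counted.
    static int countAllOccurrences(List<long[]> views) {
        int total = 0;
        for (long[] blocks : views) {
            total += blocks.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<long[]> views = Arrays.asList(
            new long[] {1, 2},   // snapshot s1
            new long[] {1},      // snapshot s2, taken after the truncate
            new long[] {1, 3},   // snapshot s3, taken after the append
            new long[] {1, 3});  // current state of the file
        System.out.println("all occurrences: " + countAllOccurrences(views)); // 7
        System.out.println("unique blocks:   " + countUniqueBlocks(views));   // 3
    }
}
```

Under this model, blocks 1, 2 and 3 should each be counted exactly once, so the expected total for file 1 is 3.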

bq. Because locks are reacquired during fsck, it's theoretically possible that snapshots are
created / deleted during the scan. I think current behavior is we're not aware of new snapshots,
and skip the deleted snapshots (since snapshottableDirs is populated before the check call).
Possible to add a fault-injected test to make sure we don't NPE on deleted snapshots?
I'm not sure I follow this. Is it that we need to make sure snapshottableDir != null? If so,
we already have an if check for that in the *checkDir* method, line #506.

bq. Speechlessly NamenodeFsck also has other block counts like numMinReplicatedBlocks. Current
code only takes care of total blocks, which IMO is the most important. This also seems to
be the goal of this jira as suggested by the title and description, so Okay to split that
to another jira.
So there's another metric currently broken? I may open another jira for that, but would like
to first get this sorted.

bq. I see the variable name of checkDir is changed to filePath, which is not accurate. Prefer
to keep the old name path.
That was changed to fix a checkstyle warning.

bq. checkFilesInSnapshotOnly: suggest to handle inode==null in its own block, so we don't
have to worry about that for non INodeFile code paths. (FYI null is not instanceof anything,
so patch 4 code didn't have to check. Need to be careful after changing to isFile, as (correctly)
suggested by Daryn.)
The last patch applied Daryn's suggestion to use the *isFile* helper method, so I guess we now
need to make sure inode is not null.

bq. lastSnapshotId = -1 should use Snapshot.NO_SNAPSHOT_ID rather than -1.
Applied on last patch.

bq. inodeFile.getFileWithSnapshotFeature().getDiffs() can never be null judging from FileWithSnapshotFeature,
so no need for nullity check
Fixed on last patch.

bq. Please format the code you changed. There are many space inconsistencies around brackets.
Formatted on last patch.

bq. Test should add timeouts. Perhaps better to just use a Rule on the class, to safeguard
cases by default with something like 3 minutes.
Will be available on next patch.

bq. Feels to me the "HEALTHY" check in the beginning of each test case is not necessary.
Will be available on next patch.

bq. Could use GenericTestUtils.waitFor() for the waits.
Will be available on next patch.
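
For context, the polling pattern behind a waitFor-style helper can be sketched with the standard library alone. The class below is a simplified stand-in written for illustration; it is not Hadoop's GenericTestUtils implementation, and its parameter names are assumptions.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class PollingWait {

    // Re-evaluates the condition every checkEveryMillis until it holds,
    // and gives up with a TimeoutException after waitForMillis.
    static void waitFor(BooleanSupplier check, long checkEveryMillis, long waitForMillis)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + waitForMillis * 1_000_000L;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "condition not met within " + waitForMillis + " ms");
            }
            Thread.sleep(checkEveryMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 200 ms after start.
        waitFor(() -> System.currentTimeMillis() - start >= 200, 50, 5_000);
        System.out.println("condition met");
    }
}
```

Compared with a fixed sleep, this returns as soon as the condition holds and fails with a clear error when it never does.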

bq. Optional - TestFsck is already 2.4k+ lines long. Maybe better to create a new test class
for snapshot blockcount specifically. In that class the name of each test would be shorter
and more readable.
Will be available on next patch.

> fsck -includeSnapshots reports wrong amount of total blocks
> -----------------------------------------------------------
>                 Key: HDFS-12618
>                 URL: https://issues.apache.org/jira/browse/HDFS-12618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>    Affects Versions: 3.0.0-alpha3
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: HDFS-121618.initial, HDFS-12618.001.patch, HDFS-12618.002.patch,
HDFS-12618.003.patch, HDFS-12618.004.patch, HDFS-12618.005.patch
> When snapshots are enabled, if a file is deleted but is still contained in a snapshot, *fsck*
will not report blocks for that file, showing a different number of *total blocks* than what
is exposed in the Web UI.
> This should be fine, as *fsck* provides *-includeSnapshots* option. The problem is that
*-includeSnapshots* option causes *fsck* to count blocks for every occurrence of a file on
snapshots, which is wrong because these blocks should be counted only once (for instance,
if a 100MB file is present on 3 snapshots, it would still map to one block only in hdfs).
This causes fsck to report many more blocks than actually exist in hdfs and are reported
in the Web UI.
> Here's an example:
> 1) HDFS has two files of 2 blocks each:
> {noformat}
> $ hdfs dfs -ls -R /
> drwxr-xr-x   - root supergroup          0 2017-10-07 21:21 /snap-test
> -rw-r--r--   1 root supergroup  209715200 2017-10-07 20:16 /snap-test/file1
> -rw-r--r--   1 root supergroup  209715200 2017-10-07 20:17 /snap-test/file2
> drwxr-xr-x   - root supergroup          0 2017-05-13 13:03 /test
> {noformat} 
> 2) There are two snapshots, with the two files present on each of the snapshots:
> {noformat}
> $ hdfs dfs -ls -R /snap-test/.snapshot
> drwxr-xr-x   - root supergroup          0 2017-10-07 21:21 /snap-test/.snapshot/snap1
> -rw-r--r--   1 root supergroup  209715200 2017-10-07 20:16 /snap-test/.snapshot/snap1/file1
> -rw-r--r--   1 root supergroup  209715200 2017-10-07 20:17 /snap-test/.snapshot/snap1/file2
> drwxr-xr-x   - root supergroup          0 2017-10-07 21:21 /snap-test/.snapshot/snap2
> -rw-r--r--   1 root supergroup  209715200 2017-10-07 20:16 /snap-test/.snapshot/snap2/file1
> -rw-r--r--   1 root supergroup  209715200 2017-10-07 20:17 /snap-test/.snapshot/snap2/file2
> {noformat}
> 3) *fsck -includeSnapshots* reports 12 blocks in total (4 blocks for the normal file
path, plus 4 blocks for each snapshot path):
> {noformat}
> $ hdfs fsck / -includeSnapshots
> FSCK started by root (auth:SIMPLE) from / for path / at Mon Oct 09 15:15:36 BST 2017
> Status: HEALTHY
>  Number of data-nodes:	1
>  Number of racks:		1
>  Total dirs:			6
>  Total symlinks:		0
> Replicated Blocks:
>  Total size:	1258291200 B
>  Total files:	6
>  Total blocks (validated):	12 (avg. block size 104857600 B)
>  Minimally replicated blocks:	12 (100.0 %)
>  Over-replicated blocks:	0 (0.0 %)
>  Under-replicated blocks:	0 (0.0 %)
>  Mis-replicated blocks:		0 (0.0 %)
>  Default replication factor:	1
>  Average block replication:	1.0
>  Missing blocks:		0
>  Corrupt blocks:		0
>  Missing replicas:		0 (0.0 %)
> {noformat}
> 4) Web UI shows the correct number (4 blocks only):
> {noformat}
> Security is off.
> Safemode is off.
> 5 files and directories, 4 blocks = 9 total filesystem object(s).
> {noformat}
> I would like to work on this, and will propose an initial solution shortly.
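
The figures in the example above can be cross-checked with a few lines of arithmetic. This is only a sketch: the class and method names are hypothetical, and the constants are copied from the ls and fsck output in the description.

```java
public class FsckArithmetic {

    static final long FILE_SIZE = 209_715_200L;   // per-file size from the ls output
    static final long BLOCK_SIZE = 104_857_600L;  // avg. block size from the fsck output

    static long blocksPerFile() {
        return FILE_SIZE / BLOCK_SIZE;
    }

    // What fsck -includeSnapshots reports: the live path plus one copy per snapshot.
    static long reportedBlocks(int files, int snapshots) {
        return files * blocksPerFile() * (1 + snapshots);
    }

    // What actually exists: each block counted once.
    static long actualBlocks(int files) {
        return files * blocksPerFile();
    }

    // Total size as reported: every path occurrence contributes the full file size.
    static long reportedSize(int files, int snapshots) {
        return files * FILE_SIZE * (1 + snapshots);
    }

    public static void main(String[] args) {
        System.out.println(reportedBlocks(2, 2)); // 12
        System.out.println(actualBlocks(2));      // 4
        System.out.println(reportedSize(2, 2));   // 1258291200
    }
}
```

The results match the fsck output (12 blocks, 1258291200 B) and the Web UI (4 blocks), which is consistent with every snapshot occurrence being counted again.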
