hadoop-hdfs-issues mailing list archives

From "Alex Ivanov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9197) Snapshot FileDiff added to last snapshot when INodeFile accessTime field is updated
Date Mon, 05 Oct 2015 21:32:27 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Ivanov updated HDFS-9197:
------------------------------
    Description: 
Summary
When a file in HDFS is read, its corresponding inode's accessTime field is updated. If the
file is present in the last snapshot, the accessTime change causes a FileDiff to be added
to the SnapshotDiff of the last snapshot.
This behavior has the following problems:
- Since FileDiffs reside in memory on the namenodes, snapshots become progressively more
memory-heavy as the volume of data in HDFS increases. On a system with frequent updates, e.g.
hourly, this becomes a big problem: with, say, 2000 snapshots, one can have 2000 FileDiffs
per file, all pointing to the same inode.
- The FSImage grows tremendously in size, and the upload operation from the standby to the
active namenode takes much longer.
-The generated FileDiff does not contain any useful information that I can see. Since all
FileDiffs for that file are pointing to the same inode, the accessTime they see is the same.-
- I was wrong about the last point. Each FileDiff includes a SnapshotCopy attribute, which
contains the updated accessTime. This may be a feature, but I'd question the value of having
it enabled by default.
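
To make the growth described in the first bullet concrete, here is a back-of-the-envelope sketch in Python. It assumes the worst case from this report: every file is read at least once between consecutive snapshots, so each snapshot ends up with one FileDiff per file. The function name and the file count are hypothetical and used only for illustration; 2000 snapshots is the figure quoted above.

```python
def worst_case_file_diffs(num_files: int, num_snapshots: int) -> int:
    """Upper bound on the number of FileDiff entries held in namenode memory.

    Assumes every file picks up one accessTime-triggered FileDiff per
    snapshot, which is the worst case described in this report.
    """
    if num_files < 0 or num_snapshots < 0:
        raise ValueError("counts must be non-negative")
    return num_files * num_snapshots


if __name__ == "__main__":
    # 2000 snapshots comes from the report; 1,000,000 files is a made-up
    # cluster size chosen only to show the scale of the product.
    print(worst_case_file_diffs(num_files=1_000_000, num_snapshots=2000))
```

With these illustrative inputs the bound is two billion FileDiff entries, even though no file content has changed.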

Configuration:
CDH 5.0.5 (Hadoop 2.3 / 2.4)
We are NOT overriding the default parameter:
DFS_NAMENODE_ACCESSTIME_PRECISION_DEFAULT = 3600000;
Note that it determines the allowed frequency of accessTime field updates: at most once per
hour (3600000 ms) by default.
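
The role of the precision setting can be sketched as a simple gate. This is a simplified Python model, not the actual namenode code: the function name is hypothetical, and the rule it encodes (update only when the previous accessTime is older than the precision window, and treat a precision of 0 as access times being disabled, as the hdfs-default.xml description of dfs.namenode.accesstime.precision states) is my reading of the behavior.

```python
DEFAULT_PRECISION_MS = 3_600_000  # value quoted in this report (1 hour)


def should_update_access_time(now_ms: int,
                              last_access_ms: int,
                              precision_ms: int = DEFAULT_PRECISION_MS) -> bool:
    """Return True if a read at now_ms would refresh the inode's accessTime.

    Simplified model: a precision of 0 (or less) means access times are
    disabled; otherwise the update happens only once the previous accessTime
    is more than precision_ms in the past.
    """
    if precision_ms <= 0:
        return False
    return now_ms > last_access_ms + precision_ms


if __name__ == "__main__":
    # A read 30 minutes after the last recorded access is ignored...
    print(should_update_access_time(now_ms=1_800_000, last_access_ms=0))
    # ...while a read two hours later refreshes accessTime.
    print(should_update_access_time(now_ms=7_200_000, last_access_ms=0))
```

Under this model a file that is read continuously has its accessTime refreshed about once per hour, which lines up with an hourly snapshot schedule.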

How to reproduce:
{code}
[root@node1076]# hdfs dfs -ls /data/tenants/testenv.testtenant/wddata
Found 3 items
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:52 /data/tenants/testenv.testtenant/wddata/folder1
-rw-r--r--   3 hdfs hadoop         38 2015-10-05 03:13 /data/tenants/testenv.testtenant/wddata/testfile1
-rw-r--r--   3 hdfs hadoop         21 2015-10-04 10:45 /data/tenants/testenv.testtenant/wddata/testfile2
[root@node1076]# hdfs dfs -ls /data/tenants/testenv.testtenant/wddata/.snapshot
Found 8 items
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:47 /data/tenants/testenv.testtenant/wddata/.snapshot/sn1
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:47 /data/tenants/testenv.testtenant/wddata/.snapshot/sn2
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:52 /data/tenants/testenv.testtenant/wddata/.snapshot/sn3
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:53 /data/tenants/testenv.testtenant/wddata/.snapshot/sn4
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:57 /data/tenants/testenv.testtenant/wddata/.snapshot/sn5
drwxr-xr-x   - hdfs hadoop          0 2015-10-04 10:58 /data/tenants/testenv.testtenant/wddata/.snapshot/sn6
drwxr-xr-x   - hdfs hadoop          0 2015-10-05 03:13 /data/tenants/testenv.testtenant/wddata/.snapshot/sn7
drwxr-xr-x   - hdfs hadoop          0 2015-10-05 04:20 /data/tenants/testenv.testtenant/wddata/.snapshot/sn8
[root@node1076]# hdfs dfs -createSnapshot /data/tenants/testenv.testtenant/wddata sn9
Created snapshot /data/tenants/testenv.testtenant/wddata/.snapshot/sn9
[root@node1076]# hdfs snapshotDiff /data/tenants/testenv.testtenant/wddata sn8 sn9
Difference between snapshot sn8 and snapshot sn9 under directory /data/tenants/testenv.testtenant/wddata:

################
## IMPORTANT: testfile1 was put into HDFS more than 1 hour ago, so reading it now triggers
## the accessTime update.
################
[root@node1076]# hdfs dfs -cat /data/tenants/testenv.testtenant/wddata/testfile1
This is test file 1, but now it's 11.
[root@node1076]# hdfs dfs -createSnapshot /data/tenants/testenv.testtenant/wddata sn10
Created snapshot /data/tenants/testenv.testtenant/wddata/.snapshot/sn10
[root@node1076]# hdfs snapshotDiff /data/tenants/testenv.testtenant/wddata sn9 sn10
Difference between snapshot sn9 and snapshot sn10 under directory /data/tenants/testenv.testtenant/wddata:
M	./testfile1
{code}
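
The transcript above can be mirrored with a toy in-memory model, which may help when reasoning about the behavior without a cluster. This is a sketch under stated assumptions, not HDFS code: ToyNamespace, ToyFile, and all method names are hypothetical, and the model only tracks which snapshot holds a diff entry for a file after an accessTime refresh.

```python
from dataclasses import dataclass, field

HOUR_MS = 3_600_000  # default precision quoted in this report


@dataclass
class ToyFile:
    access_time_ms: int
    # Number of snapshots that existed when the file was created.
    created_at: int
    # Names of snapshots that hold a diff entry for this file.
    diffs: list = field(default_factory=list)


class ToyNamespace:
    """Toy stand-in for a snapshottable directory (hypothetical model)."""

    def __init__(self, precision_ms: int = HOUR_MS):
        self.precision_ms = precision_ms
        self.files = {}
        self.snapshots = []

    def put(self, name: str, now_ms: int) -> None:
        self.files[name] = ToyFile(now_ms, created_at=len(self.snapshots))

    def create_snapshot(self, snapshot: str) -> None:
        self.snapshots.append(snapshot)

    def cat(self, name: str, now_ms: int) -> None:
        f = self.files[name]
        if now_ms <= f.access_time_ms + self.precision_ms:
            return  # inside the precision window: accessTime is left alone
        in_last_snapshot = len(self.snapshots) > f.created_at
        if in_last_snapshot and self.snapshots[-1] not in f.diffs:
            # The accessTime change alone adds a diff to the last snapshot.
            f.diffs.append(self.snapshots[-1])
        f.access_time_ms = now_ms

    def snapshot_diff(self, older: str, newer: str) -> list:
        start = self.snapshots.index(older)
        end = self.snapshots.index(newer)
        window = self.snapshots[start:end]
        return sorted(name for name, f in self.files.items()
                      if any(s in f.diffs for s in window))


if __name__ == "__main__":
    ns = ToyNamespace()
    ns.put("testfile1", now_ms=0)
    ns.create_snapshot("sn8")
    ns.create_snapshot("sn9")
    print(ns.snapshot_diff("sn8", "sn9"))   # nothing was read in between
    ns.cat("testfile1", now_ms=2 * HOUR_MS)  # read more than 1 hour later
    ns.create_snapshot("sn10")
    print(ns.snapshot_diff("sn9", "sn10"))  # the read alone marks the file
```

In the model, as in the transcript, the file shows up as modified between sn9 and sn10 although only a read took place.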


> Snapshot FileDiff added to last snapshot when INodeFile accessTime field is updated
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9197
>                 URL: https://issues.apache.org/jira/browse/HDFS-9197
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Alex Ivanov



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
