Date: Wed, 3 Jan 2018 08:42:00 +0000 (UTC)
From: "Vinayakumar B (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-12502) SetReplication OutOfMemoryError

    [ https://issues.apache.org/jira/browse/HADOOP-12502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309305#comment-16309305 ]

Vinayakumar B commented on HADOOP-12502:
----------------------------------------

bq. One question though: is it necessary to introduce a new FileSystem API listStatusIterator(final Path p, final PathFilter filter)? From my perspective it seems a useful addition, but doesn't need to be included in this patch. Adding a new FileSystem API is always concerning.

Okay. I had added it so that a custom filter could be passed to the iterator as well. Anyway, that can be taken up in a separate Jira.

bq. Do we know where the most memory was going? Is it the references to all the listStatus() arrays held in the full recursion call tree in fs/shell/Command.java? (Each recursive call passes reference to that level's listStatus() array, meaning whole tree will be held in heap, right?)

Yes, that's right. The whole tree was held in client (command JVM) memory, causing the OOM.
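For illustration, a minimal sketch of that pattern (the class name and structure are made up here; this is not the actual fs/shell/Command.java code): each level of the recursion keeps its full listStatus() array referenced while the subtrees below it are walked, so the whole tree ends up live in the client heap.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveListingSketch {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    process(fs, new Path(args[0]));
  }

  static void process(FileSystem fs, Path dir) throws IOException {
    // The whole child array is materialized up front ...
    FileStatus[] children = fs.listStatus(dir);
    for (FileStatus child : children) {
      // ... apply the command (e.g. setrep) to 'child' here ...
      if (child.isDirectory()) {
        // ... and the array stays referenced while every subtree below it is
        // walked, so with N entries per level and depth D roughly N * D
        // FileStatus objects (plus their Paths) are live at the deepest point.
        process(fs, child.getPath());
      }
    }
  }
}
{code}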
bq. How does this patch fix the OOM issue? Is it because we're now holding RemoteIterators for the whole directory tree in memory, instead of holding the actual listStatus arrays?

The problem shows up when each directory level contains a huge number of children. Consider an example with just two levels: the parent */dir1* contains 10000 subdirectories, and each subdirectory in /dir1 contains 10000 entries.

/dir1/subdir1 --> 10000 entries
/dir1/subdir2 --> 10000 entries
/dir1/subdir3 --> 10000 entries
.
.
/dir1/subdir10000 --> 10000 entries

So the total is *10000 x 10000* entries. While processing each subdirectory, *at least 20000* entries (10000 for the parent + 10000 for the current subdirectory) have to be kept in memory, and this consumption grows as the number of levels increases. With the listStatusIterator() implementation (with a limit of 1000 entries per call) this can be reduced: *at most 2000 entries* need to be kept in memory for the above problem (1000 for the parent + 1000 for the current subdirectory). The remaining entries are loaded on the fly, as in the sketch below.

*Please note that for the LocalFileSystem implementation there will be no difference*, since its listStatusIterator() API uses listStatus() itself internally. But this benefits the HDFS implementation, which implements the listStatusIterator() API on the server side.
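A minimal sketch of the iterator-based traversal, assuming the standard FileSystem#listStatusIterator() API (again, the class name is made up for illustration; this is not the patch's Command/Ls code):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IteratorListingSketch {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    process(fs, new Path(args[0]));
  }

  static void process(FileSystem fs, Path dir) throws IOException {
    // On HDFS the iterator fetches children in pages (1000 entries per call),
    // so at any point only the current page of each directory on the recursion
    // path is held in the client heap, instead of the full listStatus() arrays.
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    while (it.hasNext()) {
      FileStatus child = it.next();
      // ... apply the command (e.g. setrep) to 'child' here ...
      if (child.isDirectory()) {
        process(fs, child.getPath());
      }
    }
  }
}
{code}

For LocalFileSystem the default listStatusIterator() simply wraps listStatus(), so, as noted above, it makes no memory difference there.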
bq. Why are we forcing ChecksumFilesystem to not use listStatusIterator(), and sorting the results here? This could increase memory usage, no? I don't think sorted iterator is required by the FS contract.

Yes, you are right, the FS contract does not ask for sorted items. Removed the _Arrays.sort()_. But since _FileSystem#DEFAULT_FILTER_ includes crc files as well, _listStatusIterator()_ still has to be overridden here.

bq. The Ls.java changes seem tricky. I wonder if there is a simpler way of doing this (idea: Command exposes an overridable boolean isSorted() predicate that Ls.java can override if it needs sorting, and leave the traversal logic in Command instead of mucking with it in Ls?)

Yes, that's a good idea, thank you. Done.

bq. This comment is still true? I'm guessing the intent was "iterative" as in "not recursive", instead of "iterative" as in "using an iterator"

I think the intention was to use an iterator instead of the items array. This patch does exactly that by adding the overloaded {{processPaths(PathData parent, RemoteIterator itemsIterator)}} method, so I removed the comment.

bq. You mean "non-recursive", right? Or maybe "non-iterator".

I meant "non-iterator", i.e. the legacy method.

bq. What about the depth++ depth-- accounting in Command.recursePaths() that you skip here? Is the logic that Ls does not use getDepth()? Seems brittle.

Yes, I had missed this. Now that {{recursePath()}} is handled in {{Command}} itself, {{depth}} is tracked properly.

bq. Why does PathData.getDirectoryContents() sort its listing?

The sorting was added for HADOOP-8140.

bq. I guess this is much of the memory savings. I guess this chunking into 100 works without changing the depth-first search ordering.

I didn't get you here. Can you please explain?

bq. What about the sorting in existing Ls#processPaths()? That changes because we now only sort the batches of 100.

This doesn't change the sorting. If sorting were required, the iterator would have been avoided in the first place. Grouping into batches of 100 items is only needed to format the output by calling {{adjustColumnWidths()}} and keep it readable; otherwise {{adjustColumnWidths()}} would be useless.

bq. I like the idea of chunking the depth first search (DFS) into blocks of 100 and releasing references on the way up. Wouldn't we want to do this in Command instead of Ls? Two reasons: (1) other commands benefit (2) less brittle in terms of how recursion logic is wired up between Command and Ls.

Thanks for the suggestion. Moved it to Command itself. Will post a new patch soon.


> SetReplication OutOfMemoryError
> -------------------------------
>
>                 Key: HADOOP-12502
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12502
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Philipp Schuegerl
>            Assignee: Vinayakumar B
>         Attachments: HADOOP-12502-01.patch, HADOOP-12502-02.patch, HADOOP-12502-03.patch, HADOOP-12502-04.patch, HADOOP-12502-05.patch, HADOOP-12502-06.patch, HADOOP-12502-07.patch
>
>
> Setting the replication of an HDFS folder recursively can run out of memory. E.g. with a large /var/log directory:
> hdfs dfs -setrep -R -w 1 /var/log
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
> 	at java.util.Arrays.copyOfRange(Arrays.java:2694)
> 	at java.lang.String.<init>(String.java:203)
> 	at java.lang.String.substring(String.java:1913)
> 	at java.net.URI$Parser.substring(URI.java:2850)
> 	at java.net.URI$Parser.parse(URI.java:3046)
> 	at java.net.URI.<init>(URI.java:753)
> 	at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> 	at org.apache.hadoop.fs.Path.<init>(Path.java:116)
> 	at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> 	at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:222)
> 	at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:246)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:689)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> 	at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
> 	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
> 	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
> 	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
> 	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
> 	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
> 	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
> 	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
> 	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
> 	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
> 	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
> 	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
> 	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
> 	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
> 	at org.apache.hadoop.fs.shell.SetReplication.processArguments(SetReplication.java:76)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)