Date: Wed, 24 Jul 2013 21:05:50 +0000 (UTC)
From: "Dave Latham (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-8778) Region assigments scan table directory making them slow for huge tables

     [ https://issues.apache.org/jira/browse/HBASE-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-8778:
-------------------------------

    Attachment: HBASE-8778.patch

Attached HBASE-8778.patch which is a patch for trunk. Up at reviewboard at: https://reviews.apache.org/r/12920/

This patch is similar to HBASE-8778-0.94.5-v2 but because it is for trunk only it does not keep two copies (one in tabledir another in .tabledesc subdir) - it only keeps table info files in the known subdir. As a result it also does not use any locking.
The patch still includes some cleanup and refactoring of FSTableDescriptors to consistently enforce the fsreadonly flag and use more instance methods than static methods in order to do so. It also changes snapshots to likewise store their table descriptors in the .tabledesc subdirectory so they continue to share code and look just like table directories (option #1 from the previous comment.) It also adds a migration step to HMaster.finishInitialization which migrates existing snapshot directories, user tables, and system tables to store descriptors in the .tabledesc subdirectory.

Going to run it by Hadoop QA and welcome review and comment.

> Region assigments scan table directory making them slow for huge tables
> -----------------------------------------------------------------------
>
>                 Key: HBASE-8778
>                 URL: https://issues.apache.org/jira/browse/HBASE-8778
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Dave Latham
>            Assignee: Dave Latham
>             Fix For: 0.98.0, 0.95.2, 0.94.11
>
>         Attachments: 8778-dirmodtime.txt, HBASE-8778-0.94.5.patch, HBASE-8778-0.94.5-v2.patch, HBASE-8778.patch
>
>
> On a table with 130k regions it takes about 3 seconds for a region server to open a region once it has been assigned.
> Watching the threads for a region server running 0.94.5 that is opening many such regions shows the thread opening the region in code like this:
> {noformat}
> "PRI IPC Server handler 4 on 60020" daemon prio=10 tid=0x00002aaac07e9000 nid=0x6566 runnable [0x000000004c46d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.String.indexOf(String.java:1521)
>         at java.net.URI$Parser.scan(URI.java:2912)
>         at java.net.URI$Parser.parse(URI.java:3004)
>         at java.net.URI.<init>(URI.java:736)
>         at org.apache.hadoop.fs.Path.initialize(Path.java:145)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:126)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>         at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:252)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:311)
>         at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:159)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
>         at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1168)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:269)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:255)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoModtime(FSTableDescriptors.java:368)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:155)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:126)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2834)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2807)
>         at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> {noformat}
> To open the region, the region server first loads the latest HTableDescriptor. Since HBASE-4553, HTableDescriptors are stored in the file system at "/hbase//.tableinfo.". The file with the largest sequenceNum is the current descriptor. This is done so that the current descriptor is updated atomically. However, since the filename is not known in advance, FSTableDescriptors has to do a FileSystem.listStatus operation, which has to list all files in the directory to find it. The directory also contains all the region directories, so in our case it has to load 130k FileStatus objects. Even using a globStatus matching function still transfers all the objects to the client before performing the pattern matching. Furthermore, HDFS uses a default of transferring 1000 directory entries in each RPC call, so it requires 130 roundtrips to the namenode to fetch all the directory entries.
> Consequently, reassigning all the regions of a table (or a constant fraction thereof) requires time proportional to the square of the number of regions.
> In our case, if a region server fails with 200 such regions, it takes 10+ minutes for them all to be reassigned, after the zk expiration and log splitting.
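
The arithmetic in the description above can be checked with a small standalone sketch. It does not call HDFS or HBase; the class and method names are hypothetical, the example file names are made up, and it assumes descriptor files are named ".tableinfo." followed by a numeric sequence number. The 1000 entries per listing RPC, the 130k directory entries, and the 3 seconds per region open are the figures quoted in the description.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not the FSTableDescriptors implementation.
public class TableInfoListingCost {
    static final String PREFIX = ".tableinfo.";   // assumed descriptor file prefix
    static final int ENTRIES_PER_RPC = 1000;      // HDFS default quoted in the description

    // Number of namenode round trips needed to list a directory once.
    static long listingRpcs(long dirEntries) {
        return (dirEntries + ENTRIES_PER_RPC - 1) / ENTRIES_PER_RPC;
    }

    // Every region open lists the whole table directory, so the cost multiplies.
    static long totalListingRpcs(long dirEntries, long regionsToOpen) {
        return regionsToOpen * listingRpcs(dirEntries);
    }

    // Pick the descriptor with the largest sequence number from a full listing.
    static Optional<String> currentTableInfo(List<String> names) {
        String best = null;
        long bestSeq = -1;
        for (String name : names) {
            if (!name.startsWith(PREFIX)) {
                continue; // region directories and other entries are skipped
            }
            long seq;
            try {
                seq = Long.parseLong(name.substring(PREFIX.length()));
            } catch (NumberFormatException e) {
                continue; // not a numeric suffix, ignore
            }
            if (seq > bestSeq) {
                bestSeq = seq;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        long entries = 130_000; // round figure for a 130k-region table
        System.out.println("RPCs per region open: " + listingRpcs(entries));
        System.out.println("RPCs to reopen 200 regions: " + totalListingRpcs(entries, 200));
        System.out.println("RPCs to reopen every region: " + totalListingRpcs(entries, entries));
        System.out.println("Seconds for 200 opens at 3 s each: " + (200 * 3));
        System.out.println(currentTableInfo(
            List.of("regiondir-a", ".tableinfo.0000000002", ".tableinfo.0000000003")));
    }
}
```

With these figures, one region open costs 130 listing round trips, 200 opens cost 26,000, and 200 opens at about 3 seconds each come to 600 seconds, which is consistent with the 10+ minutes reported. Keeping the descriptor files in a subdirectory that holds nothing else, as the patch does with .tabledesc, makes the listing size independent of the number of regions.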