From: "Eric Newton (JIRA)"
Reply-To: jira@apache.org
To: notifications@accumulo.apache.org
Date: Thu, 4 Feb 2016 21:21:39 +0000 (UTC)
Subject: [jira] [Updated] (ACCUMULO-4120) large root tablet causes system failure

     [ https://issues.apache.org/jira/browse/ACCUMULO-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Newton updated ACCUMULO-4120:
----------------------------------
    Fix Version/s:     (was: 1.8.0)
                       1.7.0

> large root tablet causes system failure
> ----------------------------------------
>
>                 Key: ACCUMULO-4120
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4120
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.4, 1.7.0
>         Environment: 300 node test cluster
>            Reporter: Eric Newton
>             Fix For: 1.7.0
>
>
> On a large test cluster, a table was storing very large row keys that were similar for the first 1-10M per row id (which is a bad schema... but not the problem, yet).
> The large row keys made the tablet large, so it split, and the first 1-10M of the row keys was stored in the metadata table.
> The metadata table has a small split size, so it split as well.
> This ended up recording several keys in the root tablet that were very large. For example, a single metadata table file was 700M (compressed) and contained 34 keys.
> The problem is that *everyone* wants to read the root tablet to find the metadata tablets, and that was causing the tablet server hosting the root tablet to run out of heap.
> Possible solution: bring down the cluster and put it in "safe mode", where only the metadata table is brought online. Raise the split size of the metadata table to something large (1G?), then merge the metadata table, which should remove the large records from the root tablet.
> There's a utility (SplitLarge) that can be used to remove large keys from the RFiles of the offending table. Once the ridiculous keys are stripped out, the table can be brought online and merged, which will remove the large keys from the metadata table.
> As long as this is done on a small number of nodes, the servers should have enough memory to satisfy the metadata table queries and updates.
> We may want to consider adding a key-size check to the metadata table constraints to prevent this in the future.
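A minimal sketch of the split-threshold and merge steps from the proposed recovery procedure, using the Accumulo Java client API. The instance name, ZooKeeper hosts, and credentials below are placeholders, and the "safe mode" restart itself is an operational step that a client program cannot perform.

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.conf.Property;
    import org.apache.accumulo.core.metadata.MetadataTable;

    public class MetadataMergeSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute the real instance, ZooKeepers, and credentials.
        Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Raise the metadata table split threshold so the merged tablets do not immediately re-split.
        conn.tableOperations().setProperty(MetadataTable.NAME,
            Property.TABLE_SPLIT_THRESHOLD.getKey(), "1G");

        // Merge the entire metadata table (null start/end rows mean the whole range),
        // which should remove the oversized entries from the root tablet.
        conn.tableOperations().merge(MetadataTable.NAME, null, null);
      }
    }

The same two steps can also be done from the Accumulo shell; the point of the sketch is only the ordering: raise table.split.threshold first, then merge.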
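On the last point, a hedged sketch of what a key-size check could look like as an Accumulo Constraint. The class name and the 1 MB limit are invented for illustration; this is not an existing Accumulo constraint.

    import java.util.Collections;
    import java.util.List;

    import org.apache.accumulo.core.constraints.Constraint;
    import org.apache.accumulo.core.data.ColumnUpdate;
    import org.apache.accumulo.core.data.Mutation;

    // Hypothetical constraint that rejects mutations whose row or column elements
    // exceed a fixed size, sketching the "limit key size in the metadata table" idea.
    public class KeySizeConstraint implements Constraint {

      private static final int MAX_ELEMENT_SIZE = 1 << 20; // 1 MB, arbitrary example limit
      private static final short TOO_LARGE = 1;

      @Override
      public String getViolationDescription(short violationCode) {
        if (violationCode == TOO_LARGE)
          return "row or column element larger than " + MAX_ELEMENT_SIZE + " bytes";
        return null;
      }

      @Override
      public List<Short> check(Environment env, Mutation mutation) {
        if (mutation.getRow().length > MAX_ELEMENT_SIZE)
          return Collections.singletonList(TOO_LARGE);
        for (ColumnUpdate update : mutation.getUpdates()) {
          if (update.getColumnFamily().length > MAX_ELEMENT_SIZE
              || update.getColumnQualifier().length > MAX_ELEMENT_SIZE
              || update.getColumnVisibility().length > MAX_ELEMENT_SIZE)
            return Collections.singletonList(TOO_LARGE);
        }
        return null; // null means no violations
      }
    }

A class like this would go on the servers' classpath and be attached with tableOperations().addConstraint(MetadataTable.NAME, KeySizeConstraint.class.getName()).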
-- This message was sent by Atlassian JIRA (v6.3.4#6332)