From: Adar Lieber-Dembo
Date: Mon, 13 Aug 2018 14:40:52 -0700
Subject: Re: How to decrease kudu server restart time
To: user@kudu.apache.org

> Even if the kudu server started, it also spent too much copying tablet, as the following tablet block copying log:
>
> Tablet 1ecbe230e14a4d9f9125dbc49c32860e of
table 'impala::venus.ods_xk_pay_fee_order' is under-replicated: 1 replica(s) not RUNNING
> 41e4489d38924c85a4810bd33ef60d80 (bj-yz-hadoop01-1-12:7050): bad state
> State: INITIALIZED
> Data state: TABLET_DATA_COPYING
> Last status: Tablet Copy: Downloading block 0000000084111077 (299837/1177225)
> 52a9ede038a04566860ecd2e54388738 (bj-yz-hadoop01-1-51:7050): RUNNING
> b133f6fd0c274b93b21ffcbdcbbde830 (bj-yz-hadoop01-1-14:7050): RUNNING [LEADER]

I see that this tablet has over a million blocks, but how are you measuring that it's spending too much time copying? How much time did it take to fully copy this tablet?

> 1. It seems kudu server spent a long time to open log block container, how to speed up restarting kudu server ?

Your Kudu server log should contain some log messages that'll help us understand what's going on. Look for a message like "Time spent opening block manager" and paste that. Also can you find and paste the "FS layout report"?

In general, the more blocks (and thus block containers) you have, the longer it'll take Kudu to restart. KUDU-2014 has some ideas on how we might improve this.

Once a tserver is deemed dead and its data is rereplicated elsewhere, you can just reformat the node (i.e. delete the contents of the WAL, metadata, and data directories). Its contents are no longer necessary, and this will reset the number of log block containers to 0, which will speed up subsequent restarts.

> 2. I think the number of blocks have an influence on kudu server restarting time and query time on specific tablet, more number of blocks, more restarting time and query time. Is this right ?

Yes to restarting time, but not necessarily to query time. It really depends on the kinds of queries you're issuing, how many predicates they have, etc.

> 3. Why there are more than 1 million blocks in a tablet, as shown in above Tablet Copy log, while there are less than 500 thousands of records in the tablet ?

That's an excellent question.
What kind of write workload do you have? What's your table schema and partitioning? Do you have any non-standard flags defined that may affect how Kudu flushes or compacts its data?

I'd also suggest running the CLI tool 'kudu local_replica data_size' on that large replica you described above. It will help identify whether this is a case of very large tablets, or just high numbers of blocks.

> 4. How to reduce the number of block in tablet ?

Once you answer the questions I posed just above, I might be able to offer some recommendations for how to reduce the overall number of blocks.
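
On the question of how long the copy takes: one rough way to put a number on it is to sample the "Last status" text twice and extrapolate. The sketch below is only an illustration, not a Kudu tool. The function names are made up, the regular expression assumes the status text has the same shape as the line quoted above, and the estimate assumes blocks are copied at a roughly constant rate, which may not hold in practice.

```python
import re

# Assumed shape of the status text, taken from the line quoted above:
#   "Tablet Copy: Downloading block 0000000084111077 (299837/1177225)"
_STATUS_RE = re.compile(r"Downloading block \d+ \((\d+)/(\d+)\)")


def parse_copy_progress(status):
    """Return (blocks_done, blocks_total), or None if the text doesn't match."""
    match = _STATUS_RE.search(status)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def estimate_remaining_seconds(first, second, elapsed_seconds):
    """Extrapolate the time left from two (done, total) samples.

    The samples are taken elapsed_seconds apart. Returns None when no
    progress was observed, since no rate can be computed in that case.
    """
    blocks_copied = second[0] - first[0]
    if blocks_copied <= 0 or elapsed_seconds <= 0:
        return None
    blocks_per_second = blocks_copied / elapsed_seconds
    return (second[1] - second[0]) / blocks_per_second


if __name__ == "__main__":
    status = "Tablet Copy: Downloading block 0000000084111077 (299837/1177225)"
    done, total = parse_copy_progress(status)
    print(f"{done}/{total} blocks copied ({100.0 * done / total:.1f}%)")
```

Feeding it two samples taken a minute or so apart gives a ballpark figure for the remaining copy time, which is easier to reason about than "too much time".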