From: Aaron Turner
Date: Mon, 15 Aug 2016 10:00:35 -0700
Subject: Hadoop archives (.har) are really really slow
To: user@hadoop.apache.org

Basically, I want to list all the files in a .har file and compare the file list and sizes to an existing directory in HDFS. The problem is that running commands like:

    hdfs dfs -ls -R

is orders of magnitude slower than running the same command against a live HDFS file system.

How much slower? I've calculated it will take ~19 days to list all the files in 250TB worth of content spread between 2 .har files.

Is this normal? Can I do this faster (write a map/reduce job, etc.)?

-- 
Aaron Turner
https://synfin.net/  Twitter: @synfinatic
Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
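For reference, the comparison step described above (file list and sizes in the archive versus the live directory) could be sketched roughly as follows. This is only an illustration of the comparison, not a way to make the listing itself faster. It assumes the usual eight-column `hdfs dfs -ls -R` output (permissions, replication, owner, group, size, date, time, path) has already been saved as text; the sample listings, the `har:///archives/a.har` and `/data` prefixes, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical sample listings in the usual `hdfs dfs -ls -R` column layout.
HAR_LISTING = """\
drwxr-xr-x   - hdfs hadoop          0 2016-08-01 12:00 har:///archives/a.har/logs
-rw-r--r--   3 hdfs hadoop       1024 2016-08-01 12:00 har:///archives/a.har/logs/x.log
-rw-r--r--   3 hdfs hadoop       2000 2016-08-01 12:00 har:///archives/a.har/logs/y.log
-rw-r--r--   3 hdfs hadoop        512 2016-08-01 12:00 har:///archives/a.har/logs/z.log
"""

LIVE_LISTING = """\
drwxr-xr-x   - hdfs hadoop          0 2016-08-01 12:00 /data/logs
-rw-r--r--   3 hdfs hadoop       1024 2016-08-01 12:00 /data/logs/x.log
-rw-r--r--   3 hdfs hadoop       2048 2016-08-01 12:00 /data/logs/y.log
"""


def parse_ls_output(text: str) -> Dict[str, int]:
    """Map each file path to its size in bytes, skipping directories."""
    sizes: Dict[str, int] = {}
    for line in text.splitlines():
        # At most 8 fields, so a path containing spaces stays in one piece.
        fields = line.split(None, 7)
        if len(fields) != 8 or not fields[4].isdigit():
            continue  # not a listing row (blank line, "Found N items", ...)
        if fields[0].startswith("d"):
            continue  # directory entry
        sizes[fields[7]] = int(fields[4])
    return sizes


def strip_prefix(sizes: Dict[str, int], prefix: str) -> Dict[str, int]:
    """Re-key paths relative to `prefix` so two trees can be compared."""
    return {
        path[len(prefix):].lstrip("/"): size
        for path, size in sizes.items()
        if path.startswith(prefix)
    }


def compare(
    a: Dict[str, int], b: Dict[str, int]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (paths only in a, paths only in b, paths whose sizes differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    size_diff = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, size_diff


if __name__ == "__main__":
    har = strip_prefix(parse_ls_output(HAR_LISTING), "har:///archives/a.har")
    live = strip_prefix(parse_ls_output(LIVE_LISTING), "/data")
    only_har, only_live, differing = compare(har, live)
    print("only in archive:", only_har)
    print("only in live dir:", only_live)
    print("size mismatch:", differing)
```

Working from saved listings means each tree only has to be listed once, however slow that listing is, and the comparison itself can be rerun as often as needed.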