Subject: har file globbing problem
From: Dan Buchan <dan@rentify.com>
To: user@hadoop.apache.org
Date: Thu, 20 Feb 2014 16:41:49 +0000

We have a dataset of ~8 million files, about 0.5 to 2 MB each, and we're having trouble getting them analysed after building a har file.

The files are already in a pre-existing directory structure, with two nested sets of dirs and 20-100 pdfs at the bottom of each leaf of the dir tree:

    /user/hadoop/all_the_files/*/*/*.pdf

It was trivial to move these to HDFS and to build a har archive; I used the following command to make the archive:

    bin/hadoop archive -archiveName test.har -p /user/hadoop/ all_the_files/*/*/ /user/hadoop/

Listing the contents of the har (bin/hadoop fs -lsr har:///user/hadoop/test.har) shows everything as I'd expect.
When we come to run the Hadoop job with this command, trying to wildcard the archive:

    bin/hadoop jar My.jar har:///user/hadoop/test.har/all_the_files/*/*/ output

it fails with the following exception:

    Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string

Running the job over the non-archived files is fine, i.e.:

    bin/hadoop jar My.jar all_the_files/*/*/ output

However, this only works for our modest test set of files; any substantial number of files quickly makes the namenode run out of memory.

Can you use file globs with har archives? Is there a different way to build the archive to include just the files, which I've missed?
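One workaround I can see is to stop passing the glob on the command line and instead expand it myself in the job driver with FileSystem.globStatus, adding each match as an explicit input path. A rough sketch of what I mean is below; HarGlobDriver, the job name, and the job wiring are placeholders for whatever My.jar actually does, and it assumes the Hadoop 2 mapreduce API:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class HarGlobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pdf analysis");

            // Expand the glob against the har filesystem ourselves, then
            // hand each match to the job as an explicit input path.
            Path pattern = new Path("har:///user/hadoop/test.har/all_the_files/*/*");
            FileSystem fs = pattern.getFileSystem(conf);
            FileStatus[] matches = fs.globStatus(pattern);
            if (matches == null || matches.length == 0) {
                throw new IOException("Glob matched nothing: " + pattern);
            }
            for (FileStatus match : matches) {
                FileInputFormat.addInputPath(job, match.getPath());
            }

            // ... mapper/reducer/output settings as in My.jar ...
        }
    }

If globs are supposed to work directly against har: paths, though, I'd much rather fix my command line than carry this code around.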
I appreciate that a sequence file might be a better fit for this task, but I'd like to know the solution to this issue if there is one.
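If it does come to sequence files, I assume the packing step would look something like the sketch below: one record per pdf, keyed by its original path, with the raw bytes as the value. PdfPacker, the output path, and the Text/BytesWritable choices are just my guesses at a reasonable layout:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PdfPacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path out = new Path("/user/hadoop/all_the_files.seq");
            Path pattern = new Path("/user/hadoop/all_the_files/*/*/*.pdf");
            FileStatus[] pdfs = fs.globStatus(pattern);
            if (pdfs == null || pdfs.length == 0) {
                throw new IOException("No pdfs matched " + pattern);
            }

            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            try {
                // One record per pdf: key = original path, value = raw bytes.
                for (FileStatus pdf : pdfs) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    InputStream in = fs.open(pdf.getPath());
                    try {
                        IOUtils.copyBytes(in, buffer, conf, false);
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(pdf.getPath().toString()),
                                  new BytesWritable(buffer.toByteArray()));
                }
            } finally {
                writer.close();
            }
        }
    }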
--
t. 020 7739 3277
a. 131 Shoreditch High Street, London E1 6JE