Subject: har file globbing problem
From: Dan Buchan <dan@rentify.com>
To: user@hadoop.apache.org
Date: Thu, 20 Feb 2014 16:41:49 +0000

We have a dataset of ~8 million files, about 0.5 to 2 MB each, and we're having trouble getting them analysed after building a har file.

The files are already in a pre-existing directory structure, with two nested sets of dirs and 20-100 pdfs at the bottom of each leaf of the dir tree:

    /user/hadoop/all_the_files/*/*/*.pdf

It was trivial to move these to HDFS and to build a har archive; I used the following command to make the archive:

    bin/hadoop archive -archiveName test.har -p /user/hadoop/ all_the_files/*/*/ /user/hadoop/

Listing the contents of the har (bin/hadoop fs -lsr har:///user/hadoop/test.har) shows everything as I'd expect.
When we come to run the Hadoop job with this command, trying to wildcard the archive:

    bin/hadoop jar My.jar har:///user/hadoop/test.har/all_the_files/*/*/ output

it fails with the following exception:

    Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string

Running the job over the non-archived files is fine, i.e.:

    bin/hadoop jar My.jar all_the_files/*/*/ output

However, this only works for our modest test set of files; any substantial number of files quickly makes the namenode run out of memory.

Can you use file globs with har archives? Is there a different way to build the archive to include just the files, which I've missed?
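One workaround I can see is to stop passing the glob on the command line and instead expand it myself in the job driver with FileSystem.globStatus, adding each match as an explicit input path. A rough sketch of what I mean is below; HarGlobDriver, the job name, and the job wiring are placeholders for whatever My.jar actually does, and it assumes the Hadoop 2 mapreduce API:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class HarGlobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pdf analysis");

            // Expand the glob against the har filesystem ourselves, then
            // hand each match to the job as an explicit input path.
            Path pattern = new Path("har:///user/hadoop/test.har/all_the_files/*/*");
            FileSystem fs = pattern.getFileSystem(conf);
            FileStatus[] matches = fs.globStatus(pattern);
            if (matches == null || matches.length == 0) {
                throw new IOException("Glob matched nothing: " + pattern);
            }
            for (FileStatus match : matches) {
                FileInputFormat.addInputPath(job, match.getPath());
            }

            // ... mapper/reducer/output settings as in My.jar ...
        }
    }

If globs are supposed to work directly against har: paths, though, I'd much rather fix my command line than carry this code around.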
I appreciate that a sequence file might be a better fit for this task, but I'd like to know the solution to this issue if there is one.
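If it does come to sequence files, I assume the packing step would look something like the sketch below: one record per pdf, keyed by its original path, with the raw bytes as the value. PdfPacker, the output path, and the Text/BytesWritable choices are just my guesses at a reasonable layout:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PdfPacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path out = new Path("/user/hadoop/all_the_files.seq");
            Path pattern = new Path("/user/hadoop/all_the_files/*/*/*.pdf");
            FileStatus[] pdfs = fs.globStatus(pattern);
            if (pdfs == null || pdfs.length == 0) {
                throw new IOException("No pdfs matched " + pattern);
            }

            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            try {
                // One record per pdf: key = original path, value = raw bytes.
                for (FileStatus pdf : pdfs) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    InputStream in = fs.open(pdf.getPath());
                    try {
                        IOUtils.copyBytes(in, buffer, conf, false);
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(pdf.getPath().toString()),
                                  new BytesWritable(buffer.toByteArray()));
                }
            } finally {
                writer.close();
            }
        }
    }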
--
t. 020 7739 3277
a. 131 Shoreditch High Street, London E1 6JE