hadoop-common-user mailing list archives

From "W.P. McNeill" <bill...@gmail.com>
Subject Re: Processing small xml files
Date Tue, 14 Feb 2012 18:56:40 GMT
I'm not sure what you mean by "flat format" here.

In my scenario, I have a file input.xml that looks like this.


input.xml is a plain text file, not a sequence file. If I read it with the
XMLInputFormat, my mapper gets called with (key, value) pairs that look like

(nnnn, <section><value>1</value></section>)
(nnnn, <section><value>2</value></section>)

Where the keys are numerical offsets into the file. I then use this
information to write a sequence file with these (key, value) pairs. So my
Hadoop job that uses XMLInputFormat takes a text file as input and produces
a sequence file as output.
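
To make the (offset, element) pairs concrete, here is a minimal sketch in
plain Java with no Hadoop dependency. It only imitates the splitting step;
it does not write a sequence file. The class name SectionSplitter, the
sample document (including its <root> wrapper), and the choice of the start
tag's character offset as the key are all assumptions for illustration. The
real XMLInputFormat may compute its offsets differently.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: imitates how an XML-aware
// input format could cut a text file into (offset, element) records.
class SectionSplitter {

    // Returns one (offset, element) pair per startTag...endTag span.
    // The key is the character offset of the start tag (an assumption).
    static List<Map.Entry<Long, String>> split(String text, String startTag, String endTag) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = text.indexOf(startTag, pos);
            if (start < 0) {
                break; // no more records
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated element: stop rather than guess
            }
            end += endTag.length();
            records.add(new SimpleEntry<Long, String>(Long.valueOf(start), text.substring(start, end)));
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed sample input; the wrapper element is made up.
        String xml = "<root>\n"
                + "<section><value>1</value></section>\n"
                + "<section><value>2</value></section>\n"
                + "</root>\n";
        for (Map.Entry<Long, String> r : split(xml, "<section>", "</section>")) {
            System.out.println("(" + r.getKey() + ", " + r.getValue() + ")");
        }
    }
}
```

In the actual job, each of these pairs would be what the mapper receives,
and writing them out unchanged through a sequence file output format gives
the sequence file described above.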

I don't know a rule of thumb for how many small files is too many. Maybe
someone else on the list can chime in. I just know that when your
throughput gets slow, that's one possible cause to investigate.
