hadoop-common-user mailing list archives

From Srinivas Surasani <vas...@gmail.com>
Subject Re: Processing small xml files
Date Sat, 18 Feb 2012 07:37:57 GMT
Hi Mohit,

You can use Pig to process XML files. PiggyBank has a built-in load
function for XML files.
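
As a rough illustration of what a tag-based XML loader does, here is a
minimal Python sketch. It is not PiggyBank's code (the PiggyBank loader
is XMLLoader, which takes the name of the element to extract); the
function name extract_records is made up for this example, and it
assumes the chosen element is not nested inside itself. It pulls each
<section> element out of the sample file quoted below, together with
its byte offset:

```python
def extract_records(data: bytes, tag: str):
    """Yield (byte_offset, element_text) for each <tag>...</tag> in data.

    Hypothetical helper for illustration only: plain substring scanning,
    no real XML parsing, and no support for same-named nested elements.
    """
    open_plain = f"<{tag}>".encode()   # start tag without attributes
    open_attrs = f"<{tag} ".encode()   # start tag followed by attributes
    close = f"</{tag}>".encode()
    pos = 0
    while True:
        candidates = (data.find(open_plain, pos), data.find(open_attrs, pos))
        starts = [i for i in candidates if i != -1]
        if not starts:
            return
        start = min(starts)
        end = data.find(close, start)
        if end == -1:
            return  # unterminated element: stop scanning
        end += len(close)
        yield start, data[start:end].decode("utf-8")
        pos = end


sample = (
    b"<myfile>\n"
    b"  <section>\n     <value>1</value>\n  </section>\n"
    b"  <section>\n     <value>2</value>\n  </section>\n"
    b"</myfile>\n"
)

for offset, record in extract_records(sample, "section"):
    # One (offset, element) pair per <section>, whitespace left as-is.
    print(offset, " ".join(record.split()))
```

Each pair corresponds to one record handed to the mapper or to Pig.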
You can also set pig.splitCombination and pig.maxCombinedSplitSize so
that many small files are combined into fewer, larger map inputs.
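
To show why this helps with small files, here is a simplified Python
sketch of the idea: pig.splitCombination turns combining on or off, and
pig.maxCombinedSplitSize caps the combined size in bytes. This is not
Pig's implementation (the real combiner may take other factors into
account), and combine_splits is a hypothetical name used only here. It
greedily packs consecutive splits until adding the next one would
exceed the cap:

```python
def combine_splits(sizes, max_combined_size):
    """Group consecutive split sizes so each group totals at most
    max_combined_size. A single split larger than the cap gets a
    group of its own. Illustrative only, not Pig's algorithm.
    """
    groups, current, total = [], [], 0
    for size in sizes:
        # Close the current group if this split would push it over the cap.
        if current and total + size > max_combined_size:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


# 1000 small files of 1 MB each, with a 128 MB cap (sizes in MB here
# for readability; the Pig property itself is given in bytes).
groups = combine_splits([1] * 1000, 128)
print(len(groups), "combined inputs instead of 1000")
```

Without combining, each small file would typically be handled by its
own map task; with it, the number of map inputs drops to the number of
groups.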

On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill <billmcn@gmail.com> wrote:
>> I'm not sure what you mean by "flat format" here.
>> In my scenario, I have a file input.xml that looks like this.
>> <myfile>
>>   <section>
>>      <value>1</value>
>>   </section>
>>   <section>
>>      <value>2</value>
>>   </section>
>> </myfile>
>> input.xml is a plain text file. Not a sequence file. If I read it with the
>> XMLInputFormat my mapper gets called with (key, value) pairs that look like
>> this:
>> (nnnn, <section><value>1</value></section>)
>> (nnnn, <section><value>2</value></section>)
>> Where the keys are numerical offsets into the file. I then use this
>> information to write a sequence file with these (key, value) pairs. So my
>> Hadoop job that uses XMLInputFormat takes a text file as input and produces
>> a sequence file as output.
>> I don't know a rule of thumb for how many small files is too many. Maybe
>> someone else on the list can chime in. I just know that when your
>> throughput gets slow that's one possible cause to investigate.
> I need to install Hadoop. Does XMLInputFormat come as part of the
> install? Can you please give me some pointers that would help me install
> Hadoop and XMLInputFormat if necessary?

-- Srinivas
