hadoop-common-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Processing small xml files
Date Sat, 18 Feb 2012 14:12:47 GMT
On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani <vasajb@gmail.com> wrote:

> Hi Mohit,
>
> You can use Pig to process XML files. PiggyBank has a built-in load
> function for XML files. You can also set pig.maxCombinedSplitSize and
> pig.splitCombination for efficient processing.
>

I can't find any examples of XML processing in Pig. Could you please send
me some pointers? Basically, I need to use Hadoop to convert my XML into a
more structured format so that I can write it to a database.

>
> On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill <billmcn@gmail.com> wrote:
> >
> >> I'm not sure what you mean by "flat format" here.
> >>
> >> In my scenario, I have a file input.xml that looks like this.
> >>
> >> <myfile>
> >>   <section>
> >>      <value>1</value>
> >>   </section>
> >>   <section>
> >>      <value>2</value>
> >>   </section>
> >> </myfile>
> >>
> >> input.xml is a plain text file, not a sequence file. If I read it with
> >> the XMLInputFormat, my mapper gets called with (key, value) pairs that
> >> look like this:
> >>
> >> (nnnn, <section><value>1</value></section>)
> >> (nnnn, <section><value>2</value></section>)
> >>
> >> Where the keys are numerical offsets into the file. I then use this
> >> information to write a sequence file with these (key, value) pairs. So
> >> my Hadoop job that uses XMLInputFormat takes a text file as input and
> >> produces a sequence file as output.
> >>
> >> I don't know a rule of thumb for how many small files is too many. Maybe
> >> someone else on the list can chime in. I just know that when your
> >> throughput gets slow that's one possible cause to investigate.
> >>
> >
> > I need to install Hadoop. Does this XMLInputFormat come as part of the
> > install? Can you please give me some pointers that would help me install
> > Hadoop and XMLInputFormat if necessary?
>
>
>
> --
> -- Srinivas
> Srinivas@Cloudwick.com
>
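
For readers following the thread, here is a rough sketch of the record splitting described in the quoted message: each <section> element becomes a (key, value) pair keyed by its offset into the file, and each record is then flattened into a row that could be written to a database. This is plain Python for illustration only, not the XMLInputFormat code and not a Hadoop job. The names split_records and to_row are hypothetical, the regular expression is a stand-in for the real tag scanning, and the offsets are character offsets, which match byte offsets only for ASCII input.

```python
import re
import xml.etree.ElementTree as ET

# Sample input copied from the quoted message.
SAMPLE = """<myfile>
  <section>
     <value>1</value>
  </section>
  <section>
     <value>2</value>
  </section>
</myfile>
"""


def split_records(text, tag):
    """Yield (offset, record) for every <tag>...</tag> block in text.

    Hypothetical helper: mimics the (key, value) pairs a mapper would
    receive, where the key is the offset of the record in the file.
    Assumes the tag has no attributes and is not nested inside itself.
    """
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    for match in pattern.finditer(text):
        yield match.start(), match.group(0)


def to_row(record):
    """Parse one record and return a flat dict of child tag -> text.

    Hypothetical helper: this is the "more structured format" step,
    producing something that maps directly onto database columns.
    """
    element = ET.fromstring(record)
    return {child.tag: (child.text or "").strip() for child in element}


# One (offset, row) pair per <section> element.
rows = [(offset, to_row(record))
        for offset, record in split_records(SAMPLE, "section")]

for offset, row in rows:
    print(offset, row)
```

In a real job the splitting would be done by the input format and to_row would live in the mapper; the sketch only shows the shape of the data at each step.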
