hadoop-common-user mailing list archives

From Vipul Sharma <sharmavi...@gmail.com>
Subject Re: XML input to map function
Date Tue, 03 Nov 2009 00:39:22 GMT
Okay, I think I was not clear about the question in my first post. Let me try again.

I have an application that receives a large number of xml files every minute, which
are copied over to hdfs. Each file is around 1 MB and contains several
records. The files are well-formed xml with a starting tag <startingtag>
and an end tag </startingtag> in each xml file. I want to parse these files and
put the relevant output data into hbase.

Now as input to the map function I could read all the unread files into a string
and parse them inside the map function using DOM or something like that. But
then how do I deal with the multiple starting tags <startingtag> and ending tags
</startingtag> in the string, since we concatenated several files together?
And how do I manage splits, since hadoop will want to split at the default
block size, which might break the well-formed structure of the xml files?
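For illustration, one way to deal with several well-formed documents concatenated into one string is to scan for start/end tag pairs and parse each span on its own. This is only a minimal sketch, assuming the record tag is literally <startingtag> and the same tag is not nested inside a document (the <r> child element is just a placeholder for the real record contents):

```python
import xml.etree.ElementTree as ET

def extract_documents(text, tag="startingtag"):
    """Split a string holding several concatenated XML documents
    into one string per <tag>...</tag> span."""
    start, end = "<%s>" % tag, "</%s>" % tag
    docs = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        j = text.find(end, i)
        if j == -1:
            break  # truncated document (e.g. cut by a split): skip the tail
        j += len(end)
        docs.append(text[i:j])
        pos = j
    return docs

# Parse each recovered document independently.
concatenated = ("<startingtag><r>a</r></startingtag>"
                "<startingtag><r>b</r></startingtag>")
for doc in extract_documents(concatenated):
    root = ET.fromstring(doc)
    print(root.find("r").text)
```

A truncated trailing document (one whose end tag fell into the next split) is simply dropped here; a real job would need the record reader to keep whole records together instead.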

The other way to go about it would be to have a for loop in the driver class and
provide one file at a time. I don't think that is a good way, since the files are
very small and we would get almost no parallelization.

Is there a way that I can input a list or array of files to the map function and
do the parsing inside the map function? How would I take care of the splits and
the xml tags if I do that?
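If Hadoop Streaming with the StreamXmlRecordReader turns out to be workable, the mapper itself can stay simple: it just receives the xml records on stdin, parses them, and emits tab-separated key/value pairs. A rough sketch of such a mapper in Python, where <startingtag> and the <id>/<value> child elements are placeholders for whatever the real records contain:

```python
import re
import sys
import xml.etree.ElementTree as ET

# Non-greedy match so back-to-back records are not merged into one.
RECORD_RE = re.compile(r"<startingtag>.*?</startingtag>", re.DOTALL)

def map_records(text):
    """Yield (key, value) pairs for every record found in text."""
    for match in RECORD_RE.finditer(text):
        root = ET.fromstring(match.group(0))
        # 'id' and 'value' are hypothetical child elements; replace
        # with the fields your records actually carry.
        yield root.findtext("id"), root.findtext("value")

if __name__ == "__main__":
    for key, value in map_records(sys.stdin.read()):
        print("%s\t%s" % (key, value))
```

With streaming, the record boundaries would be declared to the job (begin=<startingtag>, end=</startingtag>) via the -inputreader option, if I remember the syntax right, so each mapper only ever sees whole records.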

I hope I was clearer this time?

Vipul Sharma,
Cell: 281-217-0761

On Mon, Nov 2, 2009 at 4:00 PM, Amandeep Khurana <amansk@gmail.com> wrote:

> Are the xmls in flat files or stored in HBase?
> 1. If they are in flat files, you can use the StreamXmlRecordReader if that
> works for you.
> 2. Or you can read the xml into a single string and process it however you
> want. (This can be done whether it's in a flat file or stored in an hbase table.)
> I have xmls in an hbase table and parse and process them as strings.
> One mapper per file doesn't make sense. If it's in HBase, have one mapper per
> region. If they are flat files, you can create mappers depending on how many
> files you have. You can tune this for your particular requirement, and
> there is no "right" way to do it.
> On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma <sharmavipul@gmail.com>
> wrote:
> > I am working on a mapreduce application that will take input from lots of
> > small xml files rather than one big xml file. Each xml file has some
> > records that I want to parse before putting the data into an hbase table.
> > How should I go about parsing the xml files and feeding them to the map
> > functions? Should I have one mapper per xml file, or is there another way
> > of doing this? Thanks for your help and time.
> >
> > Regards,
> > Vipul Sharma,
> >
