hadoop-common-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: XML input to map function
Date Tue, 03 Nov 2009 00:58:15 GMT
On Mon, Nov 2, 2009 at 4:39 PM, Vipul Sharma <sharmavipul@gmail.com> wrote:

> Okay, I think I was not clear in my first post about the question. Let
> me try again.
>
> I have an application that gets a large number of xml files every
> minute, which are copied over to hdfs. Each file is around 1MB and
> contains several records. The files are well-formed xml, with a starting
> tag <startingtag> and an end tag </startingtag> in each file. I want to
> parse these files and put the relevant output data in hbase.
>
> Now, as an input to the map function, I can read all the unread files
> into a string and parse them inside the map function using DOM or
> something like that. But then how do I deal with the multiple starting
> tags <startingtag> and ending tags </startingtag> in the string, since
> we concatenated several files together? And how do I manage splits,
> since hadoop would want to split at the default block size, which might
> break the well-formed structure of the xml files?
>
>
So you have multiple xmls in a single file, and you have many such files.
In that case, the best answer is the StreamXmlRecordReader.
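
Something like this is roughly how you would wire it up with the old
"mapred" API (a sketch, not tested; the driver class name and argument
handling are placeholders). StreamXmlRecordReader ships in the
hadoop-streaming jar but can be used from a regular Java job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlJobDriver.class);
    // Hand every <startingtag>...</startingtag> block to the mapper as
    // one record. The record arrives as the key (a Text); the value is
    // empty.
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<startingtag>");
    conf.set("stream.recordreader.end", "</startingtag>");
    conf.setInputFormat(StreamInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // conf.setMapperClass(...) -- your parsing mapper goes here.
    JobClient.runJob(conf);
  }
}

One caveat: by default the reader does a plain text match on the
begin/end strings, so the tags have to appear in the files exactly as
configured.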

Or you can write your own InputFormat to create splits such that each
split is an xml file in itself, or each record in a split is a complete
xml message.
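
If you go that route, the simplest version is the whole-file pattern: an
input format that refuses to split files, plus a record reader that hands
the entire file to a single map() call. Your files are only ~1MB (well
under a block), so you lose nothing by not splitting them, and each map()
sees one complete, well-formed document. A rough, untested sketch against
the old "mapred" API:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeXmlFileInputFormat
    extends FileInputFormat<NullWritable, Text> {

  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never break a file across splits
  }

  public RecordReader<NullWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<NullWritable, Text> {
    private final FileSplit split;
    private final JobConf job;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(NullWritable key, Text value) throws IOException {
      if (done) return false;
      // Read the whole file into the value in one shot.
      byte[] contents = new byte[(int) split.getLength()];
      FileSystem fs = split.getPath().getFileSystem(job);
      FSDataInputStream in = fs.open(split.getPath());
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      value.set(contents, 0, contents.length);
      done = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return done ? split.getLength() : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
  }
}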

> Another way to go about it would be to have a for loop in the driver
> class and provide one file at a time. I don't think that is a good way,
> since the files are very small and we would get almost no
> parallelization.
>
> Is there a way that I can input a list or array of files to the map
> function and do the parsing inside the map function? How would I take
> care of the splits and the xml tags if I do that?
>
> I hope I was clearer this time?
>
> Regards,
> Vipul Sharma,
> Cell: 281-217-0761
>
>
> On Mon, Nov 2, 2009 at 4:00 PM, Amandeep Khurana <amansk@gmail.com> wrote:
>
> > Are the xmls in flat files or stored in Hbase?
> >
> > 1. If they are in flat files, you can use the StreamXmlRecordReader if
> > that works for you.
> >
> > 2. Or you can read the xml into a single string and process it however
> > you want. (This can be done whether it's in a flat file or stored in an
> > hbase table.) I have xmls in an hbase table and parse and process them
> > as strings.
> >
> > One mapper per file doesn't make sense. If the data is in HBase, have
> > one mapper per region. If they are flat files, you can create mappers
> > depending on how many files you have. You can tune this for your
> > particular requirement; there is no "right" way to do it.
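
To make the "process it as a string" option concrete: paired with a
whole-file input format like the sketch earlier in this mail, the mapper
just DOM-parses the value and writes a Put per record. Again, only an
illustration (untested; the "records" table, the "data:xml" column and
the row-key attribute are invented for the example):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.w3c.dom.Document;

public class XmlToHBaseMapper extends MapReduceBase
    implements Mapper<NullWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "records");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(NullWritable key, Text value,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(
              value.getBytes(), 0, value.getLength()));
      // Pull out whatever fields you need; the "id" attribute is made up.
      String rowKey = doc.getDocumentElement().getAttribute("id");
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("data"), Bytes.toBytes("xml"),
          Bytes.toBytes(value.toString()));
      table.put(put);
    } catch (Exception e) {
      throw new IOException("xml parse/put failed: " + e);
    }
  }

  public void close() throws IOException {
    table.flushCommits();  // nothing goes through the OutputCollector
  }
}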
> >
> > On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma <sharmavipul@gmail.com>
> > wrote:
> >
> > > I am working on a mapreduce application that will take input from
> > > lots of small xml files rather than one big xml file. Each xml file
> > > has some records that I want to parse and put into an hbase table.
> > > How should I go about parsing the xml files and feeding them to the
> > > map functions? Should I have one mapper per xml file, or is there
> > > another way of doing this? Thanks for your help and time.
> > >
> > > Regards,
> > > Vipul Sharma,
> >
>
