hadoop-mapreduce-user mailing list archives

From Bibek Paudel <eternalyo...@gmail.com>
Subject Re: question about processing XML file
Date Tue, 12 Oct 2010 11:48:31 GMT
On Fri, Oct 8, 2010 at 1:48 PM, Bibek Paudel <eternalyouth@gmail.com> wrote:
> Hi,
> I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
> single-node cluster. I have already run MapReduce programs for
> wordcount and for building an inverted index.
> I am trying to run the wordcount program on a Wikipedia dump. It is a
> single XML file where each line contains a Wikipedia page in the
> following format:
> <page> <title>Main Page</title> <text>Some text goes here.</text> </page>
> I want to do a wordcount of the text contained between the <text>
> and </text> tags. Please let me know the correct way of doing this.
> When I enter the following command, I get an error, although the jar
> file, the WordCount class and the input file all exist.
> $HADOOP_HOME/bin/hadoop jar WordCount.jar -inputformat
> "org.apache.hadoop.mapreduce.StreamInputFormat"
> -Dstream.recordreader.class=org.apache.hadoop.streaming.StreamXmlRecordReader
>  -inputreader "StreamXmlRecordReader,begin=<text>,end=</text>"
> WordCount wikixml wikixml-op2
> Error:
> -----------
> Exception in thread "main" java.lang.ClassNotFoundException: -inputformat
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
>        at java.lang.Class.forName0(Native Method)
>        at java.lang.Class.forName(Class.java:247)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
> What used to work:
> ----------------------------
> $HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2

The exception above happens because with "hadoop jar WordCount.jar ..."
the first argument after the jar must be the main class to run, so RunJar
tries to load a class literally named "-inputformat". Straight out of the
documentation, the following also works:

$HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -inputreader "StreamXmlRecordReader,begin=<text>,end=</text>" \
    -input wiki_head -output wiki_head_op \
    -mapper /bin/cat -reducer /usr/bin/wc
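For reference, the begin/end splitting that the command above asks
StreamXmlRecordReader to do can be mimicked in plain Java. This is a
standalone sketch, not the actual Hadoop class; the tag strings and the
sample line are taken from the messages above:

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordSplit {

    // Collect every substring that starts with `begin` and ends with `end`,
    // tags included -- roughly what StreamXmlRecordReader emits as one record.
    static List<String> split(String input, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(begin, from);
            if (start < 0) break;
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) break;
            records.add(input.substring(start, stop + end.length()));
            from = stop + end.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String line = "<page> <title>Main Page</title> "
                    + "<text>Some text goes here.</text> </page>";
        // prints [<text>Some text goes here.</text>]
        System.out.println(split(line, "<text>", "</text>"));
    }
}
```

Each extracted record is then what the mapper sees as its input value.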

What I am interested in doing is:
1. using my Java classes (written earlier for plain text files) as
mapper, reducer and driver;
2. if possible, passing configuration options, like the begin and end
tags of the XML, from inside my Java program itself;
3. if possible, specifying from inside the Java program that
StreamXmlRecordReader should be used.

Please let me know what I should read/do to solve these issues.
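On points 2 and 3: streaming reads the reader settings from the
stream.recordreader.class, stream.recordreader.begin and
stream.recordreader.end properties, so a Java driver can set them on its
Configuration object (conf.set(...)) instead of on the command line. Below
is a standalone sketch of that idea; the HashMap stands in for Hadoop's
Configuration, and the wordCount helper (which collapses the map and reduce
steps into one method) is illustrative, not Hadoop API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class DriverSketch {

    public static void main(String[] args) {
        // Stand-in for org.apache.hadoop.conf.Configuration: in a real
        // driver these would be conf.set(...) calls made before job submission.
        Map<String, String> conf = new HashMap<>();
        conf.put("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.put("stream.recordreader.begin", "<text>");
        conf.put("stream.recordreader.end", "</text>");

        String record = "<text>Some text goes here. Some more text.</text>";
        System.out.println(wordCount(record,
                conf.get("stream.recordreader.begin"),
                conf.get("stream.recordreader.end")));
    }

    // Count words between the begin/end tags -- the map and reduce logic
    // collapsed into one helper for illustration.
    static Map<String, Integer> wordCount(String record, String begin, String end) {
        String body = record.substring(begin.length(),
                                       record.length() - end.length());
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : body.trim().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }
}
```

In an actual driver the same keys would go on the job's Configuration
before submitting, and the mapper would receive one <text>...</text>
record per call.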


> Thanks for any help,
> Bibek
