I am not using multi-threaded Map tasks. Also, if I understand your second question correctly:

"Also can you try creating the output key and values in the map method (method local)?"

In the first code snippet I am doing exactly that.

Below is the class that runs the Job.

public class HadoopJobClient {

    private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());

    public static void main(String[] args) {
        JobConf conf = new JobConf(Prds.class);

        conf.set("xmlinput.start", "");
        conf.set("xmlinput.end", "");

        conf.setJobName("PRDS Parse");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(PrdsMapper.class);
        conf.setReducerClass(PrdsReducer.class);

        conf.setInputFormat(XmlInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Run the job
        try {
            JobClient.runJob(conf);
        } catch (IOException e) {
            LOGGER.error(e.getMessage(), e);
        }
    }
}


-----Original Message-----
From: Soumya Banerjee [mailto:soumya.sbanerjee@gmail.com]
Sent: Fri 7/16/2010 2:29 PM
To: general@hadoop.apache.org
Subject: Re: Hadoop and XML

Hi,

Can you please share the code of the job submission client?

Also can you try creating the output key and values in the map method (method local)?
Make sure you are not using multi-threaded map task configuration.

map()
{
    private Text keyText = new Text();
    private Text valueText = new Text();

    //rest of the code
}

Soumya.

On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <
Peter.Minearo@reardencommerce.com> wrote:

> I have an XML file that has sparse data in it. I am running a MapReduce
> Job that reads in an XML file, pulls out a Key from within the XML
> snippet and then hands back the Key and the XML snippet (as the Value)
> to the OutputCollector. The reason is to sort the file back into order.
> Below is the snippet of code.
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>     private Text keyText = new Text();
>     private Text valueText = new Text();
>
>     @SuppressWarnings("unchecked")
>     public void map(Object key, Object value, OutputCollector output,
>             Reporter reporter) throws IOException {
>         Text valueText = (Text)value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         getKeyText().set(keyString);
>         getValueText().set(valueString);
>         output.collect(getKeyText(), getValueText());
>     }
>
>
>     public Text getKeyText() {
>         return keyText;
>     }
>
>
>     public void setKeyText(Text keyText) {
>         this.keyText = keyText;
>     }
>
>
>     public Text getValueText() {
>         return valueText;
>     }
>
>
>     public void setValueText(Text valueText) {
>         this.valueText = valueText;
>     }
>
>
>     private String getXmlKey(String value) {
>         // Get the Key from the XML in the value.
>     }
>
> }
>
> The XML snippet from the Value is fine when it is passed into the map()
> method. I am not changing any data either, just pulling out information
> for the key. The problem I am seeing is between the Map phase and the
> Reduce phase, the XML is getting munged. For Example:
>
>
> te>
>
> It is my understanding that Hadoop uses the same instance of the Key and
> Value object when calling the Map method. What changes is the data
> within those instances. So, I ran an experiment where I do not have
> different Key or Value Text Objects.
> I reuse the ones passed into the
> method, like below:
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>     @SuppressWarnings("unchecked")
>     public void map(Object key, Object value, OutputCollector output,
>             Reporter reporter) throws IOException {
>         Text keyText = (Text)key;
>         Text valueText = (Text)value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         keyText.set(keyString);
>         valueText.set(valueString);
>         output.collect(keyText, valueText);
>     }
>
>
>     private String getXmlKey(String value) {
>         // Get the Key from the XML in the value.
>     }
>
> }
>
> What was interesting about this is the fact that the XML was getting
> munged within the Map Phase. When I changed over to the code at the
> top, the Map phase was fine. However, the Reduce phase picks up the
> munged XML. Trying to debug the problem, I came across this method in
> the Text Object:
>
> public void set(byte[] utf8, int start, int len) {
>     setCapacity(len, false);
>     System.arraycopy(utf8, start, bytes, 0, len);
>     this.length = len;
> }
>
> If the "bytes" array had a length of 1000 and the "utf8" array has a
> length of 500; doing a System.arraycopy() would only copy the first 500
> from "utf8" to "bytes" but leave the last 500 in "bytes" alone. Could
> this be the cause of the XML munging?
>
> All of this leads me to a few questions:
>
> 1) Has anyone successfully used XML snippets as the data format within a
> MapReduce job; not just reading from the file but used during the
> shuffle?
> 2) Is anyone seeing this problem with XML or any other format?
> 3) Does anyone know what is going on?
> 4) Is this a bug?
>
>
> Thanks,
>
> Peter
>
>
>
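
A note on the Text.set() question quoted above. The set() method records the new length, so the stale tail of the "bytes" array should only become visible to code that reads the whole backing array and ignores the length. As far as I know, Text.getBytes() returns that backing array and only the first getLength() bytes are valid, which would make new String(valueText.getBytes(), "UTF-8") in the mappers a likely source of the munging. The sketch below reproduces the effect in plain Java. ReusableText is a hypothetical stand-in written for this illustration, not Hadoop's Text class, and the XML strings are made up.

```java
import java.nio.charset.StandardCharsets;

public class ReusableBufferDemo {

    /** Hypothetical stand-in for a reusable buffer; NOT Hadoop's Text. */
    public static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Copies the new value in; the backing array grows but never shrinks. */
        public void set(String s) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (bytes.length < utf8.length) {
                bytes = new byte[utf8.length];
            }
            System.arraycopy(utf8, 0, bytes, 0, utf8.length);
            length = utf8.length;
        }

        /** Returns the backing array, which may be longer than the valid data. */
        public byte[] getBytes() { return bytes; }

        /** Number of valid bytes at the start of the backing array. */
        public int getLength() { return length; }
    }

    /** Decodes the entire backing array, including any stale tail. */
    public static String decodeWholeArray(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    /** Decodes only the valid range [0, getLength()). */
    public static String decodeValidRange(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText t = new ReusableText();
        t.set("<record><date>2010-07-16</date></record>"); // long value first
        t.set("<record/>");                                // shorter value reuses the buffer

        System.out.println(decodeWholeArray(t)); // <record/>date>2010-07-16</date></record>
        System.out.println(decodeValidRange(t)); // <record/>
    }
}
```

If the same thing is happening in the mappers, replacing the two-argument String constructor with valueText.toString(), or with new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8"), would be the first thing to try.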