hadoop-general mailing list archives

From "Peter Minearo" <Peter.Mine...@Reardencommerce.com>
Subject RE: Hadoop and XML
Date Fri, 16 Jul 2010 21:44:18 GMT

I am not using multi-threaded Map tasks.  Also, if I understand your second question correctly:
"Also can you try creating the output key and values in the map method (method local)?"
In the first code snippet I am doing exactly that.

Below is the class that runs the Job.

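import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

// Prds, PrdsMapper, PrdsReducer and XmlInputFormat are not shown in this message.
public class HadoopJobClient {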

	private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());
	
	public static void main(String[] args) {
		JobConf conf = new JobConf(Prds.class);
		
		conf.set("xmlinput.start", "<PrivateRateSet>");
		conf.set("xmlinput.end", "</PrivateRateSet>");
		
		conf.setJobName("PRDS Parse");

		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(Text.class);

		conf.setMapperClass(PrdsMapper.class);
		conf.setReducerClass(PrdsReducer.class);

		conf.setInputFormat(XmlInputFormat.class);
		conf.setOutputFormat(TextOutputFormat.class);

		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));

		// Run the job
		try {
			JobClient.runJob(conf);
		} catch (IOException e) {
			LOGGER.error(e.getMessage(), e);
		}

	}
	
	
}




-----Original Message-----
From: Soumya Banerjee [mailto:soumya.sbanerjee@gmail.com]
Sent: Fri 7/16/2010 2:29 PM
To: general@hadoop.apache.org
Subject: Re: Hadoop and XML
 
Hi,

Can you please share the code of the job submission client?

Also, can you try creating the output key and value in the map method
(method local)?
Make sure you are not using a multi-threaded map task configuration.

map()
{
// Local variables cannot be declared private, so the modifier is dropped.
Text keyText = new Text();
Text valueText = new Text();

//rest of the code
}

Soumya.

On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <
Peter.Minearo@reardencommerce.com> wrote:

> I have an XML file that has sparse data in it.  I am running a MapReduce
> Job that reads in an XML file, pulls out a Key from within the XML
> snippet and then hands back the Key and the XML snippet (as the Value)
> to the OutputCollector.  The reason is to sort the file back into order.
> Below is the snippet of code.
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>     private Text keyText = new Text();
>     private Text valueText = new Text();
>
>     @SuppressWarnings("unchecked")
>     public void map(Object key, Object value, OutputCollector output,
>             Reporter reporter) throws IOException {
>         Text valueText = (Text)value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         getKeyText().set(keyString);
>         getValueText().set(valueString);
>         output.collect(getKeyText(), getValueText());
>     }
>
>     public Text getKeyText() {
>         return keyText;
>     }
>
>     public void setKeyText(Text keyText) {
>         this.keyText = keyText;
>     }
>
>     public Text getValueText() {
>         return valueText;
>     }
>
>     public void setValueText(Text valueText) {
>         this.valueText = valueText;
>     }
>
>     private String getXmlKey(String value) {
>         // Get the Key from the XML in the value.
>     }
>
> }
>
> The XML snippet from the Value is fine when it is passed into the map()
> method.  I am not changing any data either, just pulling out information
> for the key.  The problem I am seeing is that, somewhere between the Map
> phase and the Reduce phase, the XML is getting munged.  For example:
>
>  </PrivateRate>
>  </PrivateRateSet>te>
>
> It is my understanding that Hadoop uses the same instance of the Key and
> Value object when calling the Map method.  What changes is the data
> within those instances.  So, I ran an experiment where I do not have
> different Key or Value Text Objects.  I reuse the ones passed into the
> method, like below:
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>     @SuppressWarnings("unchecked")
>     public void map(Object key, Object value, OutputCollector output,
>             Reporter reporter) throws IOException {
>         Text keyText = (Text)key;
>         Text valueText = (Text)value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         keyText.set(keyString);
>         valueText.set(valueString);
>         output.collect(keyText, valueText);
>     }
>
>     private String getXmlKey(String value) {
>         // Get the Key from the XML in the value.
>     }
>
> }
>
> What was interesting about this is the fact that the XML was getting
> munged within the Map Phase.  When I changed over to the code at the
> top, the Map phase was fine.  However, the Reduce phase picks up the
> munged XML.  Trying to debug the problem, I came across this method in
> the Text Object:
>
> public void set(byte[] utf8, int start, int len) {
>    setCapacity(len, false);
>    System.arraycopy(utf8, start, bytes, 0, len);
>    this.length = len;
> }
>
> If the "bytes" array has a length of 1000 and the "utf8" array has a
> length of 500, System.arraycopy() would copy only the 500 bytes from
> "utf8" into "bytes" and leave the last 500 bytes of "bytes" untouched.
> Could this be the cause of the XML munging?
>
> All of this leads me to a few questions:
>
> 1) Has anyone successfully used XML snippets as the data format within a
> MapReduce job; not just reading from the file but used during the
> shuffle?
> 2) Is anyone seeing this problem with XML or any other format?
> 3) Does anyone know what is going on?
> 4) Is this a bug?
>
>
> Thanks,
>
> Peter
>
>
>
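
The arraycopy question in the quoted message can be tried out without a cluster. Below is a minimal sketch, not Hadoop code: ReusableBuffer is a hypothetical stand-in that only copies the set() logic quoted above, and the two sample records are made up so that the leftover bytes are easy to see. It suggests one possible explanation for the munging: decoding the whole backing array (as new String(valueText.getBytes(), "UTF-8") does in both mappers) picks up stale bytes from a longer earlier record, while decoding only the first getLength() bytes does not. As far as I know, the real Text class documents that only the data up to getLength() is valid, and Text.toString() decodes just that range, but treat that as something to check against your Hadoop version.

```java
import java.nio.charset.StandardCharsets;

public class StaleBytesDemo {

    /** Hypothetical stand-in for a reused buffer; NOT the Hadoop Text class. */
    public static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        // Same shape as the set() quoted in the message:
        // grow only when needed, copy len bytes, record the valid length.
        public void set(byte[] utf8, int start, int len) {
            if (bytes.length < len) {
                bytes = new byte[len];
            }
            System.arraycopy(utf8, start, bytes, 0, len);
            this.length = len;
        }

        public void set(String s) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            set(utf8, 0, utf8.length);
        }

        public byte[] getBytes() { return bytes; }   // whole backing array
        public int getLength() { return length; }    // number of valid bytes
    }

    // Decodes every byte of the backing array, including stale ones.
    public static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    // Decodes only the bytes written by the most recent set().
    public static String decodeValidRange(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set("<Rate>1234567</Rate>");   // 20 bytes, made-up record
        buf.set("</PrivateRateSet>");      // 17 bytes, reuses the 20-byte array

        System.out.println(decodeWholeArray(buf));  // </PrivateRateSet>te>
        System.out.println(decodeValidRange(buf));  // </PrivateRateSet>
    }
}
```

If this is what is happening in the job, replacing new String(valueText.getBytes(), "UTF-8") with valueText.toString(), or with a String constructor that stops at valueText.getLength(), would be the first thing to try.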

