hadoop-general mailing list archives

From "Peter Minearo" <Peter.Mine...@Reardencommerce.com>
Subject Hadoop and XML
Date Fri, 16 Jul 2010 21:00:00 GMT
I have an XML file that has sparse data in it.  I am running a MapReduce
job that reads in the XML file, pulls a key out of each XML snippet, and
then hands the key and the XML snippet (as the value) to the
OutputCollector.  The goal is to sort the file back into order.  Below
is the code.
 
public class XmlMapper extends MapReduceBase implements Mapper {
 
 private Text keyText = new Text();
 private Text valueText = new Text();
 
 @SuppressWarnings("unchecked")
 public void map(Object key, Object value, OutputCollector output,
Reporter reporter) throws IOException {
  Text valueText = (Text)value;
  String valueString = new String(valueText.getBytes(), "UTF-8");
  String keyString = getXmlKey(valueString);
  getKeyText().set(keyString);
  getValueText().set(valueString);
  output.collect(getKeyText(), getValueText());
 }
 
 
 public Text getKeyText() {
  return keyText;
 }
 

 public void setKeyText(Text keyText) {
  this.keyText = keyText;
 }
 

 public Text getValueText() {
  return valueText;
 }
 

 public void setValueText(Text valueText) {
  this.valueText = valueText;
 }
 

 private String getXmlKey(String value) {
        // Get the Key from the XML in the value.
 }
 
}
 
The XML snippet in the value is intact when it is passed into the map()
method.  I am not changing any data either, just pulling out information
for the key.  The problem is that somewhere between the Map phase and
the Reduce phase the XML gets munged.  For example:
 
 </PrivateRate>
  </PrivateRateSet>te>
 
It is my understanding that Hadoop reuses the same Key and Value object
instances across calls to the map() method; only the data within those
instances changes.  So I ran an experiment in which I do not create
separate Key or Value Text objects and instead reuse the ones passed
into the method, like below:
 
public class XmlMapper extends MapReduceBase implements Mapper {
 
 @SuppressWarnings("unchecked")
 public void map(Object key, Object value, OutputCollector output,
Reporter reporter) throws IOException {
  Text keyText = (Text)key;
  Text valueText = (Text)value;
  String valueString = new String(valueText.getBytes(), "UTF-8");
  String keyString = getXmlKey(valueString);
  keyText.set(keyString);
  valueText.set(valueString);
  output.collect(keyText, valueText);
 }
 
 
 private String getXmlKey(String value) {
        // Get the Key from the XML in the value.
 }
 
}
 
What was interesting is that with this version the XML was getting
munged within the Map phase.  When I switched to the code at the top,
the Map phase was fine, but the Reduce phase picks up munged XML.  While
trying to debug the problem, I came across this method in the Text
class:
 
public void set(byte[] utf8, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(utf8, start, bytes, 0, len);
    this.length = len;
}
 
If the "bytes" array has a length of 1000 and the "utf8" array has a
length of 500, System.arraycopy() copies only the 500 bytes from "utf8"
into "bytes" and leaves the last 500 bytes of "bytes" untouched.  Could
this be the cause of the XML munging?
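
To make that scenario concrete, here is a minimal, self-contained sketch.
ReusableBuffer is a hypothetical stand-in that only mimics the set()
method quoted above (grow if needed, copy, record the length); it is not
Hadoop's Text class, and the sample strings are made up.  It compares
decoding the whole backing array, as my mapper does with
new String(valueText.getBytes(), "UTF-8"), against decoding only the
first getLength() bytes.

```java
import java.nio.charset.StandardCharsets;

public class ReusableBufferDemo {

    // Hypothetical stand-in for a reusable buffer; NOT Hadoop's Text class.
    static class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        // Mirrors the quoted set(): grow only when too small, copy, record length.
        void set(byte[] utf8, int start, int len) {
            if (bytes.length < len) {
                bytes = new byte[len];
            }
            System.arraycopy(utf8, start, bytes, 0, len);
            length = len;
        }

        void set(String s) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            set(utf8, 0, utf8.length);
        }

        byte[] getBytes() { return bytes; }   // whole backing array
        int getLength() { return length; }    // number of valid bytes
    }

    // Decodes the entire backing array, including any stale tail.
    static String readWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    // Decodes only the bytes written by the most recent set().
    static String readValidPrefix(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set("</PrivateRate>");  // 14 bytes, allocates the backing array
        buf.set("</Set>");          // 6 bytes, reuses the 14-byte array

        System.out.println(readWholeArray(buf));   // prints "</Set>ateRate>"
        System.out.println(readValidPrefix(buf));  // prints "</Set>"
    }
}
```

In this sketch the stale tail only shows up when the whole array is
decoded.  If Text behaves the same way, the corresponding change in the
mapper would be to decode only the first getLength() bytes of getBytes(),
or to call toString() on the Text value.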
 
All of this leads me to a few questions:
 
1) Has anyone successfully used XML snippets as the data format within a
MapReduce job, not just when reading from the file but also during the
shuffle?
2) Is anyone seeing this problem with XML or any other format?
3) Does anyone know what is going on?
4) Is this a bug?
 

Thanks,
 
Peter 
 
 
