From "David Campbell (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-3465) org.apache.hadoop.streaming.StreamXmlRecordReader
Date Thu, 29 May 2008 18:47:45 GMT

                 Key: HADOOP-3465
                 URL: https://issues.apache.org/jira/browse/HADOOP-3465
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.17.0
         Environment: java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Client VM (build 10.0-b22, mixed mode, sharing)

Linux hadoop-master #1 SMP Wed May 7 16:50:09 EDT 2008 i686 i686 i386 GNU/Linux

            Reporter: David Campbell
             Fix For: 0.17.0

I downloaded and installed the 0.17.0 version this morning.

I'm trying to use the StreamXmlRecordReader to parse a file that is formatted like this:

.....  many fields.

Each logical row has about 1,371 characters in it.

I have the following settings in my job.

 conf.set("stream.recordreader.begin", "<row>");
        conf.set("stream.recordreader.end", "</row>");
        conf.set("stream.recordreader.maxrec", "500000");

When I run my tests, the TaskTracker shows me a severely truncated row like this:

Processing record=<row>

I've tried adjusting the maxrec limit, but even the default should be (as I read the code) more
than big enough to handle the ~1,371 characters from <row> to </row>.

And, as you might expect, the XML parser in my Mapper task blows up because most of the content
between <row> and </row> is missing.
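
To make the failure mode concrete, here is a hypothetical mapper along the lines of what I'm doing (class name, generics, and parser choice are illustrative, not the actual code): handing a record that has been truncated to just "<row>" to a DOM parser fails with a premature-end-of-document error.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical sketch, not the actual job code.
public class RowMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        try {
            // Parse the <row>...</row> fragment delivered by the record reader.
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        value.toString().getBytes("UTF-8")));
        } catch (Exception e) {
            // With the truncated records described above, this is where it blows up.
            throw new IOException("Bad record: " + value, e);
        }
    }
}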

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
