hadoop-common-dev mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Re: What if an XML file cross boundary of HDFS chunks?
Date Fri, 30 Oct 2009 00:42:37 GMT
Hi Steve,

To read XML, you should provide a custom InputFormat that extends
FileInputFormat and override its isSplitable method so that a file is never
split. That way each XML file goes to exactly one mapper, and no mapper sees
a partial element.


  // Returning false tells the framework never to split this file.
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }
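
To see why an unsplit file avoids the problem, here is a rough sketch in plain
Java with no Hadoop dependency. It only simulates the effect of cutting a file
at fixed offsets; the class name SplitDemo, the helper methods, and the
20-character chunk size are all made up for illustration and are not part of
any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: simulates splitting, it does not use Hadoop.
public class SplitDemo {

    // Cut the input into fixed-size chunks, roughly the way a splittable
    // file is divided among mappers at arbitrary offsets.
    static List<String> chunks(String data, int chunkSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < data.length(); i += chunkSize) {
            out.add(data.substring(i, Math.min(data.length(), i + chunkSize)));
        }
        return out;
    }

    // Count the complete <book>...</book> records visible in one piece of text.
    static int completeRecords(String text) {
        int count = 0;
        int pos = 0;
        while (true) {
            int start = text.indexOf("<book>", pos);
            if (start < 0) break;
            int end = text.indexOf("</book>", start);
            if (end < 0) break; // the record was cut off by the chunk boundary
            count++;
            pos = end + "</book>".length();
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<title><book>book1</book><book>book2</book>"
                   + "<book>book3</book></title>";

        // Each chunk is inspected on its own, like a mapper on a split.
        int seenWhenSplit = 0;
        for (String chunk : chunks(xml, 20)) {
            seenWhenSplit += completeRecords(chunk);
        }
        System.out.println("complete records, split input:   " + seenWhenSplit);
        System.out.println("complete records, unsplit input: " + completeRecords(xml));
    }
}
```

With the input left whole, every record is found; once it is cut at arbitrary
offsets, records that straddle a boundary are lost to a reader that looks at
each chunk in isolation. Note that the isSplitable signature above is the one
from the older org.apache.hadoop.mapred API; in the newer
org.apache.hadoop.mapreduce API the first parameter is a JobContext instead of
a FileSystem.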



Best Regards,

Jeff zhang



On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <steve.gao@yahoo.com> wrote:

>
> Has anybody run into a similar issue? If you store XML files in HDFS, how
> can you make sure a chunk read by a mapper does not contain partial data of
> an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
> <author>me</author>
> <year>2009</year>
> <book>book3</book>
> <author>me</author>
> <year>2009</year>
> </title>
