hadoop-common-issues mailing list archives

From "Harsh J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7404) Data block splitting should be record-oriented, or an option should be provided to give the splitting locations (offsets) as an input file
Date Sun, 19 Jun 2011 19:33:47 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051722#comment-13051722 ]

Harsh J commented on HADOOP-7404:


(I've not worked with such files, so my analysis _may_ be wrong here -- do let me know if
it is)

I've found from experience that not every file format, as it exists in the wild, is directly suited to HDFS+MapReduce. For your case, how viable would it be to add transformation phases that push such files into HDFS and later pull them back out in a similar format?

For example, let's say you could transform such large files into chunks of smaller files (of, say, 2 GB each) holding as many whole records as each can accommodate, and then load those into HDFS (or do it as you stream, somehow -- you'll know your record end markers and the sizes read thus far).
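A rough sketch of that streaming variant, in case it helps (readRecord() here is a hypothetical stand-in for your format's record reader -- only you know its end markers):

{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordChunker {
    // Roll over to a new HDFS file once ~2 GB is written, but only
    // ever cut at a record boundary so no record straddles two files.
    private static final long CHUNK_TARGET = 2L * 1024 * 1024 * 1024;

    public static void chunkIntoHdfs(InputStream in, FileSystem fs, Path dir)
            throws IOException {
        int part = 0;
        long written = 0;
        FSDataOutputStream out = fs.create(new Path(dir, "part-" + part));
        byte[] record;
        while ((record = readRecord(in)) != null) {
            if (written > 0 && written + record.length > CHUNK_TARGET) {
                out.close(); // chunk full: start the next part file
                part++;
                written = 0;
                out = fs.create(new Path(dir, "part-" + part));
            }
            out.write(record);
            written += record.length;
        }
        out.close();
    }

    // Hypothetical: returns the next whole record, or null at end of stream.
    // You'd implement this using your format's record end markers.
    private static byte[] readRecord(InputStream in) throws IOException {
        throw new UnsupportedOperationException("format-specific");
    }
}
{code}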

These files can be created with large enough block sizes (2-4 GB is good enough I think; perhaps you could go beyond as well, but I'm not aware of many trying to go past the 8 GB block size mark, and we haven't tested that much). This way you achieve the locality you're looking for.
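Block size is a per-file property that can be set at create time; a minimal sketch, with illustrative paths and values:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeBlockCreate {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Pick a block size larger than the chunk, so each ~2 GB chunk
        // file sits entirely inside a single (local) block.
        long blockSize = 4L * 1024 * 1024 * 1024; // 4 GB
        FSDataOutputStream out = fs.create(
                new Path("/data/chunks/part-0"),          // illustrative path
                true,                                      // overwrite
                conf.getInt("io.file.buffer.size", 4096),  // buffer size
                (short) 3,                                 // replication
                blockSize);
        out.close();
    }
}
{code}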

Would splitting out your records from such large files into smaller chunks of records per
file not be a viable option here?

The issue with providing mechanisms to specify offsets is that you would have to maintain all the offsets in NameNode memory, instead of just a tuple of the number of blocks, the length of the file, and the block-size multiplier used. Not to mention, it could (if not made pluggable in design/implementation) also complicate things for people who do not mind non-record-based splitting. But there is merit here as well, as the other ticket mentions.
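For what it's worth, offset-driven, record-aligned splitting can already be prototyped entirely at the MapReduce layer, with no NameNode changes. A rough sketch, assuming a hypothetical <file>.offsets side-file convention (one ascending record-boundary offset per line) and leaving the format-specific RecordReader abstract:

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Splits each input file at caller-supplied offsets read from a side
// file, so no record is ever cut in half. Abstract because the
// RecordReader is format-specific.
public abstract class OffsetFileInputFormat
        extends FileInputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path data = status.getPath();
            // Hypothetical convention: "<file>.offsets" holds one ascending
            // record-boundary offset per line, produced by your other tools.
            Path offsets = data.suffix(".offsets");
            FileSystem fs = data.getFileSystem(job.getConfiguration());
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(offsets)));
            try {
                long prev = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    long off = Long.parseLong(line.trim());
                    splits.add(new FileSplit(data, prev, off - prev, null));
                    prev = off;
                }
                if (prev < status.getLen()) { // remainder after the last offset
                    splits.add(new FileSplit(data, prev, status.getLen() - prev, null));
                }
            } finally {
                reader.close();
            }
        }
        return splits;
    }
}
{code}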

> Data block splitting should be record-oriented, or an option should be provided to give the splitting locations (offsets) as an input file
> -------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-7404
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7404
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Sunil Goyal
> Old Bug :  https://issues.apache.org/jira/browse/HADOOP-106
> It is difficult to pad the existing records, for the following reasons:
> 1. Records have different sizes (some may be bytes, some may be GB) within the same file.
> 2. Padding causes compatibility issues with other standard tools.
> 3. Padding increases the file size without any benefit for other tools (those not running on Hadoop).
> I think there should be an option in this splitting process, like this:
> 1. A file contains the offsets at which the splits should be made (offsets like 10, 100, 120, ...).
> 2. Hadoop should do the splitting according to it (10 - 0 = 10, 100 - 10 = 90, etc.).
> 3. This offsets file can be generated easily by the other tools.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

