hadoop-common-issues mailing list archives

From "Harsh J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7404) Data Blocks Spliting should be record oriented or provided option for give the spliting locations (offsets) as input file
Date Sun, 19 Jun 2011 19:51:47 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051727#comment-13051727 ]

Harsh J commented on HADOOP-7404:
---------------------------------

Even if you go the padding route, you can add a step when exporting processed files out
of Hadoop that strips whatever would make the file inconsistent (the padding, for instance).
Sure, that wastes a fair amount of IO, but until a new feature arrives that eases this, it
would work well.

Also, have you considered or evaluated alternative filesystems for the same purpose? Hadoop
is designed to work across different filesystems as well!

> Data Blocks Spliting should be record oriented or provided option for give the spliting
> locations (offsets) as input file
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-7404
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7404
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Sunil Goyal
>
> Old bug: https://issues.apache.org/jira/browse/HADOOP-106
> Padding the existing records is difficult, for the following reasons:
> 1. Records in the same file have different sizes (some are a few bytes, some are GBs).
> 2. Padding causes compatibility issues with other standard tools.
> 3. Padding increases the file size, which other tools (those not working on Hadoop) have
> no need for.
> I think the splitting process should offer an option like this:
> 1. A file lists the offsets at which splitting should be done (e.g. 10, 100, 120, ...).
> 2. Hadoop does the splitting according to that file (10-0 = 10, 100-10 = 90, etc.).
> 3. Such a file can easily be generated by other tools.
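
The offset-file proposal quoted above can be sketched as follows. This is only an illustration
of the arithmetic in the description (10-0 = 10, 100-10 = 90), not Hadoop code: the function
name splits_from_offsets and the file length of 150 bytes are hypothetical, and the sketch
assumes the offsets are sorted and mark the byte position where each split ends.

```python
from typing import List, Tuple


def splits_from_offsets(offsets: List[int], file_length: int) -> List[Tuple[int, int]]:
    """Turn sorted split offsets into (start, length) ranges covering the file.

    Hypothetical helper: each offset is treated as the end of one split and
    the start of the next, as in the issue's example (10-0 = 10, 100-10 = 90).
    """
    splits = []
    start = 0
    for end in offsets:
        # Reject offsets that go backwards, repeat, or point past the file.
        if end <= start or end > file_length:
            raise ValueError(f"offset {end} is out of order or past end of file")
        splits.append((start, end - start))
        start = end
    # Any bytes after the last offset form a final split.
    if start < file_length:
        splits.append((start, file_length - start))
    return splits


# Offsets from the issue description, with an assumed file length of 150 bytes.
print(splits_from_offsets([10, 100, 120], 150))
# [(0, 10), (10, 90), (100, 20), (120, 30)]
```

An external tool that already knows the record boundaries would only need to write out the
offsets; the split lengths follow from subtracting neighbouring values.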

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
