hadoop-common-issues mailing list archives

From "Sunil Goyal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7404) Data Blocks Spliting should be record oriented or provided option for give the spliting locations (offsets) as input file
Date Sun, 19 Jun 2011 16:49:47 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051706#comment-13051706
] 

Sunil Goyal commented on HADOOP-7404:
-------------------------------------

Hello Harsh,

I am from the EDA (Electronic Design Automation) domain. Very large GDS files
(http://en.wikipedia.org/wiki/GDSII) of 20GB-100GB are common. GDS is the standard format
for data exchange, and all the standard tools (Cadence, Synopsys, Magma, and other free
tools) use it.

A GDS file contains records of varying sizes. It is difficult to insert padding into this
format, because standard tools would no longer be able to read it. It is an open standard
that is widely accepted by the industry, so it is difficult to change.

It is easy to dump out a separate file giving the locations of the record offsets. That
would make it possible to use Hadoop with formats of this type.

There are other formats like this in EDA (LEF, DEF, power optimization formats,
simulation result formats).

I think an option to split according to an offset file would be helpful in general. Users
of different types of applications know where their data can be split to get the most
benefit from parallelism.


> Data Blocks Spliting should be record oriented or provided option for give the spliting locations (offsets) as input file
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-7404
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7404
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Sunil Goyal
>
> Old bug: https://issues.apache.org/jira/browse/HADOOP-106
> It is difficult to pad the existing records, for the following reasons:
> 1. Records in the same file have different sizes (some a few bytes, some GBs).
> 2. Padding causes compatibility issues with other standard tools.
> 3. Padding increases the file size with no benefit to other tools (those not running on Hadoop).
> I think there should be an option for the splitting process, like this:
> 1. A file lists the offsets where splitting should be done (e.g. 10, 100, 120).
> 2. Hadoop does the splitting according to it (10-0 = 10, 100-10 = 90, etc.).
> 3. This file can be generated easily by other tools.
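
The arithmetic in the proposal above (10-0 = 10, 100-10 = 90, ...) can be sketched in a few lines. This is only an illustration of the idea, not Hadoop code: the function names `parse_offsets` and `offsets_to_splits`, the comma-separated offset format, and the 150-byte file size are all assumptions made for the example.

```python
from typing import List, Tuple


def parse_offsets(text: str) -> List[int]:
    """Parse a comma- or whitespace-separated list of byte offsets.

    The on-disk format of the offset file is an assumption; the proposal
    only gives the example "10,100,120".
    """
    return [int(tok) for tok in text.replace(",", " ").split()]


def offsets_to_splits(offsets: List[int], file_size: int) -> List[Tuple[int, int]]:
    """Turn increasing split offsets into (start, length) pairs covering the file."""
    splits = []
    start = 0
    for end in offsets:
        # Each offset must lie strictly after the previous one and inside the file.
        if end <= start or end > file_size:
            raise ValueError(f"offset {end} is out of order or past end of file")
        splits.append((start, end - start))
        start = end
    # Whatever remains after the last offset forms the final split.
    if start < file_size:
        splits.append((start, file_size - start))
    return splits


# Hypothetical 150-byte file with the offsets from the proposal.
splits = offsets_to_splits(parse_offsets("10,100,120"), file_size=150)
print(splits)  # [(0, 10), (10, 90), (100, 20), (120, 30)]
```

Each (start, length) pair corresponds to one record-aligned range that could be handed to a separate task, so no record is cut in the middle and the original file needs no padding.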

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
