hadoop-common-dev mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-3315) New binary file format
Date Thu, 29 Jan 2009 02:17:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668286#action_12668286 ]

hong.tang edited comment on HADOOP-3315 at 1/28/09 6:15 PM:
------------------------------------------------------------

bq.  Should probably be explicit about encoding in the below:
 public ByteArray(String str) {
 this(str.getBytes());

Good catch. This constructor looks like leftover code to facilitate testing that was never
properly cleaned up. I should remove it.
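
For reference, if such a constructor were kept rather than removed, the encoding could be named explicitly. The following is only a minimal sketch, assuming UTF-8 is the intended charset (the comment above does not say which one is wanted). `EncodingSketch` and `toUtf8` are hypothetical names and not part of the patch. `StandardCharsets` needs Java 7 or later; on older JVMs `str.getBytes("UTF-8")` is the equivalent call.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for illustration; this is not the ByteArray class from the patch.
final class EncodingSketch {
    /** Encodes with an explicitly named charset, independent of the JVM default. */
    static byte[] toUtf8(String str) {
        return str.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "caf\u00e9"; // the last character takes two bytes in UTF-8
        byte[] explicit = toUtf8(key);
        byte[] platform = key.getBytes(); // result depends on the platform default charset
        System.out.println("UTF-8 length: " + explicit.length);
        System.out.println("same as platform default: " + Arrays.equals(explicit, platform));
    }
}
```

With the no-argument `getBytes()`, the same string can produce different byte sequences on machines with different default charsets, which matters when the bytes are used as keys in a file.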

bq. Would be nice if we could easily pass an alternate implementation of BCFile, say one that
cached blocks.

Possible, but I think it is too early to make BCFile API public. Can you also elaborate on
the example you mentioned? Why do you need block caching instead of key-value caching (given
that TFile is based on <key, value> pairs)?

bq. Do you want to fix the below:

  // TODO: remember the longest key in a TFile, and use it to replace
  // MAX_KEY_SIZE.
  keyBuffer = new byte[MAX_KEY_SIZE];

bq. Default buffers of 64k for keys are a bit on the extravagant side.

Yes, it is an easy fix. I intend to do it later, once we have gathered more information
about which statistics to collect at file creation time, and then put them all in one meta
block. I don't see it as an urgent issue, though, except for applications that open
hundreds of files (or scanners) simultaneously.
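
To make the trade-off concrete, here is a rough sketch of the idea in the TODO: record the longest key while writing, then size the key buffer from that figure when reading. All names (`KeyBufferSketch`, `KeyStats`, `allocateKeyBuffer`) are hypothetical, and where the recorded length would be stored (for example in a meta block) is left open; only the 64k default comes from the discussion above.

```java
// Hypothetical sketch; none of these names exist in the TFile patch.
final class KeyBufferSketch {
    static final int MAX_KEY_SIZE = 64 * 1024; // the 64k default discussed above

    /** Tracks the longest key seen while a file is being written. */
    static final class KeyStats {
        private int longestKey = 0;

        void record(byte[] key) {
            longestKey = Math.max(longestKey, key.length);
        }

        int longestKey() {
            return longestKey;
        }
    }

    /** Sizes a key buffer from the recorded length; falls back to the maximum if unknown. */
    static byte[] allocateKeyBuffer(int recordedLongestKey) {
        int size = (recordedLongestKey > 0)
                ? Math.min(recordedLongestKey, MAX_KEY_SIZE)
                : MAX_KEY_SIZE;
        return new byte[size];
    }

    public static void main(String[] args) {
        KeyStats stats = new KeyStats();
        stats.record(new byte[12]);
        stats.record(new byte[40]);

        int scanners = 500; // illustrative count for "hundreds of scanners"
        long fixed = (long) scanners * MAX_KEY_SIZE;
        long sized = (long) scanners * allocateKeyBuffer(stats.longestKey()).length;
        System.out.println("fixed 64k buffers: " + fixed + " bytes");
        System.out.println("sized from stats:  " + sized + " bytes");
    }
}
```

The fallback to `MAX_KEY_SIZE` keeps files without recorded statistics readable, at the cost of the larger buffer.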

bq. Below should be public so users don't have to define their own: 
  
 protected final static String JCLASS = "jclass:";

Certainly. The same is possibly true of the symbolic names for the various compression algorithms.

bq. API seems to have changed since last patch. There is no longer a #find method. What's the suggested
way of accessing a random single key/value? (Open scanner using what would you suggest for
start and end? Then seekTo? But I find I'm making double ByteArray instances of same byte
array. Should there be a seekTo that takes a RawComparable that is public?).

Yes, the API has changed so that we no longer have to scan through a compressed block twice (as
happened when first getting a location object and then using it with the scanner). I'd suggest
doing random access as follows:

    Scanner scanner = reader.createScanner();
    ...
    if (scanner.seekTo(bytes, offset, length)) {
        Entry entry = scanner.entry();
        // access value through either entry.getValue or entry.writeValue 
    }
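
The seek-then-read pattern above can be tried out without Hadoop using a small in-memory stand-in. This sketch is not the TFile Scanner: `SeekSketch`, `append`, and `value` are hypothetical, and it assumes that `seekTo` positions the cursor at the first key greater than or equal to the requested one and returns true only on an exact match. It does show how passing (bytes, offset, length) lets a caller look up a key that sits inside a larger buffer without wrapping it in another object.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for illustration only; this is NOT the TFile Scanner.
final class SeekSketch {
    private final List<byte[]> keys = new ArrayList<>();   // kept in ascending unsigned-byte order
    private final List<byte[]> values = new ArrayList<>();
    private int cursor = 0;

    /** Appends an entry; callers must add keys in ascending order. */
    void append(byte[] key, byte[] value) {
        keys.add(key.clone());
        values.add(value.clone());
    }

    /** Compares a stored key with the slice buf[off, off + len) as unsigned bytes. */
    private static int compare(byte[] key, byte[] buf, int off, int len) {
        int n = Math.min(key.length, len);
        for (int i = 0; i < n; i++) {
            int d = (key[i] & 0xff) - (buf[off + i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return key.length - len;
    }

    /** Moves the cursor to the first key >= the slice; returns true on an exact match. */
    boolean seekTo(byte[] buf, int off, int len) {
        int lo = 0;
        int hi = keys.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(keys.get(mid), buf, off, len) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        cursor = lo;
        return lo < keys.size() && compare(keys.get(lo), buf, off, len) == 0;
    }

    /** Returns a copy of the value at the cursor; only valid after a successful seekTo. */
    byte[] value() {
        return values.get(cursor).clone();
    }

    public static void main(String[] args) {
        SeekSketch scanner = new SeekSketch();
        scanner.append(new byte[] {1}, new byte[] {10});
        scanner.append(new byte[] {3}, new byte[] {30});

        byte[] record = {9, 3, 9}; // the key is the single byte at offset 1
        if (scanner.seekTo(record, 1, 1)) {
            System.out.println("found value: " + scanner.value()[0]);
        }
    }
}
```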


> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch, HADOOP-3315_20080915_TFILE.patch,
> hadoop-trunk-tfile.patch, hadoop-trunk-tfile.patch, TFile Specification 20081217.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. It would be good to have a file format that only needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

