hadoop-common-dev mailing list archives

From "Srikanth Kakani (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-3315) New binary file format
Date Mon, 12 May 2008 20:29:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596184#action_12596184 ]

srikantk edited comment on HADOOP-3315 at 5/12/08 1:28 PM:
------------------------------------------------------------------

> Does RO stand for something, or is it short for "row"?
RO = Rowid Offset

> The RO entry values can be more compactly represented as differences from the prior entry.
> Is this intended? If so, we should state this.
I did not intend it that way, but we should do it. Since the RO index is always loaded into
memory, we can compute the entries while loading the index.
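
A minimal sketch of the delta idea, assuming each RO entry is a single non-decreasing long. The class and method names are hypothetical and not part of the spec; it only shows that absolute values can be rebuilt with a running sum while the index is loaded.

```java
// Hypothetical helper, not part of the TFile spec.
class RoIndexDelta {
    // Encode absolute, non-decreasing entries as differences from the prior entry.
    static long[] encode(long[] absolute) {
        long[] deltas = new long[absolute.length];
        long prev = 0;
        for (int i = 0; i < absolute.length; i++) {
            deltas[i] = absolute[i] - prev;
            prev = absolute[i];
        }
        return deltas;
    }

    // Rebuild the absolute entries with a running sum, as one would do at load time.
    static long[] decode(long[] deltas) {
        long[] absolute = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            absolute[i] = running;
        }
        return absolute;
    }
}
```

The differences are small numbers, so they would pair well with a variable-length integer encoding if one is used.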

> In data blocks, we might use something like <entryLength><keyLength><key><value>.
> This would permit one to skip entire entries more quickly. The valueLength can be computed
> as entryLength-keyLength. Do folks think this is worthwhile?

I think it should not matter much, but since this is the third time I have heard this request,
it seems like a good thing to do.
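
A rough sketch of the proposed entry layout, to show why skipping gets cheaper. Assumptions: both length fields are fixed 4-byte ints (the real format may well use vints), and entryLength counts only the key and value bytes, so that valueLength = entryLength - keyLength as in the quote. All names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of <entryLength><keyLength><key><value>, not the TFile on-disk format.
class EntryLayoutSketch {
    static final int LENGTH_FIELD_BYTES = 4; // assumption: fixed-width length fields

    static void writeEntry(DataOutputStream out, byte[] key, byte[] value) throws IOException {
        out.writeInt(key.length + value.length); // entryLength = keyLength + valueLength
        out.writeInt(key.length);                // keyLength
        out.write(key);
        out.write(value);
    }

    // Skip a whole entry after reading only entryLength.
    static void skipEntry(DataInputStream in) throws IOException {
        int entryLength = in.readInt();
        int remaining = LENGTH_FIELD_BYTES + entryLength; // keyLength field + key + value
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("truncated entry");
            }
            remaining -= skipped;
        }
    }

    // Read one entry and return it as "key=value" for easy inspection.
    static String readEntry(DataInputStream in) throws IOException {
        int entryLength = in.readInt();
        int keyLength = in.readInt();
        byte[] key = new byte[keyLength];
        in.readFully(key);
        byte[] value = new byte[entryLength - keyLength]; // valueLength is derived
        in.readFully(value);
        return new String(key, StandardCharsets.UTF_8) + "=" + new String(value, StandardCharsets.UTF_8);
    }

    // Demo: write two entries, skip the first, read the second.
    static String skipFirstThenRead() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeEntry(out, "k1".getBytes(StandardCharsets.UTF_8), "first-value".getBytes(StandardCharsets.UTF_8));
            writeEntry(out, "k2".getBytes(StandardCharsets.UTF_8), "v2".getBytes(StandardCharsets.UTF_8));
            out.flush();
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            skipEntry(in);
            return readEntry(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a <keyLength><key><valueLength><value> layout, a reader would have to read the key length, skip the key, and then read the value length before it could skip the value; here one length read is enough.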

> Owen O'Malley - 12/May/08 11:44 AM

> getClosest should specify that it means closest after the given key
Will update the document.
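
To illustrate the "closest after" semantics with an in-memory sorted map: this sketch assumes the match is inclusive (the smallest stored key greater than or equal to the given key), which the document would still need to confirm. The class and method names are hypothetical.

```java
import java.util.TreeMap;

// Hypothetical illustration of "closest after the given key", not the Reader API.
class ClosestAfterSketch {
    // Returns the smallest stored key >= probe, or null if every stored key sorts before probe.
    static String closestAfter(TreeMap<String, String> entries, String probe) {
        return entries.ceilingKey(probe);
    }
}
```

If an exact match should not count, `TreeMap.higherKey` gives the strict variant.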

> Doug Cutting - 12/May/08 10:53 AM
> row ids should be longs
There is an inconsistency in the Read API. Will fix that.

> does the compression of the key index follow tfile.compressionCodec?
I believe so, I will explicitly state that in the document.

> should the ioffset and moffset (or my key, row, and meta offsets) be vints?
Probably not, since we will need to store the offsets for these somewhere.

> I think the append method that takes an input stream should be:
{code}
void appendRaw(int keyLength, InputStream key, int valueLength, InputStream value) throws IOException;
{code}
Keys are supposed to be loadable into memory, with a maximum length of 64K. I am not sure
whether this interface will be or should be used. We may want to keep that for later.

> Most of the methods should have "throws IOException"
Will update the spec with this.

> It is useful to be able to get the key/value class names without the class. I'd replace
> the getKeyClass and getValueClass with string equivalents:
I think getMeta should take care of it.

> stack - 12/May/08 12:15 PM
> In the description of the blockidx, says 'Spare index of keys into the datablocks'. Whats
> this mean? The key that is at the start of each block will be in the block index? And only
> this? Or will index have entries keys from the middle of blocks in it?
It should have been a __sparse__ index. It currently means that the key at the start of
each block will be in the block index. The minBlock size should be adjusted to handle the
"sparseness".
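
A small sketch of how such a sparse index would be used for lookups, assuming String keys and one index entry per block holding that block's first key and its file offset. The names are hypothetical; the real blockidx layout is whatever the spec defines.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sparse block index: only the first key of each block is indexed.
class SparseBlockIndexSketch {
    private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

    void addBlock(String firstKey, long blockOffset) {
        firstKeyToOffset.put(firstKey, blockOffset);
    }

    // Returns the offset of the only block that could contain the key,
    // or -1 if the key sorts before the first key of the first block.
    long blockOffsetFor(String key) {
        Map.Entry<String, Long> entry = firstKeyToOffset.floorEntry(key);
        return entry == null ? -1L : entry.getValue();
    }
}
```

The index only narrows the search to one block; the reader still has to scan inside that block, which is why the block size controls how sparse the index is.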

> In 'Goals', says 'support all kinds of columns'? Do you mean all column data types? Also
> says 'support seek to key and seek to row'. What is the difference between a key and a row?
Kinds: keyless columns, keyed columns, valueless columns, fixed size, and variable size. Maybe
I should list these in the Goals section.

> Its not plain that user can add their own metadata to imeta. You might explicitly state
> this.
Will do.

> In the Writer API, you state that a null key class is for a keyless column. Whats a null
> value class imply?
A null value class implies a valueless column. A keyless column has an implication for the
keyidx, viz. it compresses excellently, so I thought it was worth mentioning.

> Doug Cutting - 12/May/08 12:37 PM
> I'd vote for supporting both String and byte[] as metadata values, perhaps with methods
> like:
{code}
Writer#setMeta(String key, String value);
Writer#setMetaBytes(String key, byte[] value);

String Reader#getMeta(String key);
byte[] Reader#getMetaBytes(String key);
{code}

I think it makes sense to add these (as opposed to having them in the constructor). The
getMeta functions are a no-brainer.
The only problem I see is when to allow the setMeta functions: setting things like
compression and KeyClass names through them may be problematic.
So maybe we allow the TFile.* keys to be set only through the constructor, and the rest of
the user variables to be set via setMeta(??).
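
One way the constructor-only rule could look, as a sketch. Assumptions: reserved keys share a "TFile." prefix, the compression setting lives under a key named "TFile.compressionCodec", and setMeta rejects reserved keys with an IllegalArgumentException. None of this is settled API; the class below is hypothetical and only models the metadata map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of writer-side metadata, not the TFile Writer.
class MetaSketch {
    static final String RESERVED_PREFIX = "TFile."; // assumption: reserved key namespace

    private final Map<String, String> meta = new HashMap<>();

    // Reserved settings are fixed at construction time.
    MetaSketch(String compressionCodec) {
        meta.put(RESERVED_PREFIX + "compressionCodec", compressionCodec);
    }

    // User metadata may be added later, but reserved keys are rejected.
    void setMeta(String key, String value) {
        if (key.startsWith(RESERVED_PREFIX)) {
            throw new IllegalArgumentException("reserved key, set it in the constructor: " + key);
        }
        meta.put(key, value);
    }

    // Reads work the same for reserved and user keys.
    String getMeta(String key) {
        return meta.get(key);
    }
}
```

Readers can still see the reserved values through getMeta, which would also cover getting the key/value class names as strings.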





  
> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Srikanth Kakani
>         Attachments: Tfile-1.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
or decompress. It would be good to have a file format that only needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

