hadoop-common-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6708) New file format for very large records
Date Fri, 16 Apr 2010 00:10:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857609#action_12857609 ]

Aaron Kimball commented on HADOOP-6708:


bq. * Length fields are encoded as integers, not longs. This does not support records >
2 GB.
bq. This is an intentional restriction. All integers are in VInt/VLong format which is fully
wire compatible. You can easily make a case to request such limit be lifted.

So does this mean the TFile API could be changed to accept and return {{long}} values
without complication? The TFile spec mentions the 2 GB value limit in several places, which
suggests that other parts of TFile may depend on the assumed integer size and could break
if it changes.
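
To make the "wire compatible" point concrete, here is a minimal sketch of a variable-length
integer encoding. It is a generic base-128 varint written for illustration only; the class
and method names are hypothetical, and the byte layout is an assumption that may differ from
the one Hadoop's VInt/VLong actually uses. The property it shows is that small values occupy
the same bytes whether the caller thinks of them as {{int}} or {{long}}, so widening the type
does not change how existing lengths are stored.

```java
import java.io.ByteArrayOutputStream;

public class VarLen {
    /** Encodes a non-negative value, 7 bits per byte, low-order group first. */
    public static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("lengths are non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits at a time; the high bit marks "more bytes follow".
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    public static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }

    public static void main(String[] args) {
        long small = 300L;                    // fits in an int
        long big = 3L * 1024 * 1024 * 1024;   // 3 GiB, does not fit in an int
        System.out.println(encode(small).length + " bytes -> " + decode(encode(small)));
        System.out.println(encode(big).length + " bytes -> " + decode(encode(big)));
    }
}
```

An on-disk encoding like this can represent lengths above 2 GB, but that says nothing about
whether in-memory APIs that take or return {{int}} can handle them, which is the open question
here.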

bq. Even if you do not know the length of the record you write (namely specifying -1 during
writing), you can still efficiently skip a record (even after partially consuming some bytes
of the record). Isn't it sufficient for your case? Searching for a synchronization boundary
is very inefficient than length-prefixed encoding.

Data comes to me from JDBC through an InputStream or a Reader whose length I do not know
in advance. I read from that InputStream/Reader and write its contents to an OutputStream/Writer
that dumps into a file (LobFile). When the source is a character-based Reader, I know
how many characters I have, which is a lower bound on the number of bytes but not an exact
count. So my plan was to seek ahead by that many bytes, then search for the boundary. Assuming
most characters are one byte, the search will be pretty quick.
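
The seek-then-search plan can be sketched as follows. This is an illustration under stated
assumptions, not the LobFile implementation: the class name, the four-byte marker, and the
in-memory byte array standing in for the file are all hypothetical. The marker is chosen so
that it cannot occur inside valid UTF-8 text; a real format would also need a marker (or an
escaping scheme) that is safe for binary data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BoundarySearch {
    // Hypothetical record-boundary marker; not a valid UTF-8 byte sequence.
    public static final byte[] MARKER = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};

    /**
     * Skips charCount bytes past recordStart (the character count is a lower
     * bound on the encoded byte length), then scans forward for the marker.
     * Returns the marker's offset, or -1 if it is not found.
     */
    public static int findBoundary(byte[] file, int recordStart, int charCount, byte[] marker) {
        int from = recordStart + charCount;
        for (int i = from; i + marker.length <= file.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (file[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Builds a one-record "file": the UTF-8 bytes of text followed by the marker. */
    public static byte[] recordWithMarker(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = Arrays.copyOf(body, body.length + MARKER.length);
        System.arraycopy(MARKER, 0, out, body.length, MARKER.length);
        return out;
    }

    public static void main(String[] args) {
        String text = "na\u00efve caf\u00e9"; // 10 chars; two of them need 2 bytes in UTF-8
        byte[] file = recordWithMarker(text);
        int boundary = findBoundary(file, 0, text.length(), MARKER);
        System.out.println("chars=" + text.length() + " boundary=" + boundary);
    }
}
```

The scan covers only the gap between the character count and the true byte length, so for
mostly single-byte text it touches very few bytes.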

How does TFile support length skipping if you don't pre-declare the lengths?

> New file format for very large records
> --------------------------------------
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
> A file format that handles multi-gigabyte records efficiently, with lazy disk access

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

