hadoop-common-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6708) New file format for very large records
Date Thu, 22 Apr 2010 21:32:53 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860026#action_12860026 ]

Aaron Kimball commented on HADOOP-6708:

| But that schema can be just "bytes".

Of course. And Sqoop would use such a file in this manner. But in building a feature into
Avro's file format, would it be possible to include a {{getRecordAsByteStream()}} / {{getRecordAsCharStream()}}
that makes sense in the context of a file where many underlying schemata don't necessarily
make sense in a byte-wise form?
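
For illustration only, here is a rough sketch of what such a pair of accessors might look like. The {{LargeRecord}} interface and the {{InMemoryLargeRecord}} class are hypothetical names invented for this example; they are not part of Avro, Hadoop, or Sqoop. The in-memory backing is only a stand-in, since a real very-large-record format would presumably stream from disk lazily, and the UTF-8 decoding is likewise an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical accessor pair; not an existing Avro, Hadoop, or Sqoop API. */
interface LargeRecord {
    /** Exposes the record as raw bytes without materializing it for the caller. */
    InputStream getRecordAsByteStream() throws IOException;

    /** Exposes the same record as decoded characters. */
    Reader getRecordAsCharStream() throws IOException;
}

/** In-memory stand-in used only to keep the sketch self-contained. */
final class InMemoryLargeRecord implements LargeRecord {
    private final byte[] payload;

    InMemoryLargeRecord(byte[] payload) {
        // Defensive copy so later changes to the caller's array are not visible here.
        this.payload = payload.clone();
    }

    @Override
    public InputStream getRecordAsByteStream() {
        return new ByteArrayInputStream(payload);
    }

    @Override
    public Reader getRecordAsCharStream() {
        // Assumption: character data is UTF-8 encoded.
        return new InputStreamReader(getRecordAsByteStream(), StandardCharsets.UTF_8);
    }
}
```

A caller would read from either stream incrementally, which is what keeps a multi-gigabyte record from having to fit in memory at once.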

| True. Would this be hard to add?

:smile: You'd be in a better position than I to comment on that.

As for the common/sqoop question: I have written prototype code that provides this file format
in Sqoop itself, but I haven't pushed it out yet. If it's infeasible to add this to Hadoop
common, then I'll continue to polish that prototype code and just include it directly in Sqoop.
However, in discussion with other engineers, it's come up that such a very-large-record format
may have broader applications than just Sqoop. Furthermore, people will want to use inputformats,
etc., that operate over these records. Folks could link against Sqoop's jar to get to these
file formats and InputFormat classes, but that's One More Dependency that they may not want
to manage. Given that these records are just byte or character streams, it doesn't seem necessary
to restrict it to just Sqoop. Also, the ability to expand the scope of an existing format to
encapsulate these records could lower maintenance costs over time for clients who are storing
data in this format.

> New file format for very large records
> --------------------------------------
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
> A file format that handles multi-gigabyte records efficiently, with lazy disk access

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
