hadoop-common-user mailing list archives
From "Bryan A. P. Pendleton" ...@geekdom.net>
Subject Re: SequenceFile (Text,Text) becomes plain text
Date Sat, 03 Feb 2007 02:08:06 GMT
Yes, it would be nice to fix that at some point.

Possibly a shadow file that keeps track of the offset of each key/value in
the file (probably using VInt-encoded difference-from-last-value). The
existing output would be preserved, but someone reading the file could use
such a "cheat sheet" to reconstitute the proper key/value sets, without
having to do any unescaping of tabs or newlines. And normal text tools could
still do something, albeit possibly led astray by extra tabs or newlines in
the data.
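
A rough sketch of the shadow-file idea, in Python for brevity. Everything here is
an assumption for illustration: the function names are made up, the index stores
two offsets per record (key start and value start), and the varint is a generic
LEB128-style encoding standing in for Hadoop's VInt, not its exact byte layout.

```python
import io


def write_varint(out, n):
    """Write a non-negative int as a LEB128-style varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_varint(inp):
    """Read one varint; return None at a clean end of stream."""
    shift = 0
    result = 0
    while True:
        b = inp.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7


def write_with_index(records):
    """Return (data, index): unescaped key\\tvalue\\n text plus the shadow index."""
    data, index = io.BytesIO(), io.BytesIO()
    last = 0
    for key, value in records:
        key_start = data.tell()
        data.write(key + b"\t")
        value_start = data.tell()
        data.write(value + b"\n")
        # Store each offset as the difference from the previous offset.
        for offset in (key_start, value_start):
            write_varint(index, offset - last)
            last = offset
    return data.getvalue(), index.getvalue()


def read_with_index(data, index):
    """Rebuild the exact (key, value) pairs using the shadow index."""
    inp = io.BytesIO(index)
    offsets = []
    last = 0
    while True:
        delta = read_varint(inp)
        if delta is None:
            break
        last += delta
        offsets.append(last)
    offsets.append(len(data))  # sentinel: one past the final newline
    records = []
    for i in range(0, len(offsets) - 1, 2):
        key_start, value_start, next_start = offsets[i:i + 3]
        # Each field is followed by exactly one separator byte.
        records.append((data[key_start:value_start - 1],
                        data[value_start:next_start - 1]))
    return records
```

The data stream is byte-for-byte what the plain text output would be, so grep and
friends still work on it; only a reader that wants exact keys and values needs to
open the index.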

On 2/2/07, Owen O'Malley <owen@yahoo-inc.com> wrote:
>
>
> On Feb 2, 2007, at 2:46 PM, Bryan A. P. Pendleton wrote:
>
> > Note that, unless there are no tab characters in the keys of the output
> > from the first job, there's no way to read the existing output accurately
> > back in.
>
> *Sigh* That asymmetry in Text{In,Out}putFormat has bothered me for a
> while now. I think at some point, we should do a
> TabText{In,Out}putFormat that looks like:
>
> <key>\t<value>\n with tabs and newlines escaped in the keys and values.
>
> That will give us a symmetric set of text formats. Furthermore, I'd
> say that if value == NULL, the tab should be left off.
>
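
For comparison, the escaped format quoted above could look roughly like this.
This is a Python sketch and not an existing Hadoop class. The helper names are
hypothetical, and escaping the backslash itself is an added assumption, since
without it a literal backslash followed by "t" could not be told apart from an
escaped tab.

```python
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n"}


def escape(s):
    """Replace backslash, tab and newline with two-character escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in s)


def unescape(s):
    """Invert escape(); reject unknown or dangling escapes."""
    out = []
    chars = iter(s)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            if nxt not in _UNESCAPES:
                raise ValueError("bad escape sequence in %r" % s)
            out.append(_UNESCAPES[nxt])
        else:
            out.append(ch)
    return "".join(out)


def format_line(key, value=None):
    """<key>\\t<value>\\n, with the tab left off when value is None."""
    if value is None:
        return escape(key) + "\n"
    return escape(key) + "\t" + escape(value) + "\n"


def parse_line(line):
    """Return (key, value); value is None when the line has no tab."""
    body = line[:-1] if line.endswith("\n") else line
    # After escaping, the first raw tab can only be the separator.
    key, sep, value = body.partition("\t")
    return unescape(key), (unescape(value) if sep else None)
```

With this, a null value ("k\n") and an empty value ("k\t\n") stay distinct, and
whatever the output side writes, the input side reads back unchanged.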



-- 
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
