hadoop-common-dev mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Best symbol for spliting line
Date Wed, 13 Aug 2008 20:56:41 GMT
Character-separated values (CSV in the general sense) work fairly well, but you
immediately run into questions of quoting, and there isn't a lot of really good
support for it.  Whether the separator is a comma, tab, pipe or space doesn't
much matter; you still have to deal with fields that contain the separator.
With comma separation, fields are usually wrapped in double quotes, with
embedded double quotes escaped either by doubling them or with a backslash.
With tab delimiters, an embedded tab is commonly written as \t.  With space
delimiting, as in many log file formats, URL encoding is commonly used.
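
To make the three conventions concrete, here is a rough Python sketch (not
Hadoop code).  The helper names are made up for illustration, and the exact
escaping rules (which characters get a backslash escape, encoding every
reserved character in the space-delimited case) are my assumptions, not a
standard:

```python
import csv
import io
import re
from urllib.parse import quote, unquote


def to_csv_line(fields):
    """Comma-separated; the csv module quotes fields and doubles embedded quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()


def from_csv_line(line):
    return next(csv.reader([line]))


def to_tsv_line(fields):
    """Tab-separated; backslash, tab and newline are written as \\\\, \\t, \\n."""
    def esc(field):
        # Escape the backslash first so later escapes stay unambiguous.
        return (field.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n"))
    return "\t".join(esc(f) for f in fields)


def from_tsv_line(line):
    table = {"t": "\t", "n": "\n", "\\": "\\"}

    def unesc(field):
        # Scan left to right so an escaped backslash is never re-read.
        return re.sub(r"\\(.)", lambda m: table.get(m.group(1), m.group(1)), field)
    return [unesc(f) for f in line.split("\t")]


def to_space_line(fields):
    """Space-separated; each field is URL-encoded, so a space becomes %20."""
    return " ".join(quote(f, safe="") for f in fields)


def from_space_line(line):
    return [unquote(f) for f in line.split(" ")]
```

Each pair round-trips a list of strings through one line of text.  Note that
every variant needs its own matching decoder, which is exactly the quoting
burden described above.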

I personally think that strings are frail vessels for data and recommend
that you consider Thrift or JSON.
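
As a rough illustration of the JSON route, the sketch below writes one JSON
object per line using Python's standard json module.  The field names src,
dest and attribute are taken from the question quoted below; the helper names
and the one-object-per-line layout are assumptions for the example:

```python
import json


def encode_record(src, dest, attribute):
    """Serialize one record as a single line of JSON text."""
    # json.dumps escapes quotes, tabs and newlines inside strings, so the
    # result never contains a raw line break.
    return json.dumps({"src": src, "dest": dest, "attribute": attribute})


def decode_record(line):
    """Parse one line back into a (src, dest, attribute) tuple."""
    obj = json.loads(line)
    return obj["src"], obj["dest"], obj["attribute"]
```

Because the serializer handles escaping, spaces, pipes and tabs inside a URL
or attribute need no special treatment, and a mapper can simply call
decode_record on each input line.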

On Wed, Aug 13, 2008 at 7:26 AM, Kunsheng Chen <keyek@yahoo.com> wrote:

> I am modifying WordCount to replace the Key and Value with URLs. I also hope
> to add an attribute to each pair of them, so generally I have something like
> this in a file:
> src      dest     attribute
> Could anyone give me some idea of the best symbol for splitting the
> whole line? I am not going to use space, since the attribute may need it.
> I am also thinking of using '|', but it might be confused with URLs.
> Also, would a combination of symbols work?
> Any ideas are appreciated.
> Thanks!
> Kun

