hadoop-common-dev mailing list archives

From "Jay Kreps" <jay.kr...@gmail.com>
Subject Re: Serialization with additional schema info
Date Thu, 04 Sep 2008 14:35:39 GMT
Yes, I mean this is just the trade-off between structured and
unstructured data.  In my case 99% of my data sources are structured.
So if I am expecting List<String> and get List<Integer> then something
is broken and I want to catch the bug before someone writes the bad
data.  I agree that, in principle, a compression algorithm should be able
to give me comparable compactness at some CPU cost.
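
To make the "catch the bug before someone writes the bad data" point concrete, here is a minimal sketch in Python (not the code under discussion in this thread). The helper names `check_list_of` and `write_record` are hypothetical, and the in-memory list stands in for whatever store would actually receive the data.

```python
import json


def check_list_of(values, expected_type):
    """Raise TypeError unless values is a list whose elements all have expected_type."""
    if not isinstance(values, list):
        raise TypeError(f"expected a list, got {type(values).__name__}")
    for i, v in enumerate(values):
        if not isinstance(v, expected_type):
            raise TypeError(
                f"element {i} is {type(v).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return values


def write_record(out, values, expected_type):
    """Validate first, then append the serialized record to out (a stand-in for a file)."""
    check_list_of(values, expected_type)
    out.append(json.dumps(values))


if __name__ == "__main__":
    store = []
    write_record(store, ["a", "b"], str)   # matches the declared element type
    try:
        write_record(store, [1, 2], str)   # List<Integer> where List<String> is expected
    except TypeError as err:
        print("rejected before write:", err)
```

The check runs before anything is appended, so a mismatched record never reaches the store.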


---------- Forwarded message ----------
From: "Ted Dunning" <ted.dunning@gmail.com>
To: core-dev@hadoop.apache.org
Date: Wed, 3 Sep 2008 21:24:00 -0700
Subject: Re: Serialization with additional schema info
I talked to the IBM guys about this problem with JSON-like formats.

Their answer was that if you care enough, then any compression algorithm
around will compress away the type information.

So if you have a splittable compressed format (bz2 works with hadoop), you
are set except for the compression cost.  Decompression cost is usually
compensated for by the I/O advantage.
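
A rough way to check this claim on toy data, sketched in Python with the standard `bz2` module. The record layout, field names, and record count are assumptions made up for illustration; real data may behave differently.

```python
import bz2
import json

# Toy data set; the fields and the count are illustrative assumptions.
records = [
    {"user_id": i, "name": f"user{i}", "score": (i * 7) % 100}
    for i in range(5000)
]

# Self-describing encoding: every line repeats the field names.
tagged = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Schema-free encoding: values only, in a fixed field order.
bare = "\n".join(
    f"{r['user_id']},{r['name']},{r['score']}" for r in records
).encode("utf-8")

# Size overhead of the self-describing form, before and after compression.
raw_ratio = len(tagged) / len(bare)
bz2_ratio = len(bz2.compress(tagged)) / len(bz2.compress(bare))

print(f"uncompressed overhead: {raw_ratio:.2f}x")
print(f"bz2-compressed overhead: {bz2_ratio:.2f}x")
```

If the repeated field names really are compressed away, the second ratio should come out much closer to 1 than the first.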

On Wed, Sep 3, 2008 at 3:52 PM, Jay Kreps <jay.kreps@gmail.com> wrote:

> ...
> Thanks for the pointer to jaql, that seems very cool, but I believe
> jaql would have the same problem if they tried to implement any kind
> of compact structured storage.  Jaql would return a JArray or JRecord
> which might have a variety of fields and you would want to store the
> data about what kinds of fields separately.
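
One way to picture storing "what kinds of fields" separately from the data, as a minimal Python sketch. The functions `split_schema` and `join_schema` are hypothetical names and are not part of jaql or Hadoop; the sketch assumes every record has the same fields as the first one.

```python
def split_schema(records):
    """Return (schema, rows): field names and type names once, values positionally."""
    fields = sorted(records[0])
    schema = [(f, type(records[0][f]).__name__) for f in fields]
    rows = []
    for rec in records:
        if sorted(rec) != fields:
            raise ValueError(f"record fields {sorted(rec)} do not match {fields}")
        rows.append([rec[f] for f in fields])
    return schema, rows


def join_schema(schema, rows):
    """Rebuild the original records from the stored schema and the positional rows."""
    names = [name for name, _ in schema]
    return [dict(zip(names, row)) for row in rows]


if __name__ == "__main__":
    data = [{"name": "a", "age": 1}, {"name": "b", "age": 2}]
    schema, rows = split_schema(data)
    print(schema)
    print(rows)
    print(join_schema(schema, rows) == data)
```

The rows carry no field names at all, so the schema has to be kept alongside them for the data to remain readable.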
