Message-ID: <83290b460809040735m1bb1f315l5b3ae61551382406@mail.gmail.com>
Date: Thu, 4 Sep 2008 07:35:39 -0700
From: "Jay Kreps"
To: ted.dunning@gmail.com, core-dev@hadoop.apache.org
Subject: Re: Serialization with additional schema info

Yes, I mean this is just the trade-off between structured and
unstructured data. In my case 99% of my data sources are structured.
So if I am expecting List and get List then something is broken and I
want to catch the bug before someone writes the bad data. I agree that
in principle a compression algorithm should be able to give me
comparable compactness with some CPU trade-off.

-Jay

---------- Forwarded message ----------
From: "Ted Dunning"
To: core-dev@hadoop.apache.org
Date: Wed, 3 Sep 2008 21:24:00 -0700
Subject: Re: Serialization with additional schema info

I talked to the IBM guys about this problem with JSON-like formats.
Their answer was that if you care enough, then any compression
algorithm around will compress away the type information. So if you
have a splittable compressed format (bz2 works with hadoop), you are
set except for the compression cost. Decompression cost is usually
compensated for by the I/O advantage.

On Wed, Sep 3, 2008 at 3:52 PM, Jay Kreps wrote:
> ...
>
> Thanks for the pointer to jaql, that seems very cool, but I believe
> jaql would have the same problem if they tried to implement any kind
> of compact structured storage. Jaql would return a JArray or JRecord
> which might have a variety of fields and you would want to store the
> data about what kinds of fields separately.
>
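
Editorial sketch (not part of the original thread): the two positions above can be illustrated with a small Python example. It is only a rough illustration under stated assumptions: the schema, the field names, and the helper functions are all hypothetical, and JSON lines stand in for whatever record format is actually used. It keeps the schema separate from the records, rejects a badly typed record before it is written, and then compares how much the per-record field names cost before and after bz2 compression.

```python
import bz2
import json

# Hypothetical schema, stored once and separately from the records.
SCHEMA = [("user_id", int), ("name", str), ("scores", list)]


def validate(record):
    """Raise TypeError if a record does not match SCHEMA.

    Only top-level field types are checked; element types inside
    the list field are not inspected in this sketch.
    """
    if len(record) != len(SCHEMA):
        raise TypeError(f"expected {len(SCHEMA)} fields, got {len(record)}")
    for (field, expected), value in zip(SCHEMA, record):
        if not isinstance(value, expected):
            raise TypeError(
                f"field {field!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return record


def encode_compact(records):
    """Schema-first encoding: validate, then write values only."""
    lines = (json.dumps(list(validate(r))) for r in records)
    return "\n".join(lines).encode("utf-8")


def encode_self_describing(records):
    """JSON-like encoding: every record repeats its field names."""
    names = [name for name, _ in SCHEMA]
    lines = (json.dumps(dict(zip(names, r))) for r in records)
    return "\n".join(lines).encode("utf-8")


# Synthetic data, made up for the comparison.
records = [(i, f"user{i}", [i % 7, i % 11]) for i in range(1000)]

compact = encode_compact(records)
verbose = encode_self_describing(records)

# Extra bytes spent on repeated field names, raw and after bz2.
raw_overhead = len(verbose) - len(compact)
bz2_overhead = len(bz2.compress(verbose)) - len(bz2.compress(compact))
print("raw overhead:", raw_overhead, "bz2 overhead:", bz2_overhead)
```

On repetitive data like this, the compressed overhead should come out far smaller than the raw overhead, which is the point attributed to the IBM group. The call to validate() is what the schema-first approach adds: a record such as (1, 2, []) fails with a TypeError before anything is written, whereas the self-describing encoder would accept it.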