From: Alex Loddengaard
To: common-user@hadoop.apache.org
Date: Fri, 2 Jul 2010 15:14:02 -0700
Subject: Re: Text files vs. SequenceFiles
In-Reply-To: <4C2E5FFB.4000501@darose.net>

Hi David,

On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch wrote:

> * We should use a SequenceFile (binary) format as it's faster for the
> machine to read than parsing text, and the files are smaller.
>
> * We should use a text file format as it's easier for humans to read,
> easier to change, text files can be compressed quite small, and a) if the
> text format is designed well and b) given the context of a distributed
> system like Hadoop where you can throw more nodes at a problem, the text
> parsing time will wind up being negligible/irrelevant in the overall
> processing time.

SequenceFiles can also be compressed, either per record or per block.
This is advantageous if you want to use gzip, because a plain gzipped
file isn't splittable. A block-compressed SequenceFile is therefore
splittable, because each block is gzipped on its own rather than the
entire file being gzipped as one stream.

As for readability, "hadoop fs -text" prints a SequenceFile as text,
the same way "hadoop fs -cat" does for a plain text file.

Lastly, I promise that eventually you'll run out of space in your
cluster and wish you had used better compression. Plus, compression
makes jobs faster.

The general recommendation is to use SequenceFiles as early in your ETL
as possible. Usually people get their data in as text, and after the
first MR pass they work with SequenceFiles from there on out.

Alex
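
To illustrate the splittability point about per-block gzip, here is a minimal Python sketch. It is not the SequenceFile format and does not use any Hadoop API; the record contents, the block size, and the helper names (compress_blocks, read_block) are all made up for the example. It only shows the underlying idea: a block that is gzipped on its own can be decoded without touching the rest of the file, while a reader dropped into the middle of a single whole-file gzip stream cannot decode anything.

```python
import gzip
import zlib

# Hypothetical sample records; any line-oriented data would do.
records = [f"user{i}\tclicked\titem{i % 7}\n".encode() for i in range(1000)]


def compress_blocks(recs, records_per_block=100):
    """Gzip each group of records independently.

    Returns the concatenated bytes plus a side index of (offset, length)
    per block. The side index is an assumption of this sketch; it stands
    in for whatever mechanism a real container uses to find block starts.
    """
    out = bytearray()
    index = []
    for start in range(0, len(recs), records_per_block):
        block = gzip.compress(b"".join(recs[start:start + records_per_block]))
        index.append((len(out), len(block)))
        out += block
    return bytes(out), index


def read_block(data, index, n):
    """Decode block n alone, without reading any earlier block."""
    offset, length = index[n]
    return gzip.decompress(data[offset:offset + length])


# Per-block compression: a worker can be handed block 5 and nothing else.
blocked, index = compress_blocks(records)
block5 = read_block(blocked, index, 5)

# Whole-file compression: starting halfway into the stream fails, which is
# why such a file has to go to a single reader from the beginning.
whole = gzip.compress(b"".join(records))
try:
    gzip.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except (OSError, EOFError, zlib.error):
    mid_stream_ok = False
```

The trade-off is that smaller blocks give more split points but usually compress a little worse, since each block starts with an empty compression dictionary. Real SequenceFiles find block boundaries with sync markers embedded in the file rather than the external index used in this sketch.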