Subject: Re: HDFS - millions of files in one directory?
From: jason hadoop
To: core-user@hadoop.apache.org
Date: Mon, 26 Jan 2009 09:53:35 -0800

We like compression if the data is readily compressible and large, as it
saves on I/O time.

On Mon, Jan 26, 2009 at 9:35 AM, Mark Kerzner wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I better use uncompressed data if I am not interested in
> saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heritrix,
> >> Nutch,
> >> others use the ARC file format
> >> http://www.archive.org/web/researcher/ArcFileFormat.php
> >> http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages. The keys of crawl content files are URLs and the
> > values are:
> >
> > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the values are decompressed on demand,
> > which needlessly complicates its implementation and API. It's basically a
> > Writable that stores binary content plus headers, typically an HTTP
> > response.
> >
> > Doug
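
As a rough illustration of the "readily compressible and large" rule of thumb from the top of this message, here is a minimal Python sketch. It is not Hadoop code and does not touch SequenceFile; the function names, the 1024-byte size cutoff, and the 0.8 ratio threshold are all assumptions made up for the example. It only shows one way to test a sample of the data before deciding to turn compression on.

```python
import zlib


def compression_ratio(sample: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size for a sample."""
    if not sample:
        return 1.0  # nothing to gain on empty input
    return len(zlib.compress(sample, level)) / len(sample)


def worth_compressing(sample: bytes,
                      min_size: int = 1024,
                      max_ratio: float = 0.8) -> bool:
    """Hypothetical rule of thumb: compress only data that is both
    large (>= min_size bytes) and readily compressible (ratio <= max_ratio).
    The thresholds are illustrative assumptions, not Hadoop defaults."""
    return len(sample) >= min_size and compression_ratio(sample) <= max_ratio


# A repetitive HTML-like page stands in for crawled content.
html_like = b"<html><body>" + b"<p>hello hadoop</p>" * 500 + b"</body></html>"
tiny = b"abc"

print(worth_compressing(html_like))
print(worth_compressing(tiny))
```

Running the check on a representative sample of real values, rather than on synthetic data like the above, would give a better idea of whether the I/O savings are likely to matter.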