Date: Fri, 12 Aug 2011 06:28:00 -0400
Subject: Re: Hadoop--store a sequence file in distributed cache?
From: Joey Echeverria
To: common-user@hadoop.apache.org, Sofia Georgiakaki

You can use any kind of format for files in the distributed cache, so
yes you can use sequence files. They should be faster to parse than
most text formats.

-Joey

On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki wrote:
> Thank you for the reply!
> In each map(), I need to open-read-close these files (more than 2 in the
> general case, and maybe up to 20 or more), in order to make some checks.
> Considering the huge amount of data in the input, making all these file
> operations on HDFS will kill the performance!!! So I think it would be
> better to store these files in distributed Cache, so that the whole process
> would be more efficient -I guess this is the point of using Distributed
> Cache in the first place!
>
> My question is, if I can store sequence files in distributed Cache and
> handle them using e.g. the SequenceFile.Reader class, or if I should only
> keep regular text files in distributed Cache and handle them using the
> usual java API.
>
> Thank you very much
> Sofia
>
> PS: The files have small size, a few KB to few MB maximum.
>
>
>
> ________________________________
> From: Dino Kečo
> To: common-user@hadoop.apache.org; Sofia Georgiakaki
> Sent: Friday, August 12, 2011 11:30 AM
> Subject: Re: Hadoop--store a sequence file in distributed cache?
>
> Hi Sofia,
>
> I assume that output of first job is stored on HDFS. In that case I would
> directly read file from Mappers without using distributed cache. If you put
> file into distributed cache that would add one more copy operation into your
> process.
>
> Thanks,
> dino
>
>
> On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki wrote:
>
>> Good morning,
>>
>> I would like to store some files in the distributed cache, in order to be
>> opened and read from the mappers.
>> The files are produced by another Job and are sequence files.
>> I am not sure if that format is proper for the distributed cache, as the
>> files in distr.cache are stored and read locally. Should I change the format
>> of the files in the previous Job and make them Text Files maybe and read
>> them from the Distr.Cache using the simple Java API?
>> Or can I still handle them with the usual way we use sequence files, even
>> if they reside in the local directory? Performance is extremely important
>> for my project, so I don't know what the best solution would be.
>>
>> Thank you in advance,
>> Sofia Georgiakaki

-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
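
Since the side files in this thread are small (a few KB to a few MB) and the concern is the cost of open-read-close in every map() call, one option is to read the local copies once per task and keep the contents in memory. Below is a rough sketch of that pattern in plain Java only, so it runs without Hadoop on the classpath. The class name SideDataCache, the tab-separated "key<TAB>value" line format, and the method names are all assumptions made for illustration; none of them come from the thread or from Hadoop. For sequence files the parsing step would go through SequenceFile.Reader, as mentioned above, instead of reading text lines.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): keeps small side files in memory
// so each record can be checked without reopening any file.
class SideDataCache {
    private final Map<String, String> entries = new HashMap<>();

    // Parse "key<TAB>value" lines (an assumed format); lines with no tab
    // or with an empty key are skipped.
    static SideDataCache fromLines(List<String> lines) {
        SideDataCache cache = new SideDataCache();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                cache.entries.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return cache;
    }

    // Read every local file exactly once; in a real job this would run once
    // per task, before the first map() call, on the locally cached copies.
    static SideDataCache load(List<Path> localFiles) {
        List<String> lines = new ArrayList<>();
        for (Path file : localFiles) {
            try {
                lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return fromLines(lines);
    }

    boolean contains(String key) { return entries.containsKey(key); }

    String get(String key) { return entries.get(key); }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        // Stand-in for the per-record checks done inside map().
        SideDataCache cache = fromLines(Arrays.asList("a\t1", "b\t2"));
        for (String recordKey : new String[] {"a", "c"}) {
            System.out.println(recordKey + " present: " + cache.contains(recordKey));
        }
    }
}
```

With up to 20 or so files of a few MB each, holding everything in a map like this should fit comfortably in a task's heap, but that is worth measuring on the real data before relying on it.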