From: Harsh J
Date: Mon, 17 Sep 2012 19:16:41 +0530
Subject: Re: Reading from HDFS from inside the mapper
To: user@hadoop.apache.org

Sigurd,

The implementation of fs -ls in the LocalFileSystem relies on Java's
File#list (http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()),
which states: "There is no guarantee that the name strings in the resulting
array will appear in any specific order; they are not, in particular,
guaranteed to appear in alphabetical order." That may just be what is
biting you, since standalone mode uses the LocalFileSystem.
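If your own code needs a stable ordering, don't depend on the listing
order at all; sort the statuses yourself. A rough, untested sketch using
the plain FileSystem API (the class name is made up):

    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists a directory in lexicographic order, regardless of what the
    // underlying FileSystem implementation happens to return.
    public class SortedLs {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] statuses = fs.listStatus(new Path(args[0]));
        // LocalFileSystem inherits File#list()'s unspecified ordering,
        // so impose an explicit order before relying on it.
        Arrays.sort(statuses, new Comparator<FileStatus>() {
          public int compare(FileStatus a, FileStatus b) {
            return a.getPath().getName().compareTo(b.getPath().getName());
          }
        });
        for (FileStatus status : statuses) {
          System.out.println(status.getPath());
        }
      }
    }

Since the part-NNNNN names are zero-padded, lexicographic order is also
their numeric order.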
On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann wrote:
> I've tracked down the problem: it occurs only in standalone mode. In
> pseudo-distributed mode, everything works fine. My underlying OS is
> Ubuntu 12.04 64-bit. When I access the directory in Linux directly,
> everything looks normal; it is only when I access it through Hadoop.
> Has anyone seen this problem before and found a solution?
>
> Thanks,
> Sigurd
>
>
> 2012/9/17 Sigurd Spieckermann
>>
>> I'm experiencing a strange problem right now. I'm writing part-files
>> to HDFS to provide initial data and (though it should not make a
>> difference anyway) write them in ascending order, i.e. part-00000,
>> part-00001, etc. But when I run "hadoop dfs -ls xyz", they appear in
>> the order part-00001, part-00000, part-00002, part-00003, etc. How is
>> that possible? Why aren't they shown in natural order? The map-side
>> join package also considers them in this order, which causes problems.
>>
>>
>> 2012/9/10 Sigurd Spieckermann
>>>
>>> OK, interesting. Just to confirm: is it okay to distribute quite
>>> large files through the DistributedCache? Dataset B could be on the
>>> order of gigabytes. Also, if I have far fewer nodes than
>>> elements/blocks in A, the probability that every node will have to
>>> read (almost) every block of B is quite high, so assuming the
>>> DistributedCache is acceptable here in general, it would be more
>>> efficient to use it than to read from HDFS. But what about the case
>>> where I have m*n nodes? Then every node would receive all of B while
>>> only needing a small fraction of it, right? Could you elaborate on
>>> this in a few sentences, just so I can be sure I understand Hadoop
>>> correctly?
>>>
>>> Thanks,
>>> Sigurd
>>>
>>> 2012/9/10 Harsh J
>>>>
>>>> Sigurd,
>>>>
>>>> Hemanth's recommendation of DistributedCache does fit your
>>>> requirement - it is a generic way of distributing files and archives
>>>> to the tasks of a job. It does not push things into memory
>>>> automatically; it places them on the local disk of the TaskTracker
>>>> your task runs on. You can then choose to use a LocalFileSystem
>>>> impl. to read them out from there, which would end up being
>>>> (slightly) faster than your same approach applied to MapFiles on
>>>> HDFS.
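To make my earlier point above concrete: a rough, untested sketch with
the old mapred API. The class name, record types and the path in the
driver comment are all made up; the point is that the cached file lands
on the task's local disk, so plain java.io can read it:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CacheSideJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private BufferedReader sideData;

      @Override
      public void configure(JobConf conf) {
        try {
          // Paths returned here point at the *local* filesystem copies
          // that the TaskTracker materialized for this task.
          Path[] cached = DistributedCache.getLocalCacheFiles(conf);
          // Plain java.io works because the file sits on local disk.
          sideData = new BufferedReader(new FileReader(cached[0].toString()));
        } catch (IOException e) {
          throw new RuntimeException("Could not open cached file", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // ... consult sideData and emit the joined record ...
      }

      @Override
      public void close() throws IOException {
        sideData.close();
      }
    }

    // Driver side, before submitting the job:
    //   JobConf conf = new JobConf(CacheSideJoinMapper.class);
    //   DistributedCache.addCacheFile(new java.net.URI("/data/B/part-00000"), conf);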
>>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann wrote:
>>>> > I checked DistributedCache, but in general I have to assume that
>>>> > none of the datasets fits in memory... That's why I was considering
>>>> > a map-side join, but by default it doesn't fit my problem. I could
>>>> > probably get it to work, though, but I would have to enforce the
>>>> > requirements of the map-side join.
>>>> >
>>>> >
>>>> > 2012/9/10 Hemanth Yamijala
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> You could check DistributedCache
>>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>>>> >> It would allow you to distribute data to the nodes where your
>>>> >> tasks are run.
>>>> >>
>>>> >> Thanks
>>>> >> Hemanth
>>>> >>
>>>> >>
>>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I would like to perform a map-side join of two large datasets
>>>> >>> where dataset A consists of m*n elements and dataset B consists
>>>> >>> of n elements. For the join, every element of dataset B needs to
>>>> >>> be accessed m times. Each mapper would join one element from A
>>>> >>> with the corresponding element from B. Elements here are actually
>>>> >>> data blocks. Is there a performance problem (and a difference
>>>> >>> compared to a slightly modified map-side join using the join
>>>> >>> package) if I set dataset A as the map-reduce input and load the
>>>> >>> relevant element of dataset B directly from HDFS inside the
>>>> >>> mapper? I could store the elements of B in a MapFile for faster
>>>> >>> random access. In the second case, without the join package, I
>>>> >>> would not have to partition the datasets manually, which would
>>>> >>> allow a bit more flexibility, but I'm wondering whether HDFS
>>>> >>> access from inside a mapper is a strictly bad idea. Also, does
>>>> >>> Hadoop have a cache for such situations, by any chance?
>>>> >>>
>>>> >>> I appreciate any comments!
>>>> >>>
>>>> >>> Sigurd
>>>>
>>>> --
>>>> Harsh J

--
Harsh J
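P.S. On the MapFile question earlier in this thread: a MapFile's index
keeps a lookup from scanning the whole file, so random access from
inside a mapper is workable. A minimal sketch of a single lookup
(untested; the class name and the Text key/value types are assumptions):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Looks up a single key in a MapFile directory on HDFS.
    public class MapFileLookup {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: MapFile directory, args[1]: key to fetch
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
        try {
          Text value = new Text();
          // get() seeks via the in-memory index; null means the key
          // is absent.
          if (reader.get(new Text(args[1]), value) != null) {
            System.out.println(value);
          }
        } finally {
          reader.close();
        }
      }
    }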