Subject: Re: Sorting the inputSplits
From: Nishanth S
To: user@hadoop.apache.org
Date: Tue, 18 Aug 2015 14:25:30 -0600

Thank you. I have explained the problem in more detail below. Is this possible?

We have a use case where the files are laid out in the directory structure shown below. The requirement is that files inside the same parent directory must not be processed in parallel: 1.txt and 2.txt cannot be processed in parallel because we need to do some checkpointing, so the oldest file has to be processed first. However, 1.txt and 5.txt can be processed in parallel. Right now I am overriding the listStatus method to pick only the oldest file, but this means I cannot achieve parallelism across parent directories either, since the number of input splits is always 1.

What would be the way to go about this use case? In short, I want to achieve parallelism across parent directories but not within one. Please advise.

published/
+-- Parent1/
¦   +-- 1.txt
¦   +-- 2.txt
¦   +-- 3.txt
+-- Parent2/
    +-- 4.txt
    +-- 5.txt

On Wed, Jul 29, 2015 at 5:31 PM, Gera Shegalov wrote:

> Can you clarify the requirement "processed first"? Maps run in parallel
> without any ordering guarantees. If you want to affect the file->split
> number mapping, you can implement your own getSplits in the custom
> input format and return the splits ordered any way you like.
>
> On Wed, Jul 22, 2015 at 12:06 PM, Nishanth S wrote:
>
>> Hey folks,
>>
>> Is there a way to sort the input splits in MapReduce? We have a case
>> where there are two files, file1 and file2, in the input directory. Since
>> we have a custom input format whose isSplitable always returns false,
>> each of these files would be processed by a different mapper. How could
>> I make sure that file1 is processed before file2 (I want the oldest file
>> to be processed first)? Is this possible?
>>
>> Thanks,
>> Nishan
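
For reference, here is a minimal sketch of one way the getSplits suggestion could be applied to the directory layout above. It is not a confirmed solution from the thread. It only models the grouping logic in plain Java: files are grouped by parent directory, and each group is ordered oldest first. The assumption is that a custom input format's getSplits would return one split per group (for example a CombineFileSplit) and that a matching record reader would read the group's files in the listed order; that Hadoop wiring is not shown. The names ParentGroupedSplits, InputFile, and planSplits, and the modification times, are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of the grouping a custom getSplits could perform (illustrative only). */
class ParentGroupedSplits {

    /** Hypothetical stand-in for the path and modification time carried by a Hadoop FileStatus. */
    static final class InputFile {
        final String parent;
        final String name;
        final long modificationTime;

        InputFile(String parent, String name, long modificationTime) {
            this.parent = parent;
            this.name = name;
            this.modificationTime = modificationTime;
        }
    }

    /**
     * Returns one group per parent directory. Each group models a single
     * split (and therefore a single mapper) and lists its files oldest first.
     */
    static List<List<String>> planSplits(List<InputFile> files) {
        // TreeMap keeps the parent directories in a stable, sorted order.
        Map<String, List<InputFile>> byParent = new TreeMap<>();
        for (InputFile f : files) {
            byParent.computeIfAbsent(f.parent, k -> new ArrayList<>()).add(f);
        }

        List<List<String>> splits = new ArrayList<>();
        for (List<InputFile> group : byParent.values()) {
            // Oldest file first within a parent directory.
            group.sort(Comparator.comparingLong((InputFile f) -> f.modificationTime));
            List<String> paths = new ArrayList<>();
            for (InputFile f : group) {
                paths.add(f.parent + "/" + f.name);
            }
            splits.add(paths);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up modification times; a real job would take them from the file system.
        List<InputFile> files = new ArrayList<>();
        files.add(new InputFile("published/Parent1", "2.txt", 200L));
        files.add(new InputFile("published/Parent2", "5.txt", 500L));
        files.add(new InputFile("published/Parent1", "1.txt", 100L));
        files.add(new InputFile("published/Parent2", "4.txt", 400L));
        files.add(new InputFile("published/Parent1", "3.txt", 300L));

        // Prints one line per planned split.
        for (List<String> split : planSplits(files)) {
            System.out.println(split);
        }
    }
}
```

With one split per parent directory, Parent1 and Parent2 could be handled by separate mappers, while the files inside each parent would be read one after another by the same mapper. Whether that ordering is enough for the checkpointing requirement depends on details the thread does not state.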