From: "Kartashov, Andy"
To: user@hadoop.apache.org
Subject: question on FileInputFormat.addInputPath and data access
Date: Wed, 24 Oct 2012 14:23:27 +0000

Gents,

Two questions:

1.       Say you have 5 folders with input data (fold1, fold2, fold3, ..., fold5) in your HDFS on a pseudo-distributed cluster.

You write your MR job to access the files by listing them in:

FileInputFormat.addInputPaths(job, "fold1,fold2,fold3,...,fold5");

Q: Is there a way to move the above folders into a parent folder, say "the_folder", so that the directory structure becomes the_folder/fold1, the_folder/fold2, ...? Would it then be possible to access the files with something like FileInputFormat.addInputPaths(job, "the_folder/*"); or similar?

I am asking in case the list of input folders grows too long. How can that be kept manageable?
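For what it's worth, Hadoop's FileInputFormat does accept glob patterns in input paths (they are expanded through FileSystem.globStatus at job-submission time), so after moving the five folders under a parent, a single pattern like "the_folder/*" should match all of them. The same glob idea can be illustrated locally with plain java.nio, no cluster required (the directory names below are just placeholders mirroring the question):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class GlobDemo {
    // Create the_folder/fold1..fold5 in a temp dir, then expand the glob
    // "fold*" - one short pattern instead of an ever-growing explicit list.
    static List<String> expand() {
        try {
            Path base = Files.createTempDirectory("glob_demo");
            Path parent = Files.createDirectory(base.resolve("the_folder"));
            for (int i = 1; i <= 5; i++) {
                Files.createDirectory(parent.resolve("fold" + i));
            }
            List<String> matches = new ArrayList<>();
            // DirectoryStream takes a glob pattern, similar in spirit to
            // FileInputFormat's path globbing on HDFS.
            try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(parent, "fold*")) {
                for (Path p : stream) {
                    matches.add(p.getFileName().toString());
                }
            }
            matches.sort(null);
            return matches;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expand());
    }
}
```

This is only a local sketch of the glob concept; on HDFS the equivalent would be passing the pattern string straight to addInputPaths.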

2.       Hypothetically speaking, on a fully-distributed cluster the folders with data are located as follows: Node1: (fold1, fold2, fold3) and Node2: (fold4, fold5).

Q: Do we change the command below, or will the NameNode (NN) and JobTracker (JT) take care of locating those files?

FileInputFormat.addInputPaths(job, "fold1,fold2,fold3,...,fold5");
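As an aside, HDFS paths name locations in a single cluster-wide namespace, so the client-side call normally stays the same regardless of which DataNode physically stores the blocks. One small caveat with the comma-separated form: the string passed to addInputPaths is split on commas, and spaces after the commas may end up as part of the path names, so it is safer to build the string without spaces. A local sketch of building such a list programmatically (pure Java, folder names are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class InputPathList {
    // Build "fold1,fold2,...,foldN" with no spaces, so each comma-separated
    // token is a clean path name when the framework splits it back apart.
    static String buildPaths(int n) {
        return IntStream.rangeClosed(1, n)
                .mapToObj(i -> "fold" + i)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String paths = buildPaths(5);
        System.out.println(paths);
        // Splitting back yields exactly the original names, no stray spaces.
        List<String> parts = Arrays.asList(paths.split(","));
        System.out.println(parts.size());
    }
}
```

The resulting string would then be handed to FileInputFormat.addInputPaths(job, paths) in the actual driver.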

     2a.     Using the data balancer, which splits input / moves data across the additional DNs indicated in conf/slaves, is it possible to run the "hdfs dfs -ls -R" command on the slave node that runs a DN on a separate machine?
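For reference, the recursive-listing flag on `hdfs dfs -ls` is an uppercase -R (older releases also offered `hadoop fs -lsr`). The listing is served by the NameNode, not the local DataNode, so any machine with the Hadoop client and the cluster's configuration can issue it, slave nodes included. A sketch of the invocation (the path is hypothetical, and this needs a running cluster, so it is not runnable standalone):

```shell
# Recursive listing of an HDFS directory; answered by the NameNode,
# so it works from any node with client config, including a DN slave.
hdfs dfs -ls -R /user/ak/the_folder
```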

Cheers,

AK
