hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: question on FileInputFormat.addInputPath and data access
Date Wed, 24 Oct 2012 14:40:51 GMT
Hi Andy,

Inline.

On Wed, Oct 24, 2012 at 7:53 PM, Kartashov, Andy <Andy.Kartashov@mpac.ca> wrote:
> Gents,
>
> Two questions:
>
> 1.       Say you have 5 folders with input data
> (fold1,fold2,fold3,....,fold5) in your HDFS on a pseudo-distributed cluster.
>
> You will write your MR job to access your files by listing them in :
>
> FileInputFormat.addInputPaths(job, "fold1, fold2, fold3…,fold5");
>
> Q: Is there a way to move the above folders into a parent folder, say
> “the_folder”, so that the directory structure becomes the_folder/fold1,
> the_folder/fold2...? Would it then be possible to access the files with
> something like: FileInputFormat.addInputPaths(job, "the_folder/*"); or similar?
>
> I am asking in case your input folders list grows too long. How to curb
> that?

Yes, the FileInputFormat.addInputPath(…) API [1] supports glob
patterns, so you can pass it a Path object such as "the_folder/*/*".

[1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)
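
To illustrate which paths such a pattern selects, here is a minimal sketch. It is only a stand-in: it uses the JDK's own glob matcher (java.nio.file.PathMatcher) rather than Hadoop's glob code, and Hadoop's pattern syntax may differ in details. The class name GlobSketch and the part-00000 file names are hypothetical. It assumes a Unix-like path separator and only shows that a single * stays within one path component.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlobSketch {
    // Returns the candidate paths that the given glob pattern matches,
    // using the JDK matcher as an illustration (not Hadoop's implementation).
    static List<String> select(String glob, List<String> candidates) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical layout: two files under the parent folder, one bare
        // directory path, and one file outside the parent folder.
        List<String> paths = Arrays.asList(
            "the_folder/fold1/part-00000",
            "the_folder/fold2/part-00000",
            "the_folder/fold1",
            "elsewhere/fold9/part-00000");

        // One "*" per path level: folder level, then file level.
        System.out.println(select("the_folder/*/*", paths));

        // In a Hadoop job the same pattern would be handed over as a Path,
        // along the lines of the reply above:
        //   FileInputFormat.addInputPath(job, new Path("the_folder/*/*"));
    }
}
```

With this layout the pattern picks up the two files below the_folder and skips both the bare directory path and the file under elsewhere, so the input list does not grow as more foldN directories are added.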

> 2.       Hypothetically speaking, in a fully-distributed cluster your folders
> with Data are located as follows:  Node1: (fold1,fold2,fold3) and
> Node2: (fold4, fold5)
>
> Q: Do we change the command below, or will NN and JT take care of locating
> those files?
>
> FileInputFormat.addInputPaths(job, "fold1, fold2, fold3…,fold5");

The command stays the same: HDFS paths refer to the cluster-wide
namespace, not to individual nodes. The JT and NN take care of data
locality for you, so you never need to handle it manually.

>      2a.     Using the Data balancer, which splits input/moves Data across
> additional DNs indicated in conf/slaves, is it possible to run the
> "hdfs dfs -ls -r" command on a slave node that runs a DN on a separate
> machine? I have

Yes, you can run regular HDFS client operations (such as ls, cat, job
submission) from any machine, whether or not that machine is a slave or
master node. How a client program accesses the cluster does not depend
on files such as conf/slaves or on the role of the node it runs on.

> Cheers,
>
> AK



-- 
Harsh J
