mahout-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: About the Bayes TrainerDriver
Date Wed, 07 May 2008 19:17:40 GMT
Thanks Andrzej!  That makes sense.  I hope to look at making the
Classifier M/R ready in about 1.5 weeks (relatives in town), but if
someone else wants to tackle it sooner, by all means, jump in.

-Grant

On May 6, 2008, at 4:16 PM, Andrzej Bialecki wrote:

> Grant Ingersoll wrote:
>> On May 6, 2008, at 8:04 AM, phonechen wrote:
>>> Sorry, I made a mistake.
>>> What I mean is: shall we put the doc to be classified on HDFS,
>>> leave the model files on HDFS,
>>> and make the whole classification process run on HDFS?
>>> So the change would be:
>>> =====================
>>> Configuration conf = new JobConf();
>>> FileSystem raw = new RawLocalFileSystem();
>>> raw.setConf(conf);
>>> FileSystem fs = new LocalFileSystem(raw);
>>> =====================
>>> to
>>> =====================
>>> Configuration conf = new JobConf();
>>> FileSystem fs = new DistributedFileSystem();
>>> =====================
>
> Speaking as a Hadoop developer ... you should do neither, i.e. you
> should not explicitly instantiate any FileSystem implementations.
> There are many reasons for this (object pooling, cleanup, caching,
> etc.).
>
> The canonical idiom for this is the following:
>
> 	FileSystem fs = FileSystem.get(conf);
>
> This way you get either a local FS, or HDFS, or Amazon S3, or KFS,  
> or whatever else is configured as the default filesystem. The  
> benefit is obvious - you don't have to change the code if your  
> configuration changes, i.e. you can transparently move your  
> application from local FS to DFS or S3. Some FS implementations may  
> use pooling, which happens behind the scenes if you use the above.
>
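As a side note, the "default filesystem" that FileSystem.get(conf) returns is selected by the fs.default.name property in the Hadoop configuration (hadoop-site.xml in the 0.x releases). A hypothetical fragment, reusing the namenode:9000 address from the example below:

```xml
<!-- hadoop-site.xml (illustrative values only) -->
<property>
  <name>fs.default.name</name>
  <!-- Make HDFS at namenode:9000 the default FS;
       "file:///" here would make the local FS the default. -->
  <value>hdfs://namenode:9000</value>
</property>
```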
> If you really, really need a local fs, you should use the following  
> idiom:
>
> 	LocalFileSystem localFS = FileSystem.getLocal(conf);
>
> Depending on the configuration (and Hadoop version) you could get  
> different subclasses of a local FS.
>
> Now, what do you do if you use something (e.g. HDFS) by default, but
> you want to make sure that you retrieve some resource that resides
> on a specific other FS? You should use a fully qualified URI when
> constructing a Path, i.e. a URI that also contains a scheme.
>
> Example:
>
> 	Path localPath = new Path("file:///etc/hosts");
> 	Path hdfsPath = new Path("hdfs://namenode:9000/user/data/file");
>
> localPath will use a LocalFileSystem, no matter what FS is the
> default, and hdfsPath will use a DistributedFileSystem that can be
> reached at the host "namenode" and port 9000, no matter what the
> current FS configuration is.
>
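The scheme-and-authority mechanics behind this can be illustrated with plain java.net.URI (a sketch of the concept only, not Hadoop's actual Path internals):

```java
import java.net.URI;

public class SchemeDemo {
    public static void main(String[] args) {
        // A fully qualified path carries the scheme and authority
        // that identify which filesystem it belongs to.
        URI hdfsUri = URI.create("hdfs://namenode:9000/user/data/file");
        System.out.println(hdfsUri.getScheme()); // hdfs
        System.out.println(hdfsUri.getHost());   // namenode
        System.out.println(hdfsUri.getPort());   // 9000

        // A local path uses the "file" scheme and no authority.
        URI localUri = URI.create("file:///etc/hosts");
        System.out.println(localUri.getScheme()); // file
        System.out.println(localUri.getPath());   // /etc/hosts
    }
}
```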
> And finally - to learn which FileSystem a given Path refers to, do
> the following:
>
> 	Path unqualified = new Path("/etc/hosts");
> 	FileSystem fs = unqualified.getFileSystem(conf);
>
> You can also make a fully qualified path from a path that is missing
> an explicit scheme, and may be relative to your current working
> directory:
>
> 	Path unqualified = new Path("test");
> 	Path qualified = unqualified.makeQualified(fs);
>
> If your cwd=/home/nutch and your file system is local, then  
> qualified.toString() would give "file:///home/nutch/test".
>
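The qualification step is essentially URI resolution against the default scheme and the working directory; a rough analogue in plain Java, assuming cwd=/home/nutch on a local filesystem as in the example above:

```java
import java.net.URI;

public class QualifyDemo {
    public static void main(String[] args) {
        // Working directory expressed as a fully qualified URI
        // (trailing slash so relative names resolve underneath it).
        URI cwd = URI.create("file:///home/nutch/");

        // Resolving a relative name mimics what Path.makeQualified(fs)
        // does: prepend scheme and working directory.
        URI qualified = cwd.resolve("test");
        System.out.println(qualified); // file:///home/nutch/test
    }
}
```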
> Hope this helps ...
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
