hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: do HDFS files starting with _ (underscore) have special properties?
Date Sat, 03 Sep 2011 20:39:43 GMT
Meng,

- Moving this discussion to cdh-user@cloudera.org since it may be CDH
specific at this point. (Link:
https://groups.google.com/a/cloudera.org/group/cdh-user)
- I've bcc'd common-user@ for this mail alone.
- Added you on cc in case you aren't subscribed.

Judging by your version output, you are running CDH2, the older
release of CDH. Would you be able to upgrade your cluster to CDH3?

I haven't tried running against your _exact_ version yet, but against
the latest CDH2 release of HDFS from http://archive.cloudera.com/cdh/2/
it still appears to work fine (same code in the jar as before):

➜  hadoop-0.20.1+169.127 > bin/hadoop jar ~/globtester.jar
hdfs://localhost/user/harshchouraria/_abc
hdfs://localhost/user/harshchouraria/_def
hdfs://localhost/user/harshchouraria/abc
hdfs://localhost/user/harshchouraria/def
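
As an aside on the '_' convention quoted further down: some Hadoop code
paths (the MapReduce FileInputFormat input listing, for instance) skip
names that begin with '_' or '.'. The sketch below mirrors that rule
using only the Java standard library, so it is an illustration and not
Hadoop's API; the names HiddenNames, isHidden and visibleOnly are made
up for this example. If some layer applied a filter like this to the
SOLR index listing, only segments.gen and segments_2 would survive,
which matches the symptom reported below. That is an assumption, not
something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plain-Java illustration of the "hidden name" rule.
final class HiddenNames {
    private HiddenNames() {}

    /** True when the last path component starts with '_' or '.'. */
    static boolean isHidden(String path) {
        // lastIndexOf returns -1 when there is no '/', so the whole
        // string is treated as the file name in that case.
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith("_") || name.startsWith(".");
    }

    /** Returns the paths whose final component is not hidden, in order. */
    static List<String> visibleOnly(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (!isHidden(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String dir = "/test/output/solr-20110901165238/part-00000/data/index/";
        List<String> listing = Arrays.asList(
                dir + "_ox.fdt",
                dir + "_ox.tis",
                dir + "segments.gen",
                dir + "segments_2");
        // Prints only the segments.gen and segments_2 paths.
        for (String p : visibleOnly(listing)) {
            System.out.println(p);
        }
    }
}
```

A utility that needs the '_' files would have to make sure no such
filter sits between it and the listing call.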

On Sun, Sep 4, 2011 at 12:04 AM, Meng Mao <mengmao@gmail.com> wrote:
> I get the opposite behavior --
>
> [this is more or less how I listed the files in the original email]
> hadoop dfs -ls /test/output/solr-20110901165238/part-00000/data/index/*
> -rw-r--r--   2 hadoopuser visible 8538430603 2011-09-01 18:58
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fdt
> -rw-r--r--   2 hadoopuser visible  233396596 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fdx
> -rw-r--r--   2 hadoopuser visible        130 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fnm
> -rw-r--r--   2 hadoopuser visible 2147948283 2011-09-01 18:55
> /test/output/solr-20110901165238/part-00000/data/index/_ox.frq
> -rw-r--r--   2 hadoopuser visible   87523726 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.nrm
> -rw-r--r--   2 hadoopuser visible  920936168 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.prx
> -rw-r--r--   2 hadoopuser visible   22619542 2011-09-01 18:58
> /test/output/solr-20110901165238/part-00000/data/index/_ox.tii
> -rw-r--r--   2 hadoopuser visible 2070214402 2011-09-01 18:51
> /test/output/solr-20110901165238/part-00000/data/index/_ox.tis
> -rw-r--r--   2 hadoopuser visible         20 2011-09-01 18:51
> /test/output/solr-20110901165238/part-00000/data/index/segments.gen
> -rw-r--r--   2 hadoopuser visible        282 2011-09-01 18:55
> /test/output/solr-20110901165238/part-00000/data/index/segments_2
>
> Whereas my globStatus doesn't capture them.
>
> I thought we were on Cloudera's CDH3, but now I'm not sure. This is what
> version reports:
> $ hadoop version
> Hadoop 0.20.1+169.56
> Subversion  -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3
> Compiled by root on Tue Feb  9 13:40:08 EST 2010
>
>
>
>
>
> On Fri, Sep 2, 2011 at 11:45 PM, Harsh J <harsh@cloudera.com> wrote:
>
>> Meng,
>>
>> What version of Hadoop are you on? I'm able to use globStatus(Path)
>> to list '_' entries successfully with a '*' glob, although FsShell's
>> ls utility does not return the same results (which is odd here!).
>>
>> Here's my test code which can validate that the listing is indeed
>> done: http://pastebin.com/vCbd2wmK
>>
>> $ hadoop dfs -ls
>> Found 4 items
>> drwxr-xr-x   - harshchouraria supergroup          0 2011-09-03 09:09
>> /user/harshchouraria/_abc
>> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
>> /user/harshchouraria/_def
>> drwxr-xr-x   - harshchouraria supergroup          0 2011-09-03 08:10
>> /user/harshchouraria/abc
>> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
>> /user/harshchouraria/def
>>
>>
>> $ hadoop dfs -ls '*'
>> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
>> /user/harshchouraria/_def
>> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
>> /user/harshchouraria/def
>>
>> $ # No dir results! ^^
>>
>> $ hadoop jar myjar.jar # (My code)
>> hdfs://localhost/user/harshchouraria/_abc
>> hdfs://localhost/user/harshchouraria/_def
>> hdfs://localhost/user/harshchouraria/abc
>> hdfs://localhost/user/harshchouraria/def
>>
>> I suppose that means globStatus is fine, but the FsShell.ls(…) code
>> does something more than a simple glob status, and filters away
>> directory results when used with a glob.
>>
>> On Sat, Sep 3, 2011 at 3:07 AM, Meng Mao <mengmao@gmail.com> wrote:
>> > Is there a programmatic way to access these hidden files then?
>> >
>> > On Fri, Sep 2, 2011 at 5:20 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>> >
>> >> On Fri, Sep 2, 2011 at 4:04 PM, Meng Mao <mengmao@gmail.com> wrote:
>> >>
>> >> > We have a compression utility that tries to grab all subdirs to a
>> >> > directory on HDFS. It makes a call like this:
>> >> > FileStatus[] subdirs = fs.globStatus(new Path(inputdir, "*"));
>> >> >
>> >> > and handles files vs dirs accordingly.
>> >> >
>> >> > We tried to run our utility against a dir containing a computed SOLR
>> >> > shard, which has files that look like this:
>> >> > -rw-r--r--   2 hadoopuser visible 8538430603 2011-09-01 18:58
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.fdt
>> >> > -rw-r--r--   2 hadoopuser visible  233396596 2011-09-01 18:57
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.fdx
>> >> > -rw-r--r--   2 hadoopuser visible        130 2011-09-01 18:57
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.fnm
>> >> > -rw-r--r--   2 hadoopuser visible 2147948283 2011-09-01 18:55
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.frq
>> >> > -rw-r--r--   2 hadoopuser visible   87523726 2011-09-01 18:57
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.nrm
>> >> > -rw-r--r--   2 hadoopuser visible  920936168 2011-09-01 18:57
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.prx
>> >> > -rw-r--r--   2 hadoopuser visible   22619542 2011-09-01 18:58
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.tii
>> >> > -rw-r--r--   2 hadoopuser visible 2070214402 2011-09-01 18:51
>> >> > /test/output/solr-20110901165238/part-00000/data/index/_ox.tis
>> >> > -rw-r--r--   2 hadoopuser visible         20 2011-09-01 18:51
>> >> > /test/output/solr-20110901165238/part-00000/data/index/segments.gen
>> >> > -rw-r--r--   2 hadoopuser visible        282 2011-09-01 18:55
>> >> > /test/output/solr-20110901165238/part-00000/data/index/segments_2
>> >> >
>> >> >
>> >> > The globStatus call seems only able to pick up those last 2 files;
>> >> > the several files that start with _ don't register.
>> >> >
>> >> > I've skimmed the FileSystem and GlobExpander source to see if there's
>> >> > anything related to this, but didn't see it. Google didn't turn up
>> >> > anything about underscores. Am I misunderstanding something about the
>> >> > regex patterns needed to pick these up or unaware of some filename
>> >> > convention in HDFS?
>> >> >
>> >>
>> >> Files starting with '_' are considered 'hidden' like unix files starting
>> >> with '.'. I did not know that for a very long time because not everyone
>> >> follows this rule or even knows about it.
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>



-- 
Harsh J
