hadoop-hdfs-user mailing list archives

From Ittai Zeidman <it...@fashiontraffic.com>
Subject HDFS directory locking and directory polling
Date Tue, 21 Aug 2012 03:42:19 GMT
I'm a noob to hdfs so I might not have understood its purpose, but I've read
around and I think it can solve a problem I'm having.
I'm not sure and so I'd appreciate feedback whether this is the right
My current use case involves two machines, one hosting an ftp server and
another hosting my application.
Entities external to this issue constantly upload files to the ftp server.
My application listens recursively on a specific directory on the ftp server
via apache camel, downloads each new file locally as soon as it appears, and
processes it.
I now need to scale my application so it resides on two different machines
and the immediate problem arises from the following:
Given the following directory structure
At any given time only one file from each directory can be processed by any
machine, and a given file can be processed by only a single machine
(barring failure and retry).
This problem is currently solved by an in-memory map from dir name to a
queue of files, so I only work on one file per dir at a time, and since
there is only one machine it is by definition the only one holding the
files.
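
To make the current approach concrete, here is a minimal sketch of that map. It is in Python only for brevity; the class and method names are hypothetical and not taken from my real application, which is camel based.

```python
from collections import defaultdict, deque


class PerDirectoryScheduler:
    """Allows at most one in-flight file per directory (single process only)."""

    def __init__(self):
        self._pending = defaultdict(deque)  # dir name -> queued file names
        self._busy = set()                  # dirs that have a file in progress

    def add(self, directory, filename):
        """Queue a newly discovered file under its directory."""
        self._pending[directory].append(filename)

    def next_file(self):
        """Return (directory, filename) for an idle directory, or None."""
        for directory, queue in self._pending.items():
            if queue and directory not in self._busy:
                self._busy.add(directory)
                return directory, queue.popleft()
        return None

    def done(self, directory):
        """Mark the directory idle so its next queued file can be picked up."""
        self._busy.discard(directory)
```

All of this state lives in one process, which is exactly why it cannot coordinate two machines.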
The single-machine guarantee of course no longer holds, or at least does not
scale well, if multiple machines poll the same ftp server.
What I'm interested in doing with hdfs is to have my applications
recursively poll a directory on hdfs, and once a new file appears they should
try to lock its directory; whoever wins gets to copy the file.
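
The polling half of this could look roughly like the sketch below. It walks a local directory tree with Python's standard library as a stand-in, since I don't yet know what the hdfs equivalent looks like; `snapshot` and `poll_once` are hypothetical helper names.

```python
import os


def snapshot(root):
    """Return the set of file paths currently under root, recursively."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def poll_once(root, seen):
    """Return files that appeared since the last poll and record them in seen."""
    current = snapshot(root)
    new_files = sorted(current - seen)
    seen |= current  # remember everything observed so far
    return new_files
```

Each application instance would call `poll_once` on a timer and then try to lock the directory of every new file it reports.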
I saw some old hdfs issues about directory locking and file locking but I
wasn't sure whether this functionality is available in the format I
described above.
I think my questions are:
1. Can I, easily, recursively poll an hdfs directory? (I'm looking into
hadoop-camel for this)
2. Can I, easily, lock an hdfs directory?
3. If the answer to 2 is no, will having nodes create a hostname.lock file
in an hdfs directory work as a manual locking mechanism?
4. Should I try to find a different tool for the job?
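
For question 3, this is a rough sketch of the lock-file idea, written against the local filesystem with Python's standard library. It relies on an exclusive create that fails if the file already exists; whether hdfs gives the same guarantee is part of what I'm asking (I assume something like `FileSystem.createNewFile` would be the analogous call). Note that the sketch uses one shared lock name and writes the hostname inside it, because per-host names such as hostname.lock would never collide with each other. `LOCK_NAME`, `try_lock` and `unlock` are hypothetical names.

```python
import os
import socket

LOCK_NAME = ".lock"  # hypothetical shared name, not an hdfs convention


def try_lock(directory, owner=None):
    """Return True if this caller created the lock file, False if it exists."""
    owner = owner or socket.gethostname()
    path = os.path.join(directory, LOCK_NAME)
    try:
        # O_CREAT | O_EXCL makes the create fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(owner)  # record the holder, useful when debugging stale locks
    return True


def unlock(directory):
    """Release the lock by deleting the lock file."""
    os.remove(os.path.join(directory, LOCK_NAME))
```

One weakness I can already see: if a node crashes while holding the lock, the file stays behind and something has to clean it up.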

I can of course try to find different tools like db locks and so on but I
fear the other solutions I've thought of don't scale well and are very

Would appreciate any feedback,

Ittai Zeidman
Server team leader, Fashion Traffic <http://fashiontraffic.com/>
