hadoop-hdfs-issues mailing list archives

From "Mikhail Yakshin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
Date Thu, 10 Mar 2011 20:49:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005326#comment-13005326
] 

Mikhail Yakshin commented on HDFS-1742:
---------------------------------------

I seriously doubt that making pubsub-like event transmission the *only* available option
is the way to go. The pubsub model is a cool thing, but a proper implementation of it requires a full-blown
messaging subsystem akin to those that implement [JMS|http://en.wikipedia.org/wiki/Java_Message_Service],
such as [ActiveMQ|http://activemq.apache.org/]. In turn, that means a whole other system, matching
Hadoop in complexity (it includes daemons, at least a JMS broker, and requires non-trivial
configuration and deployment), being installed and made mandatory by Hadoop.

The only thing I am arguing for is making this *modular*, i.e. making a JMS pubsub
producer *an option*, but not the *only* option. Other options might be simple local file
logging, sending events across the network, plugging in some local workflow management system,
etc.
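
A rough sketch of what such a modular design could look like. Every name here ({{FsEvent}}, {{EventSink}}, {{LocalFileSink}}, {{InMemorySink}}, {{EventDispatcher}}) is hypothetical and not an existing Hadoop API. The in-memory sink only stands in for a network or JMS producer, which would be one more implementation of the same interface.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical event record; not an existing Hadoop class.
final class FsEvent {
    final String type;
    final String path;
    FsEvent(String type, String path) { this.type = type; this.path = path; }
}

// The single extension point: every delivery mechanism implements this.
interface EventSink {
    void accept(FsEvent event) throws IOException;
}

// Option 1: simple local file logging, one tab-separated line per event.
final class LocalFileSink implements EventSink {
    private final Path logFile;
    LocalFileSink(Path logFile) { this.logFile = logFile; }

    @Override
    public synchronized void accept(FsEvent event) throws IOException {
        String line = event.type + "\t" + event.path + System.lineSeparator();
        Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}

// Option 2: keep events in memory (stand-in for a network or JMS producer).
final class InMemorySink implements EventSink {
    final List<FsEvent> received = new ArrayList<>();

    @Override
    public synchronized void accept(FsEvent event) { received.add(event); }
}

// Fans each event out to whichever sinks were enabled.
final class EventDispatcher {
    private final List<EventSink> sinks;
    EventDispatcher(List<EventSink> sinks) { this.sinks = new ArrayList<>(sinks); }

    // Returns the number of sinks that accepted the event.
    int publish(FsEvent event) {
        int delivered = 0;
        for (EventSink sink : sinks) {
            try {
                sink.accept(event);
                delivered++;
            } catch (IOException | RuntimeException e) {
                // A failing sink must not affect the others.
                System.err.println("sink failed: " + e);
            }
        }
        return delivered;
    }
}

final class SinkDemo {
    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("hdfs-events", ".log");
        List<EventSink> sinks = new ArrayList<>();
        sinks.add(new LocalFileSink(log));
        sinks.add(new InMemorySink());
        new EventDispatcher(sinks).publish(new FsEvent("DIR_CREATED", "/input/2011/03/10"));
        System.out.println("logged to " + log);
    }
}
```

With this shape, a JMS producer becomes just another {{EventSink}} that an admin may or may not enable, and nothing heavier than file logging is required by default.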

> Provide hooks / callbacks to execute some code based on events happening in HDFS (file
/ directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling
>
> We're working on a system that runs various Hadoop jobs continuously, based on the data
that appears in HDFS: for example, we have a job that works on a day's worth of data and creates
output in {{/output/YYYY/MM/DD}}. For input, it should wait for the directory with externally
uploaded data, {{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's output to
appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one of the possible solutions is polling once in a while for the files/directories
we're waiting for, but generally it's a bad solution. A better one is something like a file
alteration monitor or [inode activity notifiers|http://en.wikipedia.org/wiki/Inotify], such
as the ones implemented in Linux filesystems.
> The basic idea is that one can specify (inject) some code that will be executed on every
major event happening in HDFS, such as:
> * File created / opened
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see a simplistic implementation as follows: the NN defines some interfaces that implement
a callback/hook mechanism, i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> It might be possible to create a class that implements this interface and load it somehow
(for example, using an extra jar in the classpath) in the NameNode's JVM. The NameNode includes a configuration
option that specifies the names of such class(es); the NameNode then instantiates them and calls
their methods (in a separate thread) on every valid event.
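
The loading mechanism described above could be sketched roughly as follows. This is only an illustration under stated assumptions: {{CallbackRegistry}} and {{RecordingCallback}} are hypothetical names, a plain {{String}} path stands in for {{SomeFileInformation}}, and the comma-separated class list stands in for whatever configuration option the NameNode would actually expose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Simplified form of the proposed interface; String stands in for SomeFileInformation.
interface NameNodeCallback {
    void onFileCreate(String path);
    void onFileClose(String path);
    void onFileDelete(String path);
}

// Example plug-in (hypothetical) that just records what it was told.
final class RecordingCallback implements NameNodeCallback {
    static final List<String> SEEN = new CopyOnWriteArrayList<>();
    public RecordingCallback() {}
    @Override public void onFileCreate(String path) { SEEN.add("create:" + path); }
    @Override public void onFileClose(String path) { SEEN.add("close:" + path); }
    @Override public void onFileDelete(String path) { SEEN.add("delete:" + path); }
}

// Hypothetical registry: instantiates configured classes and calls them off-thread.
final class CallbackRegistry {
    private final List<NameNodeCallback> callbacks = new ArrayList<>();
    // A single worker thread keeps events in order and off the caller's thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // classNames: comma-separated class names, as a config option might provide.
    CallbackRegistry(String classNames) {
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) continue;
            try {
                Class<? extends NameNodeCallback> cls =
                        Class.forName(trimmed).asSubclass(NameNodeCallback.class);
                callbacks.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("cannot load callback " + trimmed, e);
            }
        }
    }

    private void dispatch(Consumer<NameNodeCallback> call) {
        for (NameNodeCallback cb : callbacks) {
            worker.execute(() -> {
                try {
                    call.accept(cb);
                } catch (RuntimeException e) {
                    // A misbehaving plug-in must not take down the dispatcher.
                    System.err.println("callback failed: " + e);
                }
            });
        }
    }

    void fireFileCreate(String path) { dispatch(cb -> cb.onFileCreate(path)); }
    void fireFileClose(String path) { dispatch(cb -> cb.onFileClose(path)); }
    void fireFileDelete(String path) { dispatch(cb -> cb.onFileDelete(path)); }

    // Stops accepting events and waits for queued ones to be delivered.
    boolean shutdownAndWait() {
        worker.shutdown();
        try {
            return worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

final class CallbackDemo {
    public static void main(String[] args) {
        CallbackRegistry registry = new CallbackRegistry(RecordingCallback.class.getName());
        registry.fireFileCreate("/input/2011/03/10/part-0");
        registry.fireFileClose("/input/2011/03/10/part-0");
        registry.shutdownAndWait();
        System.out.println(RecordingCallback.SEEN);
    }
}
```

The single worker thread is one possible design choice: it preserves event order and keeps slow plug-ins from blocking the thread that raised the event, at the cost of one slow callback delaying the others.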
> There would be a couple of ready-made pluggable implementations of such a class, which
would most likely be distributed as contrib. The default NameNode process would stay the same,
without any visible differences.
> Hadoop's JobTracker already extensively uses the same paradigm with pluggable Scheduler
interfaces, such as [Fair Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
[Capacity Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
[Dynamic Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
etc. It also uses classes that load and run inside the JobTracker's context; a few relatively
trusted varieties exist, and they are distributed as contrib and are purely optional, to be enabled
by the cluster admin.
> This would allow systems such as the one I described at the beginning to be implemented without
polling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
