hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1742) Provide hooks / callbacks to execute some code based on events happening in HDFS (file / directory creation, opening, closing, etc)
Date Thu, 10 Mar 2011 14:59:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005092#comment-13005092
] 

Allen Wittenauer commented on HDFS-1742:
----------------------------------------

The namenode, secondary nn, etc should never be running user code directly.  It won't scale
and it will introduce an incredible amount of instability.  

It would be much better if this was designed in such a way that it was a completely separate
process (or gang of processes).  This process could be fed by receiving the edits stream similar
to how Checkpoint and Backup nodes work today.

> Provide hooks / callbacks to execute some code based on events happening in HDFS (file
/ directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling
>
> We're working on a system that runs various Hadoop job continuously, based on the data
that appears in HDFS: for example, we have a job that works on day's worth of data and creates
output in {{/output/YYYY/MM/DD}}. For input, it should wait for directory with externally
uploaded data as {{/input/YYYY/MM/DD}} to appear, and also wait for previous day's data to
appear, i.e. {{/output/YYYY/MM/DD-1}}.
> Obviously, one of the possible solutions is polling once in a while for files/directories
we're waiting for, but generally it's a bad solution. The better one is something like file
alteration monitor or [inode activity notifiers|http://en.wikipedia.org/wiki/Inotify], such
as ones implemented in Linux filesystems.
> Basic idea is that one can specify (inject) some code that will be executed on every
major event happening in HDFS, such as:
> * File created / open
> * File closed
> * File deleted
> * Directory created
> * Directory deleted
> I see simplistic implementation as following: NN defines some interfaces that implement
callback/hook mechanism - i.e. something like:
> {code}
> interface NameNodeCallback {
>     public void onFileCreate(SomeFileInformation f);
>     public void onFileClose(SomeFileInformation f);
>     public void onFileDelete(SomeFileInformation f);
>     ...
> }
> {code}
> A user creates a class that implements this method and loads it somehow (for example,
using an extra jar in classpath) in NameNode's JVM. NameNode includes a configuration option
that specifies names of such class(es) - then NameNode instantiates them and calls methods
from them (in a separate thread) on every valid event happening.
> This would allow systems such as I've described in the beginning to be implemented without
polling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message