From: Guillaume Nodet
To: hdfs-dev@hadoop.apache.org
Date: Mon, 9 Jul 2012 15:24:45 +0200
Subject: OSGi and classloaders

I'm working with Jean-Baptiste to make hadoop work in OSGi. OSGi uses classloaders in a very specific way, which leads to several problems with hadoop. Let me quickly explain how OSGi works. In OSGi, you deploy bundles, which are jars with additional OSGi metadata. This metadata is used by the OSGi framework to create a classloader for the bundle. However, the classloaders are not organized in a tree as in a JEE environment, but rather in a kind of graph, where each classloader has limited visibility and limited exposure. This is controlled at the package level by specifying which packages are exported and which packages are imported by a given bundle.
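For example, the MANIFEST.MF of a hypothetical hadoop-common bundle could contain headers along these lines (the bundle name, packages and versions below are only illustrative, not what an actual build produces):

  Bundle-ManifestVersion: 2
  Bundle-SymbolicName: org.apache.hadoop.hadoop-common
  Bundle-Version: 2.0.0
  Export-Package: org.apache.hadoop.fs;version="2.0.0",
   org.apache.hadoop.conf;version="2.0.0"
  Import-Package: org.apache.commons.logging;version="[1.0,2)"

The framework wires each Import-Package to exactly one exporting bundle, which is what gives every classloader its limited visibility.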
This has two main consequences:
* OSGi does not handle split packages well, i.e. cases where the same package is exported by two different bundles
* a classloader does not have visibility on everything, as it would in a usual flat classpath environment or even a JEE-like one

The first problem arises, for example, with the org.apache.hadoop.fs package, which is split across the hadoop-common and hadoop-hdfs jars (the latter defines the Hdfs class). There may be other cases, but I haven't hit them yet. To solve this problem, it would be better if such classes were moved into a different package.

The second problem is much more complicated. I think most of the classloading is done from Configuration. However, Configuration has an internal classloader which is set by the constructor to the thread context classloader (defaulting to the Configuration class' classloader), and new Configuration objects are created everywhere in the code. In addition, creating new Configuration objects forces the configuration files to be parsed several times. Also, in OSGi, configuration is better done through the standard OSGi ConfigurationAdmin service, so it would be nice to integrate the configuration into ConfigAdmin when running in OSGi.

For the above reasons, I'd like to know what you would think of turning the Configuration object into a real singleton, or at least replacing the "new Configuration()" calls spread everywhere with access to a singleton via Configuration.getInstance(). This would allow the hadoop OSGi layer to manage the Configuration in a more OSGi-friendly way, allowing the use of a specific subclass which could better manage class loading in an OSGi environment and integrate with ConfigAdmin. This may also remove the need for keeping a registry of existing Configuration objects and having to update them when a default resource is added, for example. A sketch of what I have in mind follows my signature below.

Some of the above problems have been addressed in some way in HADOOP-7977, but the fixes I've been working on were more related to the hadoop 1.0.x branch and are not directly applicable to trunk.

One last point: the two problems above mainly stem from my assumption that the individual hadoop jars are transformed into native bundles. They would go away if we had a single bundle containing all the individual jars (as it was with hadoop-core-1.0.x), but having more fine-grained jars is better imho.

Thoughts welcomed.

--
------------------------
Guillaume Nodet
------------------------
Blog: http://gnodet.blogspot.com/
------------------------
FuseSource, Integration everywhere
http://fusesource.com
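P.S. To make the proposal a bit more concrete, here is a rough sketch of what the singleton could look like. This is only an illustration: getInstance() and setInstance() do not exist in org.apache.hadoop.conf.Configuration today, and the names are just suggestions.

  // Sketch only: "Configuration" stands for org.apache.hadoop.conf.Configuration;
  // getInstance()/setInstance() are proposed, not existing, API.
  public class Configuration {

      // The single shared instance; the OSGi layer could install a
      // subclass that delegates class loading to the bundle's
      // classloader and reads properties from ConfigAdmin.
      private static volatile Configuration instance;

      public static Configuration getInstance() {
          if (instance == null) {
              synchronized (Configuration.class) {
                  if (instance == null) {
                      instance = new Configuration();
                  }
              }
          }
          return instance;
      }

      // Lets an OSGi activator install an OSGi-aware subclass before
      // any client code asks for the instance.
      public static void setInstance(Configuration conf) {
          synchronized (Configuration.class) {
              instance = conf;
          }
      }

      protected Configuration() {
          // parse core-default.xml, core-site.xml, ... exactly once
      }
  }

Client code would then call Configuration.getInstance() instead of new Configuration(), so the configuration files are parsed once and the OSGi layer keeps full control over which subclass is actually used.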