A few months back I was wondering about the same topic. Unfortunately, dependency management is not the strong suit of Hadoop-related libraries, and that includes ORC. Our project got to the point where we considered forking ORC and creating our own version of it, because we want to use it outside Hadoop. Hadoop-related code is all over the place, though, so we decided to just exclude a bunch of libraries, and we ended up with a pom.xml like this:


Keep in mind this is an older version of ORC, the one included in the Hive 1.2.1 release. I also started working on a project to make Hadoop dependencies easier to deal with, but we dropped it altogether.

I think the reasonable approach would be to have libraries like ORC at the bottom of the dependency stack (orc-core), and to create separate libraries that provide an interface for Hadoop or any other project that wants to use this file format (orc-hadoop, orc-something, etc.). That way we would avoid the dependency hell you can see in projects like ORC today. I am not sure who else is interested in such a project, but if you are, I think I could provide some development time.
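
To make the layering idea concrete, here is a rough sketch in Java. Every name in it (OrcStorage, LocalOrcStorage) is hypothetical and made up for illustration; none of this exists in ORC today. The point is only that orc-core would define a small storage interface with no Hadoop imports, and each environment would plug in its own implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical interface that would live in orc-core.
// It deliberately references no Hadoop classes.
interface OrcStorage {
    // Opens the file at the given location for reading.
    InputStream open(String location) throws IOException;

    // Returns the size of the file in bytes.
    long length(String location) throws IOException;
}

// Hypothetical plain-JVM implementation for use outside Hadoop,
// backed by java.nio.file only.
final class LocalOrcStorage implements OrcStorage {
    @Override
    public InputStream open(String location) throws IOException {
        return Files.newInputStream(Paths.get(location));
    }

    @Override
    public long length(String location) throws IOException {
        return Files.size(Paths.get(location));
    }
}

// A separate orc-hadoop module would provide another OrcStorage backed by
// org.apache.hadoop.fs.FileSystem. It is left out here so that the sketch
// compiles without Hadoop on the classpath.
```

With something like this, only the orc-hadoop module would need hadoop-common, and a standalone user would depend on orc-core alone.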

Owen was really helpful with these efforts. See more here:
https://issues.apache.org/jira/browse/ORC-151
https://github.com/apache/orc/pull/96


On Wed, Jan 17, 2018 at 6:16 PM, Jeff Evans <jeffrey.wayne.evans@gmail.com> wrote:

I am a software engineer with StreamSets, and am working on a project
to incorporate ORC support into our product.  The first phase of this
will be to support Avro to ORC conversion. (I saw a post on this topic
to this list a couple months ago, before I joined.  Would be happy to
share more details/code for scrutiny once it's closer to completion.)

One issue I'm running into is the dependency of orc-core on
hadoop-common.  Our product can be deployed in a variety of Hadoop
distributions from different vendors, and also standalone (i.e. not in
Hadoop at all).  Therefore, this dependency makes it difficult for us
to incorporate orc-core in a central way in our codebase (since the
vendor typically provides this jar in their installation).  Besides
that, hadoop-common also brings in a number of other problematic
dependencies for us (the deprecated com.sun.jersey group for Jersey
and zookeeper, to name a couple).

Does anyone have suggestions for how to work around this?  It seems
the only actual classes I reference are the same ones referenced in
the core-java tutorial (org.apache.hadoop.conf.Configuration and
org.apache.hadoop.fs.Path), although obviously the library may be
making use of more itself.  Are there any plans to remove the
dependency on Hadoop down the line, or should I accommodate this by
shuffling our dependencies such that our code only lives in a
Hadoop-provided packaging configuration?  Any insight is appreciated.

