ignite-user mailing list archives

From Haithem Turki <turki.hait...@gmail.com>
Subject IGFS YARN setup
Date Thu, 26 May 2016 21:56:14 GMT

I'm interested in using IGFS as a Hadoop caching layer. The use case
revolves largely around Spark jobs running on a YARN cluster that persist
data to S3 (although I have some non-Spark stuff running too, so I would
ideally integrate at the Hadoop filesystem layer). I'm excited about the
potential speedups that this could bring :)

I took a stab at deploying this for the first time, and had some questions:

- Ideally I was envisioning deploying nodes via YARN to take advantage of
dynamic scaling and use any available memory on the cluster. I wanted to
make sure that this is indeed a supported workflow / on the roadmap, as I
hit a few bumps along the way:
* I ended up needing to dump pretty much all of my Hadoop-related jars to
HDFS for my nodes to start up correctly (otherwise I was getting
ClassNotFoundExceptions for classes ranging from guava to hadoop to asm to
ignite). Am I doing something horribly wrong / have you
guys considered packaging a fat jar for the non-Hadoop dependencies at least?
* I couldn't specify the YARN queue despite attempting to
set -Dmapreduce.job.queuename via the IGNITE_JVM_OPTS variable (
* It seems like dynamic allocation isn't supported? I wanted to get a sense
of whether this is on the roadmap.
* Since YARN allocates containers at random, it's pretty onerous to figure
out which hostnames have Ignite nodes running on them and to specify those
in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port
conflicts if multiple nodes are running on the same machine), and I guess I
can set up a reverse proxy so that I can point to a stable URL, but
that's not great and doesn't scale well. Are there other
suggestions on how to configure discovery (maybe spin up a local node
outside of YARN that leverages the cluster discovery)?
* I also wasn't clear on how cluster routing/balancing works. If I point
my Hadoop jobs at host1:10500 via TCP, will all reads/writes route
through that node, or do the reads/writes somehow get balanced?
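
To make the hostname problem above concrete, here is a minimal sketch (not
something Ignite or YARN provides) that probes a list of candidate hosts for
an open IGFS TCP endpoint and builds a filesystem URI from whichever ones
respond. The function names, the candidate host list, and the
igfs://name@host:port/ URI shape are assumptions for illustration; 10500 is
the port mentioned above.

```python
import socket


def find_open_endpoints(hosts, port=10500, timeout=1.0):
    """Return the hosts (in input order) that accept a TCP connection on `port`.

    Hypothetical helper: it only checks that something is listening, not
    that the listener is actually an Ignite node.
    """
    reachable = []
    for host in hosts:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(host)
        except OSError:
            continue
    return reachable


def igfs_uri(host, port=10500, igfs_name="igfs"):
    """Build an IGFS URI; the name@host:port layout is an assumed format."""
    return f"igfs://{igfs_name}@{host}:{port}/"
```

Calling find_open_endpoints with the cluster's worker hostnames and passing
the first result to igfs_uri would give a URI to hand to a job, though it
would still go stale whenever YARN moves the containers, which is why a
discovery-based answer would be preferable.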

Or is this completely crazy / should I just deploy IGFS outside of YARN?

- Is there a way of configuring the local filesystem as a tiered storage
layer (or is it on the roadmap)? The use case is that even reading from an
SSD is much faster than reading from S3.

Thanks in advance!
- Haithem
