From: "glashammer" <glashammer@hotmail.com>
To: hadoop-user@lucene.apache.org
Subject: RE: design question - multiple artifact sets and privacy
Date: Mon, 19 Nov 2007 23:54:20 +0100
Thanks Ted,

I was thinking something like what you suggest with "running multiple
clusters", and possibly making all the "global public data" read-only; all
writes would go into disk space owned/controlled by the client - i.e. each
client would run their own map/reduce cluster but use the global data as
"precalculated input".

Regards
- henrik

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com]
Sent: måndag 19 november 2007 18:53
To: hadoop-user@lucene.apache.org
Subject: Re: design question - multiple artifact sets and privacy

Hadoop currently has no sense of user-private data. I believe that this
capability is under development, but I don't know what the timeline for
completion is (if any). You should not expect the capabilities to be fully
usable when they are first released.

In spite of this lack, I think you could meet your requirements in a few
different ways. For instance, you could emulate user isolation by running
multiple Hadoop clusters on the same machines under different uids.
Virtualization would be another option, but I would guess that you would
rather share disk space, so that large clusters like your public data set
could expand easily to nearly all of the available disk, and so that
clusters creating large numbers of temporary files could use more disk
space.

These approaches only address user isolation, not the general problem of
user permissions. In particular, the idea that you might have some public
files and some private files accessed by the same program is not addressed.
This sounds much more difficult than it actually is.
There is a single command that can be used to launch a cluster, and all of
the instances could share configuration.

Another approach would be to use document encryption and build a custom
input format. This is very easy to do. You would leave public files in
plain text and encrypt private files with customer-specific keys. That
way, programs accessing private files could only access the files that you
want them to. The standard encryption utilities available in Java are fast
enough that you shouldn't see a major speed penalty. We use AES for a lot
of our input files, and while I would prefer a plain-text format for
speed, our systems go at a respectable pace.

You are right, btw, that your problem sounds ideally suited for
map-reduce. I would recommend that you batch your documents many to a
single file. That will massively improve throughput, since you avoid many
seek times.

On 11/19/07 8:22 AM, "glashammer" wrote:

> Hi,
> I am currently working on a system design and I am interested in hearing
> some ideas on how hadoop/hbase can be used to solve a couple of tricky
> issues.
>
> Basically I have a data set consisting of roughly 1-5 million documents,
> growing at a rate of about 100-500 thousand a year. Each document is not
> of significant size (say 100k). Documents describe dependencies that are
> analyzed, intermediate results are cached as new documents, and
> documents and intermediate results are indexed. Intermediate results are
> about a factor of 3 times the number of documents. This seems like a
> perfect thing to use map reduce for, handling analytics and indexing,
> both on an initial set and when changes occur.
>
> My question regards being able to handle multiple sets of artifacts, and
> privacy.
>
> I would like to be able to handle multiple sets of private documents and
> private intermediate results.
> When analyzing and indexing these, the analysis must run on servers that
> are private to the client, and the documents and partial results must
> naturally also be stored in a fashion that is private and secure from
> our client's point of view.
> A client would have a lot less private data than what is stored in our
> global and publicly available data set.
>
> Interested in hearing ideas regarding how this can be done.
>
> Regards
> - henrik
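
[Editor's note: Ted's suggestion of encrypting private files with customer-specific keys can be sketched using only the standard javax.crypto APIs he alludes to. This illustrates just the encryption side, not the Hadoop glue (a real deployment would wrap decryption in a custom InputFormat/RecordReader); the class and method names below are invented for the example:]

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesRecordCodec {

    // Encrypt one record with a per-customer AES key.
    // A fresh random IV is generated per record and prepended to the
    // ciphertext, so each record can be decrypted independently.
    static byte[] encrypt(byte[] plain, SecretKey key) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[16 + ct.length];
        System.arraycopy(iv, 0, out, 0, 16);
        System.arraycopy(ct, 0, out, 16, ct.length);
        return out;
    }

    // Read back the IV from the first 16 bytes, then decrypt the rest.
    static byte[] decrypt(byte[] blob, SecretKey key) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key,
               new IvParameterSpec(Arrays.copyOfRange(blob, 0, 16)));
        return c.doFinal(blob, 16, blob.length - 16);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();    // one key per customer
        byte[] record = "private customer document".getBytes("UTF-8");
        byte[] sealed = encrypt(record, key);
        byte[] opened = decrypt(sealed, key);
        System.out.println(new String(opened, "UTF-8"));
        // prints: private customer document
    }
}
```

[Prepending a random IV per record is a common convention, not something the thread specifies; programs holding only their own customer's key simply cannot read other customers' files, which is the access-control property Ted describes.]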
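
[Editor's note: Ted's advice to batch many small documents into a single file is typically done in Hadoop with SequenceFile; as a dependency-free sketch of the same idea, here is a minimal length-prefixed packing scheme. The class and method names are invented for illustration:]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocumentPacker {

    // Pack many small documents into one byte stream:
    // each record is a 4-byte big-endian length followed by the bytes.
    static byte[] pack(List<byte[]> docs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (byte[] d : docs) {
            out.writeInt(d.length);
            out.write(d);
        }
        out.flush();
        return bos.toByteArray();
    }

    // Recover the individual documents by walking the length prefixes.
    static List<byte[]> unpack(byte[] packed) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(packed));
        List<byte[]> docs = new ArrayList<byte[]>();
        while (in.available() > 0) {
            byte[] d = new byte[in.readInt()];
            in.readFully(d);
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> docs =
            Arrays.asList("doc one".getBytes(), "doc two".getBytes());
        List<byte[]> back = unpack(pack(docs));
        System.out.println(new String(back.get(1)));  // prints: doc two
    }
}
```

[Reading one large packed file sequentially replaces millions of per-document opens and seeks with a single streaming scan, which is the throughput win Ted describes; SequenceFile adds sync markers and compression on top of the same principle.]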