From: "glashammer" <glashammer@hotmail.com>
To: hadoop-user@lucene.apache.org
Subject: RE: design question - multiple artifact sets and privacy
Date: Mon, 19 Nov 2007 23:54:20 +0100
Thanks Ted,

I was thinking something like what you suggest with "running multiple
clusters", and possibly making all the "global public data" read-only; all
writes would go into disk space owned/controlled by the client - i.e. each
client would run their own map/reduce cluster but use the global data as
"precalculated input".

Regards
- henrik

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com]
Sent: måndag 19 november 2007 18:53
To: hadoop-user@lucene.apache.org
Subject: Re: design question - multiple artifact sets and privacy

Hadoop currently has no sense of user-private data. I believe that this
capability is under development, but I don't know what the timeline for
completion is (if any). You should not expect the capabilities to be fully
usable when they are first released.

In spite of this lack, I think you could meet your requirements in a few
different ways. For instance, you could emulate user isolation by running
multiple Hadoop clusters on the same machines under different uids.
Virtualization would be another option, but I would guess that you would
rather share disk space, so that large clusters like your public data set
could expand easily to nearly all of the available disk, and so that
clusters creating large numbers of temporary files could use more disk
space.

These approaches only address user isolation, not the general problem of
user permissions. In particular, the idea that you might have some public
files and some private files accessed by the same program is not addressed.
This sounds much more difficult than it actually is.
There is a single command that can be used to launch a cluster, and all of
the instances could share configuration.

Another approach would be to use document encryption and build a custom
input format. This is very easy to do. You would leave public files in
plain text and encrypt private files with customer-specific keys. That
way, programs accessing private files could only access the files that you
want them to. The standard encryption utilities available in Java are fast
enough that you shouldn't see a major speed penalty. We use AES for a lot
of our input files, and while I would prefer a plain-text format for
speed, our systems go at a respectable pace.

You are right, btw, that your problem sounds ideally suited for
map-reduce. I would recommend that you batch your documents many to a
single file. That will massively improve throughput, since you avoid many
seek times.

On 11/19/07 8:22 AM, "glashammer" wrote:

> Hi,
> I am currently working on a system design and I am interested in hearing
> some ideas on how hadoop/hbase can be used to solve a couple of tricky
> issues.
>
> Basically I have a data set consisting of roughly 1-5 million documents,
> growing at a rate of about 100-500 thousand a year. Each document is not
> of significant size (say 100k). Documents describe dependencies that are
> analyzed, intermediate results are cached as new documents, and
> documents and intermediate results are indexed. Intermediate results are
> about a factor of 3 times the number of documents. This seems like a
> perfect thing to use map reduce for, handling analytics and indexing,
> both on an initial set and when changes occur.
>
> My question regards being able to handle multiple sets of artifacts, and
> privacy.
>
> I would like to be able to handle multiple sets of private documents and
> private intermediate results.
> When analyzing and indexing these, the analysis must run on servers that
> are private to the client, and the documents and partial results must
> naturally also be stored in a fashion that is private and secure from
> our client's point of view.
> A client would have a lot less private data than what is stored in our
> global and publicly available data set.
>
> Interested in hearing ideas regarding how this can be done.
>
> Regards
> - henrik
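
[Editor's note: Ted's suggestion of encrypting private files with customer-specific keys can be sketched using only the standard javax.crypto APIs he alludes to. This illustrates just the encryption side, not the Hadoop glue (a real deployment would wrap decryption in a custom InputFormat/RecordReader); the class and method names below are invented for the example:]

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesRecordCodec {

    // Encrypt one record with a per-customer AES key.
    // A fresh random IV is generated per record and prepended to the
    // ciphertext, so each record can be decrypted independently.
    static byte[] encrypt(byte[] plain, SecretKey key) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[16 + ct.length];
        System.arraycopy(iv, 0, out, 0, 16);
        System.arraycopy(ct, 0, out, 16, ct.length);
        return out;
    }

    // Read back the IV from the first 16 bytes, then decrypt the rest.
    static byte[] decrypt(byte[] blob, SecretKey key) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key,
               new IvParameterSpec(Arrays.copyOfRange(blob, 0, 16)));
        return c.doFinal(blob, 16, blob.length - 16);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();    // one key per customer
        byte[] record = "private customer document".getBytes("UTF-8");
        byte[] sealed = encrypt(record, key);
        byte[] opened = decrypt(sealed, key);
        System.out.println(new String(opened, "UTF-8"));
        // prints: private customer document
    }
}
```

[Prepending a random IV per record is a common convention, not something the thread specifies; programs holding only their own customer's key simply cannot read other customers' files, which is the access-control property Ted describes.]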
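
[Editor's note: Ted's advice to batch many small documents into a single file is typically done in Hadoop with SequenceFile; as a dependency-free sketch of the same idea, here is a minimal length-prefixed packing scheme. The class and method names are invented for illustration:]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocumentPacker {

    // Pack many small documents into one byte stream:
    // each record is a 4-byte big-endian length followed by the bytes.
    static byte[] pack(List<byte[]> docs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (byte[] d : docs) {
            out.writeInt(d.length);
            out.write(d);
        }
        out.flush();
        return bos.toByteArray();
    }

    // Recover the individual documents by walking the length prefixes.
    static List<byte[]> unpack(byte[] packed) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(packed));
        List<byte[]> docs = new ArrayList<byte[]>();
        while (in.available() > 0) {
            byte[] d = new byte[in.readInt()];
            in.readFully(d);
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> docs =
            Arrays.asList("doc one".getBytes(), "doc two".getBytes());
        List<byte[]> back = unpack(pack(docs));
        System.out.println(new String(back.get(1)));  // prints: doc two
    }
}
```

[Reading one large packed file sequentially replaces millions of per-document opens and seeks with a single streaming scan, which is the throughput win Ted describes; SequenceFile adds sync markers and compression on top of the same principle.]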