From: Aji Janis
To: user@accumulo.apache.org
Date: Sun, 3 Mar 2013 14:14:49 -0500
Subject: Re: Mapreduce, Indexing and Logging

John and Ed, thank you both for your responses.

Using Solr for search is a requirement. When we process data there's quite a bit of information we are interested in indexing (dates, locations, etc.), and we use Solr for that. All the data will be stored in Accumulo after processing and then indexed in Solr. But since I am trying to do all the processing in MapReduce, I was interested in hearing about any limitations there might be if N (>= 60) mappers or reducers try to put things in Solr after processing and before writing to Accumulo.



On Sun, Mar 3, 2013 at 12:32 PM, Ed Kohlwey <ekohlwey@gmail.com> wrote:

With respect to indexing, what are you trying to achieve? I have not used Solr with Accumulo, but I have done indexing directly in Accumulo, leveraging Lucene libraries as appropriate. You can get very good performance specific to your domain by doing so, and it's less O&M overhead. Of course, you then need to learn all about indexing, so there's a bit of a tradeoff.
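
To make the idea of indexing directly in a sorted key-value store a little more concrete: one common layout is an inverted index, where the term is the row and the document ID goes in the column qualifier, so looking up a term becomes a range scan over one row. The sketch below only models that layout with a plain `TreeMap`. It does not use the Accumulo or Lucene APIs, and the class name `InvertedIndexSketch`, the key separator, and the sample documents are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // Separator between term and document ID inside a key (an assumption of this
    // model; a real index table would use row = term, column qualifier = docId).
    private static final char SEP = '\0';

    // Sorted map standing in for a sorted index table.
    private final TreeMap<String, String> index = new TreeMap<>();

    /** Tokenizes the text naively and writes one index entry per (term, docId). */
    public void indexDocument(String docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                index.put(token + SEP + docId, "");
            }
        }
    }

    /** Range-scans all keys for one term and returns the matching document IDs. */
    public List<String> lookup(String term) {
        String start = term.toLowerCase() + SEP;
        String end = term.toLowerCase() + (char) (SEP + 1);
        List<String> docIds = new ArrayList<>();
        for (String key : index.subMap(start, true, end, false).keySet()) {
            docIds.add(key.substring(start.length()));
        }
        return docIds;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.indexDocument("doc1", "Meeting in Boston on 2013-03-02");
        idx.indexDocument("doc2", "Boston office opening");
        System.out.println(idx.lookup("boston"));
    }
}
```

Because the keys are sorted, every entry for a term is contiguous, which is what makes the lookup a single range scan. The tokenizer here is deliberately crude; this is where an analysis library such as Lucene would normally come in.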

On Mar 2, 2013 12:50 PM, "John Vines" <vines@apache.org> wrote:
1. This is quite variable. It depends on your hardware specs, primarily CPU and disk throughput. It also depends on how your system is configured for these resources and on your typical mutation size. How your mutations are distributed is another factor.
2. Under the hood, the output format uses a BatchWriter. There is a guarantee that once a flush comes back from the BatchWriter, the data is available. Unless flush is called explicitly, the BatchWriter will flush whenever half of its capacity is full, or when idle for a short period (I want to say 3 seconds, but I could be mistaken).
3. If the two mutations don't intersect at all, then there's no issue. If they have identical columns, then whichever one has the newest timestamp will come up first. If you are explicitly setting timestamps or they arrive at the same time, the outcome is non-deterministic.
4. I'm going to defer this question to someone else.
5. Ideally each datanode should be a tserver, and each will also be a tasktracker. This helps ensure data locality so you can avoid network boundaries and overhead.
5. I don't see why not. There are a few log4j statements in the Accumulo client, so it would actually make it easier for you to deal with them there too.
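
The rule in point 3 can be modelled in a few lines. The sketch below is a toy stand-in, not the Accumulo API: `Version` and `VersionResolutionSketch` are hypothetical names, and the code only mimics the behaviour described above, namely that among writes to the same row and column the one with the newest timestamp is what a reader sees first. When two different values share the newest timestamp, the model reports "no defined winner" instead of picking one, to mirror the non-deterministic case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class VersionResolutionSketch {
    /** One write to a single row/column cell (hypothetical model type). */
    public static final class Version {
        public final long timestamp;
        public final String value;

        public Version(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Returns the value with the newest timestamp. Returns empty when there are
     * no writes, or when different values share the newest timestamp (a tie
     * that this model refuses to resolve).
     */
    public static Optional<String> visibleValue(List<Version> writes) {
        long newest = Long.MIN_VALUE;
        for (Version v : writes) {
            newest = Math.max(newest, v.timestamp);
        }
        String winner = null;
        for (Version v : writes) {
            if (v.timestamp == newest) {
                if (winner != null && !winner.equals(v.value)) {
                    return Optional.empty(); // same timestamp, different values
                }
                winner = v.value;
            }
        }
        return Optional.ofNullable(winner);
    }

    public static void main(String[] args) {
        List<Version> writes = Arrays.asList(
                new Version(10L, "from-mapper-A"),
                new Version(20L, "from-mapper-B"));
        System.out.println(visibleValue(writes));
    }
}
```

In practice this suggests that if several mappers or reducers may write the same row and column, either let the server assign timestamps or make sure explicitly set timestamps cannot collide.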

John


On Sat, Mar 2, 2013 at 3:11 PM, Aji Janis <aji1705@gmail.com> wrote:
Hello,

I am investigating how well Accumulo will handle MapReduce jobs. I am interested in hearing about any known issues from anyone running MapReduce with Accumulo as their source and sink. Specifically, I want to hear your thoughts about the following:

Assume the cluster has 50 nodes.
Accumulo is running on three nodes.
Solr is on three nodes.


1. How many concurrent mutations can Accumulo handle? More details on how this works would be extremely helpful.
2. Is there a delay between when MapReduce writes data to a table and when the data is available for reading?
3. How are concurrent mutations to the same row handled (say, from different mappers/reducers), since Accumulo isn't transactional?
4. I am trying to index some Accumulo data in Solr. Are there any known issues on the Accumulo end? On the Solr end? How does one shard vs. multiple shards affect the MR job?
5. Should I have more Accumulo/Solr nodes (i.e., an instance on each node in the cluster)? Is that necessary? Are there workarounds?
5. Normally I have log4j statements all over the Java job. Can I still use them with MapReduce?


I apologize if any of these questions do not belong on this mailing list (and please point me to where I can ask them, if possible). I am trying to gather a lot of information to decide whether this is a good approach for me and what level of effort is needed, so I realize these are a lot of questions. I very much appreciate any and all feedback. Thank you in advance for your time!


