accumulo-user mailing list archives

From Aaron Cordova <aa...@cordovas.org>
Subject Re: Accumulo Configuration Question
Date Thu, 31 Jan 2013 15:57:41 GMT

On Jan 31, 2013, at 10:19 AM, "Parker, Matthew - IS" <Matthew.Parker@exelisinc.com>
wrote:

> TWIMC:
> 
> I'm new to Accumulo and I've been trying to come up with a good architecture for a 20-node
cluster. I have been running a map/reduce program, and it encounters issues when it comes
to running the Accumulo section of the code. Once the job's completion rate exceeds 93%, it
starts dropping tens of tasks, because they eventually time out. The completion
rate drops back down, but the job eventually finishes. I suspect it's due to the
way I have the system configured, and I wanted to get some feedback on the generally
preferred architecture when installing Accumulo.

Whether tasks time out can depend on the data and the reduce logic as well as the configuration.
Are the tasks timing out in the reduce phase?

Also, do you notice that it's the same tasktrackers that experience timeouts?
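
One way to check this is to tally the timed-out attempts per host. The following is only a
minimal sketch: it assumes you have already collected (attempt ID, host) pairs for the
timed-out attempts, for example by hand from the JobTracker web UI. The record layout, the
attempt IDs, and the host names are made up for illustration and are not a Hadoop log format.

```python
from collections import Counter


def timeouts_per_host(timed_out_attempts):
    """Count timed-out task attempts per tasktracker host.

    `timed_out_attempts` is an iterable of (attempt_id, host) pairs.
    This layout is an assumption for the sketch, not a Hadoop format.
    """
    return Counter(host for _attempt_id, host in timed_out_attempts)


# Hypothetical sample data: replace with the attempts from your own job.
attempts = [
    ("attempt_r_000012_0", "node07"),
    ("attempt_r_000031_0", "node07"),
    ("attempt_r_000044_0", "node13"),
    ("attempt_r_000051_1", "node07"),
]

counts = timeouts_per_host(attempts)

# List hosts from most to fewest timeouts.
for host, n in counts.most_common():
    print(host, n)
```

If a handful of hosts account for most of the timeouts, it is probably worth looking at
those machines (disks, memory pressure, swapping) before changing cluster-wide settings.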

Finally, are you doing MapReduce from HDFS to HDFS, or are you reading from or writing to
Accumulo tables? You alluded to an Accumulo section of your code. Are you reading from and writing
to HDFS but doing scans/lookups/inserts against Accumulo from your mappers or reducers?

> Since you have the choice of installing hdfs, map/reduce, and tablet servers on any three,
the general guideline is to install two per machine (data node and tablet server, or data node and
map/reduce) as per the Hardware section in the Administration documentation.
> 
> http://accumulo.apache.org/1.4/user_manual/Administration.html#Hardware
> 
> Does that mean you have one large group of data nodes installed on the majority
of the cluster, or are they somehow split into two groups such that map/reduce and hdfs
run on one set of nodes, and Accumulo tablet servers and hdfs use another?

You can certainly use one large group of HDFS data nodes; MapReduce and Accumulo will both
work fine on top of it. Also, depending on your hardware, you can run all three processes on each node.
You just want to make sure each process has enough RAM and CPU.
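
As a rough illustration of the "enough RAM" check, here is a sketch that adds up the memory
committed to each process on one node and compares it with physical RAM. Every number below
is a placeholder assumption, not a recommendation, and the function and parameter names are
invented for this example; substitute the heap sizes and slot counts from your own
configuration.

```python
def node_memory_budget_mb(node_ram_mb, daemon_heaps_mb, task_slots,
                          child_heap_mb, os_reserve_mb):
    """Return (committed_mb, headroom_mb) for one worker node.

    committed = daemon heaps + task_slots * per-task heap + OS reserve
    headroom  = physical RAM - committed
    """
    committed = (sum(daemon_heaps_mb.values())
                 + task_slots * child_heap_mb
                 + os_reserve_mb)
    return committed, node_ram_mb - committed


# Placeholder values for a node running all three processes.
daemons = {"datanode": 1024, "tasktracker": 1024, "tablet_server": 4096}

committed, headroom = node_memory_budget_mb(
    node_ram_mb=16384,
    daemon_heaps_mb=daemons,
    task_slots=6,        # map + reduce slots on this node (assumed)
    child_heap_mb=1024,  # heap per task JVM (assumed)
    os_reserve_mb=2048,  # left for the OS and page cache (assumed)
)
print(committed, headroom)
```

A negative headroom would mean the node is overcommitted and may start swapping, which is
one plausible way for tasks to stall long enough to time out.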

If you want to keep Accumulo I/O somewhat isolated from MapReduce, you can control the placement
of HDFS block replicas to a certain degree, which gives you more independence of failures and I/O.
Of course, a MapReduce job that reads from or writes to Accumulo will still consume resources on
the Accumulo side.


> I was wondering whether people would comment on what a working configuration might look
like?
> 
> TIA,
> 
> Matt 
> 
> 


