spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Spark configuration with 5 nodes
Date Thu, 17 Mar 2016 12:28:03 GMT
Thanks Steve,

For the NN it all depends on how fast you want start-up to be. Roughly 1GB of NameNode
memory accommodates around 42TB, so if you are talking about 100GB of NN
memory (on the order of 4PB of raw storage) then SSD may make sense to speed up the
start-up. RAID 10 is the best one can get, assuming all internal disks.

In general it is also suggested that the fsimage is copied across to an NFS-mounted
directory shared between the primary and the fail-over NameNode, in case of an issue.
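
As a rough sketch of that (the directory paths here are just placeholders), the extra
copy can be achieved by listing the NFS mount alongside the local RAID-backed directory
in dfs.namenode.name.dir; the NameNode then writes its fsimage and edits to both:

  <!-- hdfs-site.xml: NN metadata is replicated to every directory listed -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
  </property>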

Cheers

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 March 2016 at 12:02, Steve Loughran <stevel@hortonworks.com> wrote:

>
> On 11 Mar 2016, at 16:25, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Hi Steve,
>
> My argument has always been that if one is going to use Solid State Disks
> (SSD), it makes sense to use them for the NN disks, for start-up from the
> fsimage etc. Obviously the NN lives in memory. Would you also recommend RAID 10
> (mirroring & striping) for the NN disks?
>
>
> I don't have any suggestions there, sorry. That said, NN disks do need to
> be RAIDed for protection against corruption, as they don't have the
> cross-cluster replication. They matter.
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 March 2016 at 10:42, Steve Loughran <stevel@hortonworks.com> wrote:
>
>>
>> On 10 Mar 2016, at 22:15, Ashok Kumar <ashok34668@yahoo.com.INVALID> wrote:
>>
>>
>> Hi,
>>
>> We intend to use 5 servers which will be utilized for building a Big Data
>> Hadoop data warehouse system (not using any proprietary distribution like
>> Hortonworks, Cloudera or others).
>>
>>
>> I'd argue that life is simpler with either of these, or bigtop+ambari
>> built up yourself, for the management and monitoring tools more than
>> anything else. Life is simpler if there's a web page of cluster status.
>> But: DIY teaches you the internals of how things work, which is good for
>> getting your hands dirty later on. Just start to automate things from the
>> outset, keep configs under SCM, etc. And decide whether or not you want to
>> go with Kerberos (==secure HDFS) from the outset. If you don't, put your
>> cluster on a separate isolated subnet. You ought to have the boxes on a
>> separate switch anyway if you can, just to avoid network traffic hurting
>> anyone else on the switch.
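>>
>> (A minimal sketch of that Kerberos toggle, if you do go that route; getting the
>> principals and keytabs in place is a separate exercise:)
>>
>>   <!-- core-site.xml: switch from simple to Kerberos authentication -->
>>   <property>
>>     <name>hadoop.security.authentication</name>
>>     <value>kerberos</value>
>>   </property>
>>   <property>
>>     <name>hadoop.security.authorization</name>
>>     <value>true</value>
>>   </property>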
>>
>> All servers are configured with 512GB RAM, 30TB of storage and 16 cores,
>> running Ubuntu Linux. Hadoop will be installed on all the servers/nodes.
>> Server 1 will be used for the NameNode plus a DataNode as well. Server 2 will
>> be used for the standby NameNode & a DataNode. The rest of the servers will be
>> used as DataNodes.
>>
>>
>>
>> 1. Make sure you've got the HDFS/NN space allocation on the two NNs set
>> up so that HDFS blocks don't get in the way of the NN's needs (which
>> ideally should be on a separate disk with RAID turned on); see the config
>> sketch after this list.
>> 2. Same for the worker nodes; temp space matters.
>> 3. On a small cluster, the cost of a DN failure is more significant: a
>> bigger fraction of the data will go offline, recovery bandwidth is limited to
>> the 4 remaining boxes, etc. Just be aware of that: in a bigger cluster,
>> losing a single server is usually less traumatic. Though HDFS-599 shows
>> that even Facebook can lose a cluster or two.
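>>
>> A rough sketch of point 1 (the paths are just placeholders): keep the NN
>> metadata directory off the disks that hold HDFS blocks, and reserve some
>> headroom on each DataNode volume for non-HDFS use:
>>
>>   <!-- hdfs-site.xml -->
>>   <property>
>>     <name>dfs.datanode.data.dir</name>
>>     <value>/data/1/dfs/dn,/data/2/dfs/dn</value>   <!-- block storage only -->
>>   </property>
>>   <property>
>>     <name>dfs.datanode.du.reserved</name>
>>     <value>21474836480</value>   <!-- keep ~20 GB per volume free for non-HDFS data -->
>>   </property>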
>>
>> Now we would like to install Spark on each server to create a Spark
>> cluster. Is that a good thing to do, or should we buy additional hardware
>> for Spark (minding cost here), or do we simply require additional memory to
>> accommodate Spark as well? In that case, how much memory would you recommend
>> for each Spark node?
>>
>>
>> You should be running your compute work on the same systems as the data,
>> as it's the "hadoop cluster way"; locality of data ==> performance. If you
>> were to buy more hardware, go for more store+compute, rather than just
>> compute.
>>
>> Spark likes RAM for sharing results; less RAM == more problems. But you
>> can buy extra RAM if you need it, provided you've got space in the servers
>> to put it in. Same for storage.
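>>
>> As a very rough, hypothetical starting point on boxes of that size (the numbers
>> are placeholders to be tuned, and assume Spark running on YARN):
>>
>>   # spark-defaults.conf: leave most of the 512 GB for HDFS, YARN and the OS
>>   spark.executor.memory                32g
>>   spark.executor.cores                 4
>>   spark.yarn.executor.memoryOverhead   4096   # MB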
>>
>> Do make sure that you have ECC memory; there are some papers from Google
>> and Microsoft on that topic if you want links to the details. Without ECC
>> your data will be corrupted *and you won't even know*.
>>
>> -Steve
>>
>>
>>
>
>
