hadoop-mapreduce-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Guidelines for production cluster
Date Fri, 30 Nov 2012 00:00:11 GMT
Thanks again Gaurav. At least 1 file will be read at a time. The file is the
atomic unit, and each of these binary files can go up to 1 TB in size.
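Since the thread establishes that each file is divided into fixed-length data blocks with no separator, a reader only needs the record length to split the stream. A minimal sketch of that idea in plain Python, outside Hadoop (the 64-byte record length is a made-up placeholder, not the real block size):

```python
import io

RECORD_LEN = 64  # hypothetical fixed record length; the real value comes from the file format


def read_fixed_records(stream, record_len=RECORD_LEN):
    """Yield fixed-length records from a binary stream that has no separators."""
    while True:
        rec = stream.read(record_len)
        if not rec:
            break
        if len(rec) < record_len:
            # A truncated tail means the file is corrupt or the length is wrong.
            raise ValueError("trailing partial record of %d bytes" % len(rec))
        yield rec


# Example: three 64-byte records packed back to back, no delimiters.
data = bytes(range(64)) * 3
records = list(read_fixed_records(io.BytesIO(data)))
```

Inside Hadoop, essentially this same idea is what the (later-added) `FixedLengthInputFormat` provides: with a known record length, each mapper can seek to a record boundary and read whole records, which is why no separator is needed.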

Latency is not a major concern, since almost 99.99% of the work will be
offline batch processing. So, we can compromise there.
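Since the workload is offline batch processing, raw capacity and throughput dominate the sizing. For the roughly 24 PB (inclusive of replication) figure mentioned later in the thread, the back-of-envelope arithmetic looks like this; the per-node disk layout is a made-up assumption for illustration, not a recommendation:

```python
# Figures from the thread vs. hypothetical hardware assumptions.
TOTAL_PB = 24        # total footprint including replication (from the thread)
REPLICATION = 3      # HDFS default replication factor
DISKS_PER_NODE = 12  # assumed JBOD layout per datanode (hypothetical)
TB_PER_DISK = 2      # assumed disk size (hypothetical)

usable_pb = TOTAL_PB / REPLICATION           # logical data actually stored
node_raw_tb = DISKS_PER_NODE * TB_PER_DISK   # raw capacity per datanode
nodes = (TOTAL_PB * 1024) / node_raw_tb      # datanodes needed for raw capacity alone
```

With these assumptions: 8 PB of logical data and on the order of a thousand datanodes, before reserving any headroom for temporary MR output, OS, and non-HDFS space (typically another 20-25%).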

Thank you for the pointer. I'll definitely have a look at it.
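For context on the erasure-coding pointer (HDFS-503, the HDFS-RAID work): the core idea is that parity blocks replace full replicas, so k data blocks plus m parity blocks survive any m losses at far less than 3x storage. A toy single-parity (XOR, RAID-4-style) sketch, which tolerates the loss of any one block:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together into a single parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


# Three data blocks plus one parity block instead of 3x replication.
data_blocks = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data_blocks)

# Lose one data block; reconstruct it from the survivors plus parity.
lost = data_blocks[1]
recovered = xor_blocks([data_blocks[0], data_blocks[2], parity])
```

Production codes such as Reed-Solomon generalize this to tolerate multiple simultaneous failures; the XOR version is only meant to show why parity needs so much less space than full replicas.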

Regards,
    Mohammad Tariq



On Fri, Nov 30, 2012 at 4:37 AM, Gaurav Sharma
<gaurav.gs.sharma@gmail.com> wrote:

> The 7th question should've been the first, since it rather obviates the need
> for some of the other 6. If the data is binary, MR is of little use anyway.
> I didn't understand, and find it hard to believe, this statement:
>
> "No, entire data is equally important and will be read together."
>
> Other than that, an 8th question:
> 8. how much read latency can the system tolerate?
>
> and a 9th:
> 9. what is the usable size of a unit of data being read? Since it is binary,
> does the entire stream have to be read to make sense of it for the
> application, or are parts of the binary usable on their own?
>
>
> If you can get away with some read-latency, take a look at one of the
> commercial erasure coding solutions out there (like Cleversafe) or just
> code one yourself. Also, see:
> https://issues.apache.org/jira/browse/HDFS-503
>
> hth
>
>
>
> On Thu, Nov 29, 2012 at 2:19 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> Hello Gaurav,
>>
>>     Thank you so much for your reply. Please find my comments embedded
>> below :
>>
>> 1. do you know if there exist patterns in this data?
>> >> Yes, the entire file is divided into data blocks of fixed length (but
>> there is no separator between two blocks).
>>
>> 2. will the data be read and how?
>> >> Yes, data has to be read. To be honest, we are still not sure how to
>> do that.
>>
>> 3. does there exist a hot subset of the data - both read/write?
>> >> No, entire data is equally important and will be read together.
>>
>> 4. what makes you think hdfs is a good option?
>> >> Distributed architecture, flexibility to read any kind of data,
>> parallelism, native MR integration, cost, fault tolerance, high
>> throughput, etc.
>>
>> 5. how much do you intend to pay per TB?
>> >> I have to discuss it with my superiors (Will let you know soon).
>>
>> 6. say you do build the system, how do you plan to keep the lights on?
>> >> I am sorry, I did not get this. I mean I'll do whatever it takes to
>> keep everything moving. I have some experience with small clusters, and I
>> have a small team with me which is available 24x7.
>>
>> 7. forgot to ask - is the data textual or binary?
>> >> Data is binary.
>>
>> No, I would require some help. As I have said, I have a team with me, but
>> being new to Hadoop I would welcome help from whatever source I can get.
>>
>> Many thanks.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>> On Thu, Nov 29, 2012 at 5:40 AM, Gaurav Sharma <
>> gaurav.gs.sharma@gmail.com> wrote:
>>
>>> So, before getting any suggestions, you will have to explain a few core
>>> things:
>>>
>>> 1. do you know if there exist patterns in this data?
>>> 2. will the data be read and how?
>>> 3. does there exist a hot subset of the data - both read/write?
>>> 4. what makes you think hdfs is a good option?
>>> 5. how much do you intend to pay per TB?
>>> 6. say you do build the system, how do you plan to keep the lights on?
>>> 7. forgot to ask - is the data textual or binary?
>>>
>>> Those are just the basic questions. Are you going to be building and
>>> running the system all by yourself?
>>>
>>>
>>> On Nov 28, 2012, at 14:09, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>
>>> > Hello list,
>>> >
>>> >      Although a lot of similar discussions have taken place here, I
>>> still seek some of your able guidance. Till now I have worked only on small
>>> or mid-sized clusters, but this time the situation is a bit different. I
>>> have to collect a lot of legacy data, stored over the last few decades.
>>> This data is on tape drives, and I have to collect it from there and store
>>> it in my cluster. The size could go somewhere near 24 petabytes (inclusive
>>> of replication).
>>> >
>>> > Now, I need some help to kick this off: what could be the optimal
>>> config for my NN+JT, DN+TT+RS, HMaster, and ZK machines?
>>> >
>>> > What should be the number of slave and ZK peer nodes, keeping this
>>> config in mind?
>>> >
>>> > What is the optimal network config for a cluster of this size?
>>> >
>>> > Which kind of disks would be more efficient?
>>> >
>>> > Please do provide me some guidance as I want to have some expert
>>> comments before moving ahead. Many thanks.
>>> >
>>> > Regards,
>>> >     Mohammad Tariq
>>> >
>>>
>>
>>
>
