hadoop-common-user mailing list archives

From Sagar Shukla <sagar_shu...@persistent.co.in>
Subject RE: is hadoop suitable for us?
Date Thu, 17 May 2012 23:01:52 GMT
Hi PA,
       Thanks for the detailed explanation of your environment.

Based on my experience with Hadoop so far, here is my recommendation:
If you plan to process huge documents regularly and generate an index of the metadata, then
Hadoop is the way to go. I am not sure about the frequency and the size of the data that you
are talking about. Generally, Hadoop is used where you need to process GBs or TBs of data
at regular intervals.

As far as storage is concerned, it can be used in multiple ways. You are not required to
process the data and store it in HDFS only. You should be able to output the indexes /
metadata and store them on the regular filesystem as well. If you intend to use HDFS for the
distributed redundancy capabilities of Hadoop and you have SAN storage, then you can create
a LUN for each of the VMs and mount it, so that the data, although stored on a single storage
system, is visible to the VMs as distributed. Even though it is a single storage system, it
provides distributed and fast processing capabilities through the use of VMs.

Hope this helps.


-----Original Message-----
From: Pierre Antoine Du Bois De Naurois [mailto:padbdn@gmail.com] 
Sent: Thursday, May 17, 2012 6:33 PM
To: common-user@hadoop.apache.org
Subject: Re: is hadoop suitable for us?

We have large amount of text files that we want to process and index (plus applying other

The problem is that our configuration is share-everything while hadoop has a share-nothing

We have 50 VMs and not actual servers, and these share a huge central storage. So using HDFS
might not be really useful, as replication will not help and distribution of files has no
meaning, since all files will again be located on the same HDD. I am afraid that I/O will be
very slow with or without HDFS. So I am wondering if it will really help us to use
hadoop/hbase/pig etc. to distribute and do several parallel tasks, or whether it is "better"
to install something different (which I am not sure what). We heard myHadoop is better for
this kind of configuration; does anyone have any clue about it?

For example, we now have a central MySQL to check whether we have already processed a document
and to keep several pieces of metadata there. Soon we will have to distribute it, as there is
not enough space in one VM. But would Hadoop/HBase be useful? We don't want to do any complex
join/sort of the data; we just want to run queries to check whether we have already processed
a document and, if not, to add it with several pieces of its metadata.
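
A minimal sketch of the "check if already processed, otherwise add with metadata" lookup
described above, assuming documents are keyed by a hash of their content and the keys are
spread over several shards by hash partitioning. The names doc_key, shard_for and
ProcessedIndex are hypothetical, and the in-memory dicts only stand in for whatever
distributed store would actually hold the data; this is not HBase or MySQL code.

```python
import hashlib


def doc_key(content: bytes) -> str:
    """Stable key for a document: SHA-1 hex digest of its bytes."""
    return hashlib.sha1(content).hexdigest()


def shard_for(key: str, num_shards: int) -> int:
    """Map a hex key to a shard index by simple hash partitioning."""
    return int(key, 16) % num_shards


class ProcessedIndex:
    """In-memory stand-in for a sharded key-value store (hypothetical)."""

    def __init__(self, num_shards: int):
        # One dict per shard; a real setup would have one store per node.
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, key: str) -> dict:
        return self.shards[shard_for(key, len(self.shards))]

    def add_if_new(self, content: bytes, metadata: dict) -> bool:
        """Record the document and return True; return False if already seen."""
        key = doc_key(content)
        shard = self._shard(key)
        if key in shard:
            return False
        shard[key] = metadata
        return True

    def lookup(self, content: bytes):
        """Return the stored metadata, or None if the document is unknown."""
        key = doc_key(content)
        return self._shard(key).get(key)


index = ProcessedIndex(num_shards=10)
print(index.add_if_new(b"some text", {"source": "a.txt"}))  # first time seen
print(index.add_if_new(b"some text", {"source": "b.txt"}))  # duplicate content
```

The point of the sketch is that this access pattern is a single-key get/put with no joins
or sorts, so any store that partitions by key can serve it; which store fits best depends on
the data volume and the shared-storage I/O limits discussed in this thread.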

We heard sungrid for example is another way to go but it's commercial. We are somewhat lost..
so any help/ideas/suggestions are appreciated.


2012/5/17 Abhishek Pratap Singh <manu.infy@gmail.com>

> Hi,
> For your question if HADOOP can be used without HDFS, the answer is Yes.
> Hadoop can be used with any kind of distributed file system.
> But I am not able to understand the problem statement clearly enough to
> give my point of view.
> Are you processing text files and saving them in a distributed database?
> Regards,
> Abhishek
> On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois <padbdn@gmail.com> wrote:
> > We want to distribute processing of text files.. processing of large 
> > machine learning tasks, have a distributed database as we have big 
> > amount of data etc.
> >
> > The problem is that each VM can have up to 2TB of data (limitation
> > of VM), and we have 20TB of data. So we have to distribute the
> > processing, the database etc. But all those data will be in a shared
> > huge central file system.
> >
> > We heard about myHadoop, but we are not sure how it is any
> > different from Hadoop.
> >
> > What if we run hadoop/mapreduce without using HDFS? Is that an option?
> >
> > best,
> > PA
> >
> >
> > 2012/5/17 Mathias Herberts <mathias.herberts@gmail.com>
> >
> > > Hadoop does not perform well with shared storage and vms.
> > >
> > > The question should be asked first regarding what you're trying to
> > > achieve, not about your infra.
> > > On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois" <padbdn@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > We have about 50 VMs and we want to distribute processing across
> > > > them. However these VMs share a huge data storage system and thus
> > > > their "virtual" HDD are all located in the same computer. Would
> > > > Hadoop be useful for such configuration? Could we use hadoop
> > > > without HDFS? so that we can retrieve and store everything in the
> > > > same storage?
> > > >
> > > > Thanks,
> > > > PA
> > > >
> > >
> >

