hadoop-common-user mailing list archives

From Pierre Antoine DuBoDeNa <pad...@gmail.com>
Subject Re: is hadoop suitable for us?
Date Fri, 18 May 2012 04:10:00 GMT
Did you use HDFS too, or did you store everything directly on the SAN?

I don't have a GB/TB figure (it is probably about 2TB, so not really that
"huge"), but there are more than 100 million documents to process. On a
single machine we can currently process about 200,000 docs/day (several
parsing, indexing and metadata-extraction steps have to be done). So in the
worst case we want to use the 50 VMs to distribute the processing..
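A quick back-of-the-envelope check of those numbers (a sketch only, assuming the single-machine rate carries over to the VMs and that scaling is roughly linear, which shared-storage I/O contention may well prevent):

```python
# Rough throughput estimate for the workload described above.
# Assumes linear scaling across workers; shared-storage I/O contention
# would push the real figure higher.
docs_total = 100_000_000            # > 100 million documents
docs_per_day_per_machine = 200_000  # observed single-machine rate
workers = 50                        # the 50 VMs

days_parallel = docs_total / (docs_per_day_per_machine * workers)
days_single = docs_total / docs_per_day_per_machine
print(f"~{days_parallel:.0f} days with {workers} workers "
      f"(vs. ~{days_single:.0f} days on one machine)")
```

Even under these optimistic assumptions, the ideal speed-up is 50x, so the question of whether the shared SAN can sustain 50 concurrent readers is the deciding factor.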

2012/5/17 Sagar Shukla <sagar_shukla@persistent.co.in>

> Hi PA,
>     In my environment we had SAN storage and I/O was pretty good, so if
> you have a similar environment I don't see any performance issues.
>
> Just out of curiosity, what amount of data are you looking to process?
>
> Regards,
> Sagar
>
> -----Original Message-----
> From: Pierre Antoine Du Bois De Naurois [mailto:padbdn@gmail.com]
> Sent: Thursday, May 17, 2012 8:29 PM
> To: common-user@hadoop.apache.org
> Subject: Re: is hadoop suitable for us?
>
> Thanks Sagar, Mathias and Michael for your replies.
>
> It seems we will have to go with Hadoop, even if I/O will be slow due to
> our configuration.
>
> I will try to update on how it worked for our case.
>
> Best,
> PA
>
>
>
> 2012/5/17 Michael Segel <michael_segel@hotmail.com>
>
> > The short answer is yes.
> > The longer answer is that you will have to account for the latencies.
> >
> > There is more but you get the idea..
> >
> > Sent from my iPhone
> >
> > On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois" <
> > padbdn@gmail.com> wrote:
> >
> > > We have a large number of text files that we want to process and
> > > index (plus apply other algorithms).
> > >
> > > The problem is that our configuration is share-everything, while
> > > Hadoop has a share-nothing architecture.
> > >
> > > We have 50 VMs rather than actual servers, and they all share one
> > > huge central storage system. So using HDFS might not be very useful:
> > > replication will not help, and distributing the files is meaningless
> > > as they will all end up on the same disks anyway. I am afraid that
> > > I/O will be very slow with or without HDFS. So I am wondering whether
> > > it will really help us to use Hadoop/HBase/Pig etc. to distribute the
> > > work and run several parallel tasks, or whether it is "better" to
> > > install something different (though I am not sure what). We heard
> > > myHadoop is better for this kind of configuration; do you have any
> > > clue about it?
> > >
> > > For example, we now have a central mySQL to check whether we have
> > > already processed a document and to keep several metadata fields
> > > there. Soon we will have to distribute it, as there is not enough
> > > space on one VM. But would Hadoop/HBase be useful? We don't want to
> > > do any complex join/sort of the data; we just want to run queries to
> > > check whether a document has already been processed and, if not, to
> > > add it along with several of its metadata fields.
> > >
> > > We heard sungrid, for example, is another way to go, but it's
> > > commercial. We are somewhat lost, so any help/ideas/suggestions are
> > > appreciated.
> > >
> > > Best,
> > > PA
> > >
> > >
> > >
> > > 2012/5/17 Abhishek Pratap Singh <manu.infy@gmail.com>
> > >
> > >> Hi,
> > >>
> > >> For your question of whether Hadoop can be used without HDFS, the
> > >> answer is yes. Hadoop can be used with any kind of distributed file
> > >> system. But I am not able to understand the problem statement
> > >> clearly enough to give my point of view.
> > >> Are you processing text files and saving them in a distributed
> > >> database?
> > >>
> > >> Regards,
> > >> Abhishek
> > >>
> > >> On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois
> > >> < padbdn@gmail.com> wrote:
> > >>
> > >>> We want to distribute the processing of text files: large
> > >>> machine-learning tasks, a distributed database (as we have a big
> > >>> amount of data), etc.
> > >>>
> > >>> The problem is that each VM can hold up to 2TB of data (a VM
> > >>> limitation), and we have 20TB of data. So we have to distribute the
> > >>> processing, the database, etc. But all of that data will live in
> > >>> one huge shared central file system.
> > >>>
> > >>> We heard about myHadoop, but we are not sure how it differs from
> > >>> Hadoop.
> > >>>
> > >>> Could we run Hadoop/MapReduce without using HDFS? Is that an option?
> > >>>
> > >>> best,
> > >>> PA
> > >>>
> > >>>
> > >>> 2012/5/17 Mathias Herberts <mathias.herberts@gmail.com>
> > >>>
> > >>>> Hadoop does not perform well with shared storage and VMs.
> > >>>>
> > >>>> The question should first be about what you're trying to achieve,
> > >>>> not about your infra.
> > >>>> On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois" <
> > >>>> padbdn@gmail.com> wrote:
> > >>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> We have about 50 VMs and we want to distribute processing across
> > >>>>> them. However, these VMs share a huge data storage system, and
> > >>>>> thus their "virtual" HDDs are all located on the same computer.
> > >>>>> Would Hadoop be useful for such a configuration? Could we use
> > >>>>> Hadoop without HDFS, so that we retrieve and store everything on
> > >>>>> the same storage?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> PA
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>
>
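For reference, Abhishek's point that Hadoop can run without HDFS comes down to pointing the default filesystem at the shared mount instead of an HDFS NameNode. A minimal sketch (property names as in the Hadoop 1.x releases current in 2012; the JobTracker host below is a made-up placeholder):

```xml
<!-- core-site.xml: run MapReduce directly against the shared POSIX mount
     instead of HDFS. file:// selects the built-in LocalFileSystem. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

<!-- mapred-site.xml: a JobTracker is still needed so tasks are
     distributed across the 50 VMs ("jobtracker-host" is a placeholder). -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
</configuration>
```

With this setup Hadoop distributes the computation only; all reads and writes still land on the shared SAN, so the I/O contention discussed in the thread remains.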
