hadoop-hdfs-user mailing list archives

From Wilm Schumacher <wilm.schumac...@cawoom.com>
Subject Re: Planning to propose Hadoop initiative to company. Need some inputs please.
Date Wed, 01 Oct 2014 18:02:54 GMT

first: I think hbase is what you are looking for. If I understand
correctly, you want to show customers their data very quickly and
let them manipulate it. So you need something like a data
warehouse system. Thus, hbase is the method of choice for you (and I
think for your kind of data, hbase is a better choice than cassandra or
mongoDB). But of course you need a running hadoop system to run hbase.
So it's not an either/or ;)

(my answers are for hbase, as I think it's what you are looking for. If
you are not interested, just ignore the following text. Sorry to all for
writing about hbase on this list ;).)

On 01.10.2014 at 17:24, mani kandan wrote:
> 1) How much web usage data will a typical website like ours collect on a
> daily basis? (I know I can ask our IT department, but I would like to
> gather some background idea before talking to them.)
well, if you have the option to ask your IT department you should do
that, because everyone here would have to guess. You would have to
explain in great detail what you want to do before we could even guess.
If you e.g. want to track what the user has clicked, perhaps to show
personalized ads, then you have to save more data. So you should ask
the people who have the data right away instead of guessing.

> 3) How many clusters/nodes would I need to run a web usage analytics
> system?
in the book "HBase in Action" there are some recommendations in the form
of "case studies" (part IV "deploying hbase"). There are some thoughts on
the number of nodes, and how to use them, depending on the size of your
data.

> 4) What are the ways for me to use our data? (One use case I'm thinking
> of is to analyze the error messages log for each page on quote process
> to redesign the UI. Is this possible?)
sure. And this should be very easy. I would pump the error log into an
hbase table. That way you could read the messages directly from
the hbase shell (if they are few enough). Or you could use hive to query
your log in a more "sql like" way and compute statistics very easily.
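
To make "pump the error log into a table" a bit more concrete, here is
a rough sketch in plain Python (no hbase client involved) of how log
lines could be turned into rows. Everything specific here is an
assumption for illustration: the log line format, the "e" column family,
the column names and the "page|reversed timestamp" row key. The only
hbase fact it relies on is that rows are sorted by row key, so a key
like this keeps all errors of one page together, newest first.

```python
from datetime import datetime, timezone

# Hypothetical log format (an assumption): "<ISO timestamp> <page> <message>"
SAMPLE_LOG = [
    "2014-10-01T17:24:03 /quote/step1 Invalid zip code",
    "2014-10-01T17:25:10 /quote/step2 Missing date of birth",
    "2014-10-01T17:26:45 /quote/step1 Invalid zip code",
]

# Upper bound for epoch milliseconds (reached around the year 2286).
MAX_TS = 10**13


def parse_line(line):
    """Split one log line into (timestamp, page, message)."""
    ts_text, page, message = line.split(" ", 2)
    ts = datetime.strptime(ts_text, "%Y-%m-%dT%H:%M:%S")
    return ts.replace(tzinfo=timezone.utc), page, message


def row_key(page, ts):
    """Build 'page|reversed_ts' so newer errors sort before older ones."""
    millis = int(ts.timestamp() * 1000)
    return f"{page}|{MAX_TS - millis:013d}"


def to_rows(lines):
    """Map each log line to row_key -> {'family:qualifier': value}."""
    rows = {}
    for line in lines:
        ts, page, message = parse_line(line)
        rows[row_key(page, ts)] = {"e:msg": message, "e:ts": ts.isoformat()}
    return rows


rows = to_rows(SAMPLE_LOG)
# Iterating in key order mimics the order of a table scan.
for key in sorted(rows):
    print(key, rows[key]["e:msg"])
```

In a real setup the dictionaries would be written through whatever
client you choose; the point is only that the row key design decides
which questions are cheap to answer later.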

> 5) How long would it take for me to set up and start such a system?
for a novice who has to do it for the first time: for the standalone
hbase system perhaps 2 hours. For a completely distributed test cluster
... perhaps a day. For the real production system, with all security
features ... a little longer ;).

> I'm sorry if some/all of these questions are unanswerable. I just want
> to discuss my thoughts, and get an idea of what things can I achieve by
> going the way of Hadoop.
well, I think (but I could be wrong) that you think of hadoop (or hbase)
as if you could just change the "database backend" from "SQL" to
"hbase/hadoop" and everything would run right away. It will not be
that easy. You would have to change the code of your web application in
a very fundamental way. You have to rethink all the table designs etc.,
so this could be more complicated than you think right now.

However, hbase/hadoop has some advantages which are very interesting for
you. Well first, it is distributed, which lets your company grow
almost without limit, or collect more data about your customers so you
can get more information (and sell more stuff). And map reduce is a
wonderful tool for making really fancy "statistics", which is very
interesting for an insurance company. Your mathematical economist will
REALLY love it ;).
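
To give an idea of what such "statistics" look like in the map reduce
style, here is a toy, single-machine sketch in plain Python. It is not
hadoop code and uses no hadoop API; the records and the function names
(map_phase, shuffle, reduce_phase, run_job) are made up for
illustration. It counts how often each error message occurs on each
page, which is the kind of job you would run over the error log.

```python
from collections import defaultdict
from itertools import chain

# Made-up (page, error message) records standing in for the error log.
RECORDS = [
    ("/quote/step1", "Invalid zip code"),
    ("/quote/step2", "Missing date of birth"),
    ("/quote/step1", "Invalid zip code"),
    ("/quote/step1", "Session expired"),
]


def map_phase(record):
    """Emit one ((page, message), 1) pair per record."""
    page, message = record
    yield (page, message), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = run_job(RECORDS)
for (page, message), n in sorted(counts.items()):
    print(page, message, n)
```

On a cluster the map and reduce steps run in parallel on many nodes,
but the shape of the job stays the same.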

Hope this helped.

best wishes

