From: Utku Can Topçu
Date: Thu, 5 Nov 2009 07:32:52 +0200
Subject: Re: AW: Web Analytics Use case?
To: common-user@hadoop.apache.org

Ricky,

You're absolutely right. I've already started developing a new data collection system that populates sessions on the fly. Until that development is finished, I felt I needed a layered approach of abstractions. Session aggregation might eventually supersede the initial end-of-day Hadoop run.

The thing with these log files is that they are simply collections from the past 5 years. When I started this (5 years ago), I had no idea what I would be facing :)

On Wed, Nov 4, 2009 at 11:24 PM, Ricky Ho wrote:

> Why can't you do the session-specific calculation and aggregation at the spot where the session data is gathered?
>
> One of the main uses of Map/Reduce is aggregation across a very scattered data set, but the kind of processing you describe looks very localized. I mean, the same session is pretty much hitting the same server, so you can do the aggregation at that same spot.
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Utku Can Topçu [mailto:utku@topcu.gen.tr]
> Sent: Wednesday, November 04, 2009 7:48 AM
> To: common-user@hadoop.apache.org
> Subject: Re: AW: Web Analytics Use case?
>
> The reason I'm choosing DB loading is that each Session (the WAA calls this a Visit) is composed of multiple Events, where an Event is a line of the log file. For every session we can be sure that it is basically composed of one or more lines of log, which results in duplication of many session constants (i.e. IP address, User-Agent, unique visitor cookie, etc.). In addition, the Reducer goes over the sessions once again before loading them into the DB, so that we can compute some session-specific calculations on the fly.
>
> I hope I was precise and clear enough in expressing my design choice.
>
> Regards,
> Utku
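The alternative Ricky raises above (aggregating sessions right where the data is gathered, which Utku says he has already started building) might look roughly like the minimal sketch below. All class and method names, the plain-string events, and the 30-minute timeout are illustrative assumptions, not code from the thread:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Keeps the currently open sessions in memory at the tracking server and
// hands a finished session to some writer once the visitor has been idle
// long enough. Names and the 30-minute timeout are illustrative only.
public class SessionCollector {

  public static class OpenSession {
    final List<String> events = new ArrayList<String>();
    long lastSeen;
  }

  public interface SessionWriter {
    void write(String sessionId, List<String> events);
  }

  private static final long TIMEOUT_MS = 30 * 60 * 1000L;
  private final Map<String, OpenSession> open = new HashMap<String, OpenSession>();

  // Called for every incoming tracking request.
  public synchronized void record(String sessionId, String eventLine, long now) {
    OpenSession s = open.get(sessionId);
    if (s == null) {
      s = new OpenSession();
      open.put(sessionId, s);
    }
    s.events.add(eventLine);
    s.lastSeen = now;
  }

  // Called periodically; flushes sessions whose visitor has gone idle,
  // e.g. appending them to a log file, a DB loader, or HBase.
  public synchronized void flushExpired(long now, SessionWriter writer) {
    Iterator<Map.Entry<String, OpenSession>> it = open.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, OpenSession> e = it.next();
      if (now - e.getValue().lastSeen > TIMEOUT_MS) {
        writer.write(e.getKey(), e.getValue().events);
        it.remove();
      }
    }
  }
}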
> On Wed, Nov 4, 2009 at 4:41 PM, Ricky Ho wrote:
>
> > Good point. Hadoop can be used as a distributed DB loader.
> >
> > Just curious: how would you compare this with writing directly to HBase (bypassing the log and the Hadoop step)?
> >
> > Rgds,
> > Ricky
> >
> > -----Original Message-----
> > From: Utku Can Topçu [mailto:utku@topcu.gen.tr]
> > Sent: Tuesday, November 03, 2009 4:39 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: AW: Web Analytics Use case?
> >
> > Hey,
> >
> > Hadoop, HBase and Hive really do scale for web analytics; I have been doing web analytics with Hadoop for more than a year.
> >
> > In my case, I periodically rotate logs and put them on HDFS. (I should think about writing directly to HDFS, but it's not a critical issue for me right now.) Once the log files are on HDFS, a single map/reduce job runs over the newly introduced data line by line.
> >
> > The key point here is that we need to think of web analytics as a series of abstractions over the raw data. Each abstraction/analysis might correspond to a map/reduce job.
> >
> > The big question arises right here: what does the initial analysis do with the log files?
> >
> > Abstraction #1:
> > I assume each log line represents either a pageview or an event; we can generalize an event as a pageview too, and surely I will do so!
> > An event comes with some valuable information such as [session identifier, unique visitor identifier, browser- and locale-related data, page-related data, location-related data, etc.].
> >
> > Abstraction #2:
> > Our map/reduce job should map an Event to a Session Event in order to get a newer abstraction over the raw data. Session Events should then be reduced into Sessions, with the Session Identifiers as keys.
> > At the end of this abstraction we have our session data sorted out as (key, value) pairs, where the keys are the Session Identifiers and the values are, presumably, the Sessions. Which means we can now store the Sessions in a key/value database, which in this case is HBase.
> >
> > One can think of additional abstractions from this point on; I can come up with many ideas, some of which are fairly mature and some of which are just dreams and/or premature thoughts.
> >
> > Regards,
> > Utku
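A minimal sketch of the kind of map/reduce job Abstraction #2 describes: the mapper turns each log line (an Event) into a (Session Identifier, Session Event) pair, and the reducer assembles the Session and computes session-level figures before it is loaded anywhere. The tab-separated log layout with the session identifier in the first column is an assumption for illustration, and a real job would write to HBase or a DB loader rather than plain text output:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Sessionize {

  // Mapper: one log line = one event; emit (sessionId, rest of the line).
  public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed log layout: tab-separated, session identifier in the first column.
      String[] fields = line.toString().split("\t", 2);
      if (fields.length == 2) {
        context.write(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }

  // Reducer: all events of one session arrive together; assemble the Session
  // and compute session-level figures (event count here; duration, entry/exit
  // page, etc. would follow the same pattern) before the session is loaded.
  public static class SessionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text sessionId, Iterable<Text> events, Context context)
        throws IOException, InterruptedException {
      StringBuilder session = new StringBuilder();
      int eventCount = 0;
      for (Text event : events) {
        session.append(event.toString()).append('|');
        eventCount++;
      }
      // A real job might write the assembled session to HBase here instead.
      context.write(sessionId, new Text(eventCount + "\t" + session));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sessionize");
    job.setJarByClass(Sessionize.class);
    job.setMapperClass(EventMapper.class);
    job.setReducerClass(SessionReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}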
> > On Tue, Nov 3, 2009 at 9:14 PM, John Martyniak <john@beforedawnsolutions.com> wrote:
> >
> > > Benjamin,
> > >
> > > Well, instead of SQL you have code that you can use to manipulate the data. If it were possible, I would see whether there is some way you can pre-process as much of the data as possible to put it into HBase, and then use any additional Map/Reduce jobs to provide any additional customizations.
> > >
> > > I don't think that you can "replace" the RDBMS without re-visualizing the data, meaning that you will need to re-model it so that it fits into the HBase architecture, which means no relationships.
> > >
> > > By the way, most of this can be done; it just requires some work and a rethinking of the way that you do things, both for Map/Reduce and for HBase.
> > >
> > > -John
> > >
> > > On Nov 3, 2009, at 10:16 AM, Benjamin Dageroth wrote:
> > >
> > >> Hi John,
> > >>
> > >> Thanks a lot for the fast answer. I was unsure because we would like to avoid aggregating the data, so that our users can come up with all kinds of filters and conditions for their queries and always drill down to single users of their website. I am not sure how this works when SQL is not directly available. We are currently using complex SQL queries for this; these would then have to be rewritten in the form of Map/Reduce tasks which provide the final result?
> > >>
> > >> Or how would one go about actually replacing an RDBMS system?
> > >>
> > >> Thanks a lot,
> > >> Benjamin
> > >>
> > >> _______________________________________
> > >> Benjamin Dageroth, Business Development Manager
> > >> Webtrekk GmbH
> > >> Boxhagener Str. 76-78, 10245 Berlin
> > >> phone 030 - 755 415 - 360
> > >> fax 030 - 755 415 - 100
> > >> benjamin.dageroth@webtrekk.com
> > >> http://www.webtrekk.com
> > >> Amtsgericht Berlin, HRB 93435 B
> > >> Managing Director: Christian Sauer
> > >> _______________________________________
> > >>
> > >> -----Original Message-----
> > >> From: John Martyniak [mailto:john@beforedawnsolutions.com]
> > >> Sent: Tuesday, November 3, 2009 3:09 PM
> > >> To: common-user@hadoop.apache.org
> > >> Subject: Re: Web Analytics Use case?
> > >>
> > >> Benjamin,
> > >>
> > >> That is kind of the exact use case for Hadoop.
> > >>
> > >> Hadoop is a system built for handling very large datasets and delivering processed results. HBase is built for ad-hoc data, so instead of having complicated table joins etc., you have very large rows (multiple columns) with aggregate data, and then use HBase to return results from that.
> > >>
> > >> We currently use Hadoop/HBase to collect and process lots of data, then take the results of the processing to populate a SOLR index and a MySQL database, which are then used to feed the front ends. It seems to work pretty well in that it greatly reduces the number of rows and the size of the queries in the DB/index.
> > >>
> > >> We are exploring using HBase to feed the front ends in place of the MySQL DBs; so far the jury is out on the performance, but it does look promising.
> > >>
> > >> -John
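A rough illustration of the modelling John describes, assuming the 0.20-era HBase client API: everything a front-end query needs is pre-computed into one wide row per session, so a read is a keyed lookup rather than a join. The table, column family, column, and row-key names below are illustrative assumptions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SessionStore {

  private static final byte[] INFO = Bytes.toBytes("info");       // session constants
  private static final byte[] METRICS = Bytes.toBytes("metrics"); // pre-computed figures

  public static void main(String[] args) throws Exception {
    // A table "sessions" with column families "info" and "metrics" is assumed
    // to have been created beforehand (e.g. via the HBase shell).
    HTable table = new HTable(new HBaseConfiguration(), "sessions");

    // One wide row per session: everything a front-end query needs, no joins.
    byte[] row = Bytes.toBytes("2009-11-04|session-4711");
    Put put = new Put(row);
    put.add(INFO, Bytes.toBytes("visitor"), Bytes.toBytes("visitor-0815"));
    put.add(INFO, Bytes.toBytes("city"), Bytes.toBytes("Berlin"));
    put.add(INFO, Bytes.toBytes("keyword"), Bytes.toBytes("web analytics"));
    put.add(METRICS, Bytes.toBytes("pageviews"), Bytes.toBytes("12")); // stored as a string for simplicity
    table.put(put);

    // Reading it back is a single keyed lookup instead of a multi-table join.
    Result result = table.get(new Get(row));
    System.out.println(Bytes.toString(result.getValue(INFO, Bytes.toBytes("city"))));
  }
}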
> > >> On Nov 3, 2009, at 8:28 AM, Benjamin Dageroth wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am currently evaluating whether Hadoop might be an alternative to our current system. We are providing a web analytics solution for very large websites and run every analysis on all collected data; we do not aggregate the data. This results in very large amounts of data being processed for each query, and currently we are using an in-memory database by Exasol with really a lot of RAM, so that it does not take longer than a few seconds, and for more complicated queries not longer than a minute, to deliver the results.
> > >>>
> > >>> The solution, however, is quite expensive, and given the growth of the data I'd like to explore alternatives. I have read about NoSQL datastores and about Hadoop, but I am not sure whether it is actually a choice for our web analytics solution. We collect data via a tracking pixel which passes the data to a tracking server, which writes it to disk once the session of a visitor is done. Our current solution has a large number of tables, and the queries running on the data can be quite complex:
> > >>>
> > >>> How many users who came over that keyword and were from that city did actually buy the advertised product? Of these users, what other pages did they look at? Etc.
> > >>>
> > >>> Would this be a good case for HBase, Hadoop, Map/Reduce and perhaps Mahout?
> > >>>
> > >>> Thanks for any thoughts,
> > >>> Benjamin
> > >>>
> > >>> _______________________________________
> > >>> Benjamin Dageroth, Business Development Manager
> > >>> Webtrekk GmbH
> > >>> Boxhagener Str. 76-78, 10245 Berlin
> > >>> phone 030 - 755 415 - 360
> > >>> fax 030 - 755 415 - 100
> > >>> benjamin.dageroth@webtrekk.com
> > >>> http://www.webtrekk.com
> > >>> Amtsgericht Berlin, HRB 93435 B
> > >>> Managing Director: Christian Sauer
> > >>> _______________________________________
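As a closing illustration of Benjamin's question about rewriting such SQL queries as Map/Reduce tasks, his example ("how many users who came over a given keyword and were from a given city actually bought the advertised product?") could be expressed as a small job over session records like those discussed earlier in the thread. The tab-separated record layout (keyword, city, purchase flag, visitor id) and the literal filter values are purely illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BuyersByKeywordAndCity {

  // Mapper: keep only sessions that came via the keyword, from the city,
  // and ended in a purchase; emit the visitor id so duplicates collapse.
  public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text session, Context context)
        throws IOException, InterruptedException {
      String[] f = session.toString().split("\t");
      if (f.length >= 4 && f[0].equals("shoes") && f[1].equals("Berlin") && f[2].equals("bought")) {
        context.write(new Text(f[3]), NullWritable.get());
      }
    }
  }

  // Reducer: one output line (and one counter tick) per distinct buyer.
  public static class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text visitor, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      context.getCounter("analytics", "distinct buyers").increment(1);
      context.write(visitor, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "buyers-by-keyword-and-city");
    job.setJarByClass(BuyersByKeywordAndCity.class);
    job.setMapperClass(FilterMapper.class);
    job.setReducerClass(DistinctReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}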