hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcus Herou <marcus.he...@tailsweep.com>
Subject Re: Parallell maps
Date Thu, 02 Jul 2009 20:09:20 GMT

Guys whoa slow down! We are on the same side and agree on most things :)

We are using hadoop in these cases.

1. Prospect Crawling: A job exports a few hundred thousand urls from a db
into a flat file and then lets a threaded MapRunnable go crazy crawling on
that one and save the fetched data and response to glusterFS during runtime.
In the end of each url's crawl the state is then moved from "fetchable" to
"parsable" and mapped back into the db. This is done during runtime but
should be done afterwards.
2. Prospect Parsing: Again export the urls from the db in question into
hadoop where we process them and perform ngramanalysis, spam detection, blog
detection etc and if everyting goes well marks the url as a valid blog and
updates the state accordingly, again during runtime, should be done
3. Blog Crawling: Fetch the feed urls, basically same as case1 but here we
fetch the feed urls instead.
4. Blog Parsing: Parse the feed url into a SyndFeed object (rome project)
and process the SyndFeed and all it's SyndEntries and persist them into a DB
during runtime. This could be improved by emitting the entries instead of
saving the to a DB runtime. Each new entry notifies a indexer Q.
5. Indexing: The indexer polls the Q, loads the id's into a threaded
MapRunnable and go crazy updating SOLR.

It is basically an extension and refinement of Nutch where we have refined
the plugin architecture to use Spring, created wrappers around some existing
plugins and chnage the storage solution from the CrawlDB and LinkDB to use a
database instead.

Log file analysis:
We are amongb things a blog advertising company so statistics is key. We use
Hadoop for 95% of the processing of the incoming access logs, well not
access logs but very alike and then emits the data into MonetDB which have
great speed (but bad robustness) for grouping, counting etc.

The end users are then using SOLR for searching and our sharded DB solution
to fetch the item in question, the blog entry, the blog itself and so on.
Perhaps we will use HBase but after testing last year I ruled HBase out due
to performance since I got more bang for the buck writing my own arch, sorry

The architecture is quite good but just needs some tuning which you pointed
out. Whenever you can emit a record to the outputCollector instead of
updating the DB then you should.

We now have 90 million blog entries in the DB (in 4 months) and have
prospected over a billion urls so we are doing something right I hope.

The originating question seem to have got out of scope haha.

Cheers lads


On Thu, Jul 2, 2009 at 5:08 PM, Uri Shani <SHANI@il.ibm.com> wrote:

> Hi Marcus,
> If you need a database to serve as an in-between to your customer, than
> you can do massive load to DB of the end results - as I suggested.
> You can avoid that and work with files only. Than you need to serve your
> customers via an API to your hdfs files. Each query will start a M/R job
> to get the results.
> Don't know the response time - yet it is a viable approach to search.
> There are also the hadoop DBs: Hbase and Hive. PIG may also have SQL like
> things in the future, and there is the JAQL project http://www.jaql.org/.
> - Uri
> From:
> Marcus Herou <marcus.herou@tailsweep.com>
> To:
> common-user@hadoop.apache.org
> Date:
> 02/07/2009 05:50 PM
> Subject:
> Re: Parallell maps
> Hi.
> Yep I recon this. As I said, I have created a sharded DB solution where
> all
> data is currently spread across 50 shards. I totally agree that one should
> try to emit data to the outputCollector but when one have the data in
> memory
> I do not want to throw it away and re-read it into memory again later on
> after the job is done by chained Hadoop jobs... In our stats system we do
> exactly this, almost no db all over the complex chain, just reducing and
> counting like crazy :)
> There is another pro following your example and that is that you risk less
> during the actual job since you do not have to create a db-pool of
> thousands
> of connections. And the jobs do not affect the live production DB = good.
> Anyway what do you mean that if you HAVE to load it into a db ? How the
> heck
> are my users gonna access our search engine without accessing a populated
> index or DB ?
> I've been thinking about this in many use-cases and perhaps I am thinking
> totally wrong but I tend to not be able to implement a shared-nothing arch
> all over the chain it is simply not possible. Sometimes you need a DB,
> sometimes you need an index, sometimes a distributed cache, sometimes a
> file
> on local FS/NFS/glusterFS (yes I prefer classical mount points over HDFS
> due
> to the non wrapper characteristict of HDFS and it's speed, random access
> IO
> that is).
> I will put a mental note to adapt and be more hadoopish :)
> Cheers
> //Marcus
> On Thu, Jul 2, 2009 at 4:27 PM, Uri Shani <SHANI@il.ibm.com> wrote:
> > Hi,
> > Whenever you try to access DB in parallel you need to design it for
> that.
> > This means, for instance, that you ensure that each of the parallel
> tasks
> > inserts to a distinct "partition" or table in the database to avoid the
> > conflicts and failures. Hadoop does the same in its reduce tasks - each
> > reduce gets ALL the records that are needed to do its function. So there
> > is a hash mapping keys to these tasks. Same principle you need to follow
> > when linking DB partitions to your hadoop tasks.
> >
> > In general, think how to not use DB inserts from Hadoop. Rather, create
> > your results in files.
> > At the end of the process, you can - if this is what you HAVE to do -
> > massively load all the records into the database using efficient loading
> > utilities.
> >
> > If you need the DB to communicate among your tasks, meaning that you
> need
> > the inserts to be readily available for other threads to select, than it
> > is obviously the wrong media for such sharing and you need to look at
> > other solutions to share consistent data among hadoop tasks. For
> instance,
> > zookeeper, etc.
> >
> > Regards,
> > - Uri
> >
> >
> >
> > From:
> > Marcus Herou <marcus.herou@tailsweep.com>
> > To:
> > common-user <common-user@hadoop.apache.org>
> > Date:
> > 02/07/2009 12:13 PM
> > Subject:
> > Parallell maps
> >
> >
> >
> > Hi.
> >
> > I've noticed that hadoop spawns parallell copies of the same task on
> > different hosts. I've understood that this is due to improve the
> > performance
> > of the job by prioritizing fast running tasks. However since we in our
> > jobs
> > connect to databases this leads to conflicts when inserting, updating,
> > deleting data (duplicated key etc). Yes I know I should consider Hadoop
> as
> > a
> > "Shared Nothing" architecture but I really must connect to databases in
> > the
> > jobs. I've created a sharded DB solution which scales as well or I would
> > be
> > doomed...
> >
> > Any hints of how to disable this feature or howto reduce the impact of
> it
> > ?
> >
> > Cheers
> >
> > /Marcus
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> >
> >
> >
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/

Marcus Herou CTO and co-founder Tailsweep AB

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message