From: Apache Wiki
To: core-commits@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Mon, 10 Nov 2008 19:15:23 -0000
Message-ID: <20081110191523.11005.88603@eos.apache.org>
Subject: [Hadoop Wiki] Update of "Hbase/PoweredBy" by jgray

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by jgray:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

------------------------------------------------------------------------------
  [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All the markup that powers the wiki is stored in HBase. It's been in use for a few months now. !MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's in-house editors produce a lot of revisions per day, which was not working well in an RDBMS. An HBase-based solution for this was built and tested, and the data migrated out of MySQL and into HBase. Right now it's at something like 6 million items in HBase. The upload tool runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10 minutes to run - and does not slow down production at all.

  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store raw documents. We have a ~70 node Hadoop cluster running DFS, MapReduce, and HBase. In our Wikipedia HBase table, we have one row for each Wikipedia page (~2.5M pages and climbing). We use this as input to our indexing jobs, which are run in Hadoop MapReduce. Uploading the entire Wikipedia dump to our cluster takes a couple of hours. Scanning the table inside MapReduce is very fast -- the latency is in the noise compared to everything else we do.
+
+ [http://www.streamy.com/ Streamy] is a recently launched realtime social news site. We use HBase for all of our data storage, query, and analysis needs, replacing an existing SQL-based system. This includes hundreds of millions of documents, sparse matrices, logs, and everything else once done in the relational system. We perform significant in-memory caching of query results, similar to a traditional Memcached/SQL setup, and use other external components to perform joining and sorting. We also run thousands of daily MapReduce jobs using HBase tables for log analysis, attention data processing, and feed crawling. HBase has helped us scale and distribute in ways we could not otherwise, and the community has provided consistent and invaluable assistance.

  [http://www.subrecord.org SubRecord Project] is an Open Source project that uses HBase as a repository of records (persisted map-like data) for the aspects it provides, such as logging, tracing, or metrics. HBase and a Lucene index together constitute the repository/storage for this platform.