From: Apache Wiki
To: Apache Wiki
Date: Sun, 16 Oct 2011 06:47:58 -0000
Subject: [Hadoop Wiki] Update of "PoweredBy" by nicolas.brousse

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "PoweredBy" page has been changed by nicolas.brousse:
http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev1=355&rev2=356

Comment:
Add TubeMogul, Inc.

= A =
 * [[http://a9.com/|A9.com]] - Amazon*
  * We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools (see the streaming sketch below).
  * We process millions of sessions daily for analytics, using both the Java and streaming APIs.
  * Our clusters vary from 1 to 100 nodes.

 * [[http://www.accelacommunications.com|Accela Communications]]
  * We use a Hadoop cluster to roll up registration and view data each night.
  * Our cluster has 10 1U servers, with 4 cores, 4 GB RAM and 3 drives.
  * Each night, we run 112 Hadoop jobs.
  * It is roughly 4x faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups, and then import them back into the databases than to perform the same rollups directly in the database.

 * [[http://www.adobe.com|Adobe]]
  * We use Hadoop and HBase in several areas, from social services to structured data storage and processing for internal use.
  * We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, in both production and development. We plan a deployment on an 80-node cluster.
  * We constantly write data to HBase and run MapReduce jobs to process it, then store it back to HBase or external systems.
  * Our production cluster has been running since October 2008.
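Many entries on this page, starting with A9.com above, drive Hadoop through the streaming API with scripting-language tools. The sketch below is a generic illustration of that pattern, not any listed company's actual code: the tab-separated input layout, the HDFS paths and the session-counting task are all invented for the example.

{{{#!python
#!/usr/bin/env python
# Minimal Hadoop Streaming job: count occurrences of a key (here, a session
# id assumed to sit in the first tab-separated field of each log line).
# A submission command would look roughly like:
#   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
#     -input /logs/2011-10-16 -output /reports/2011-10-16 \
#     -mapper 'sessions.py map' -reducer 'sessions.py reduce' -file sessions.py
import sys

def mapper():
    # Emit "<session_id>\t1" for every input record.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print "%s\t1" % fields[0]

def reducer():
    # Streaming sorts mapper output by key, so identical keys arrive
    # together; sum the counts for each run of equal keys.
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print "%s\t%d" % (current, total)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
}}}

Streaming keeps job orchestration in Hadoop while the per-record work stays in whatever language a team already uses, which is why it appears so often in the entries on this page.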
 * [[http://www.adyard.de|adyard]]
  * We use Flume, Hadoop and Pig for log storage and report generation as well as ad targeting.
  * We currently have 12 nodes running HDFS and Pig and plan to add more from time to time.
  * 50% of our recommender system is pure Pig because of its ease of use.
  * Some of our more deeply integrated tasks use the streaming API and Ruby, as well as the excellent Wukong library.

 * [[http://www.ablegrape.com/|Able Grape]] - Vertical search engine for trustworthy wine information
  * We have one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node).
  * Hadoop and Nutch are used to analyze and index textual information.

 * [[http://adknowledge.com/|Adknowledge]] - Ad network
  * Hadoop is used to build the recommender system for behavioral targeting, plus other clickstream analytics.
  * We handle 500MM clickstream events per day.
  * Our clusters vary from 50 to 200 nodes, mostly on EC2.
  * Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.

 * [[http://www.aguja.de|Aguja]] - E-commerce data analysis
  * We use Hadoop, Pig and HBase to analyze search logs and product view data, and to analyze all of our logs.
  * 3-node cluster with 48 cores in total, 4 GB RAM and 1 TB storage each.

 * [[http://china.alibaba.com/|Alibaba]]
  * A 15-node cluster dedicated to processing sorts of business data dumped out of databases and joining them together. These data are then fed into iSearch, our vertical search engine.
  * Each node has 8 cores, 16 GB RAM and 1.4 TB storage.

 * [[http://aol.com/|AOL]]
  * We use Hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
  * The cluster that we use mainly for behavioral analysis and targeting has 150 machines: Intel Xeon, dual-processor, dual-core, each with 16 GB RAM and an 800 GB hard disk.

 * [[http://www.ara.com.tr/|ARA.COM.TR]] - Ara Com Tr - Turkey's first and only search engine
  * We build the Ara.com.tr search engine using the Python tools.
  * We use Hadoop for analytics.
  * We handle about 400 TB per month.
  * Our clusters vary from 10 to 100 nodes.

 * [[http://atbrox.com/|Atbrox]]
  * We use Hadoop for information extraction & search, and data analysis consulting.
  * Cluster: we primarily use Amazon's Elastic MapReduce (see the job-flow sketch below).

 * [[http://www.ABC-Online-Shops.de/|ABC Online Shops]]
  * Shop the Internet search engine

 * [[http://www.aflam-online.com/|افلام اون لاين]]

@@ -88, +76 @@

= B =
 * [[http://www.babacar.org/|BabaCar]]
  * 4-node cluster (32 cores, 1 TB).
  * We use Hadoop for searching and analysis of millions of rental bookings.
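Several of the entries above (Adknowledge, Atbrox) run their Hadoop jobs on Amazon EC2 or Elastic MapReduce rather than on their own hardware. Below is a hedged sketch of submitting a single streaming step to Elastic MapReduce with the boto library; the bucket names, script locations and instance sizing are placeholders, and AWS credentials are assumed to be available from the environment or the boto configuration file.

{{{#!python
#!/usr/bin/env python
# Sketch: launch a one-step streaming job flow on Amazon Elastic MapReduce.
# All S3 paths and sizing below are illustrative, not taken from this page.
import boto
from boto.emr.step import StreamingStep

conn = boto.connect_emr()  # credentials come from the environment/boto config

step = StreamingStep(
    name='nightly session counts',
    mapper='s3n://example-bucket/code/sessions.py map',
    reducer='s3n://example-bucket/code/sessions.py reduce',
    input='s3n://example-bucket/logs/2011-10-16/',
    output='s3n://example-bucket/reports/2011-10-16/')

jobflow_id = conn.run_jobflow(
    name='nightly-report',
    log_uri='s3n://example-bucket/emr-logs/',
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
    num_instances=4,
    steps=[step])

print 'Started job flow', jobflow_id
}}}

The appeal, as the EC2-based entries here suggest, is that the cluster typically only exists (and is only billed) while the job flow runs.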
 * [[http://www.backdocsearch.com|backdocsearch.com]] - search engine for chiropractic information, local chiropractors, products and schools

 * [[http://www.baidu.cn|Baidu]] - the leading Chinese-language search engine
  * Hadoop is used to analyze search logs and do some mining work on the web page database.
  * We handle about 3000 TB per week.
  * Our clusters vary from 10 to 500 nodes.
  * Hypertable is also supported by Baidu.

 * [[http://www.beebler.com|Beebler]]
  * 14-node cluster (each node has: 2 dual-core CPUs, 2 TB storage, 8 GB RAM)
  * We use Hadoop for matching dating profiles.

 * [[http://www.benipaltechnologies.com|Benipal Technologies]] - Outsourcing, Consulting, Innovation
  * 35-node cluster (Core2Quad Q9400 processor, 4-8 GB RAM, 500 GB HDD)
  * Largest data node with 2x Xeon E5420 processors, 64 GB RAM, 3.5 TB HDD
  * Total cluster capacity of around 20 TB on a gigabit network with failover and redundancy
  * Hadoop is used for internal data crunching, application development, testing and getting around I/O limitations.

 * [[http://bixolabs.com/|Bixo Labs]] - Elastic web mining
  * The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
  * We're doing a 200M-page/5 TB crawl as part of the [[http://bixolabs.com/datasets/public-terabyte-dataset-project/|public terabyte dataset project]].
  * This runs as a 20-machine [[http://aws.amazon.com/elasticmapreduce/|Elastic MapReduce]] cluster.

 * [[http://www.brainpad.co.jp|BrainPad]] - Data mining and analysis
  * We use Hadoop to summarize users' tracking data.
  * We also use it for analysis.

= C =
 * [[http://caree.rs/|Caree.rs]]
  * Hardware: 15 nodes
  * We use Hadoop to process company and job data and run machine learning algorithms for our recommendation engine.

 * [[http://www.cdunow.de/|CDU now!]]
  * We use Hadoop for our internal searching, filtering and indexing.

 * [[http://www.charlestontraveler.com/|Charleston]]
  * Hardware: 15 nodes
  * We use Hadoop to process company and job data and run machine learning algorithms for our recommendation engine.

 * [[http://www.cloudspace.com/|Cloudspace]]
  * Used on client projects and internal log reporting/parsing systems designed to scale to infinity and beyond.
  * Client project: Amazon S3-backed, web-wide analytics platform (see the S3 sketch below).
  * Internal: cross-architecture event log aggregation & processing

 * [[http://www.contextweb.com/|Contextweb]] - Ad exchange
  * We use Hadoop to store ad serving logs and use it as a source for ad optimizations, analytics, reporting and machine learning.
  * Currently we have a 50-machine cluster with 400 cores and about 140 TB raw storage. Each (commodity) node has 8 cores and 16 GB of RAM.

 * [[http://www.cooliris.com|Cooliris]] - Cooliris transforms your browser into a lightning-fast, cinematic way to browse photos and videos, both online and on your hard drive.
  * We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB RAM, and 3-4 TB of storage.
  * We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
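Amazon S3 appears throughout this list as the durable store that feeds Hadoop and receives its results (Cloudspace above; IMVU, NetSeer, Powerset and Papertrail further down). The snippet below is a minimal sketch of that staging pattern using boto; the bucket and key names are invented for illustration, and credentials are again assumed to come from the environment or boto configuration.

{{{#!python
#!/usr/bin/env python
# Sketch: pull one day of gzipped logs out of S3 for processing, then push
# the finished report back.  Bucket and key names are placeholders.
import os
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('example-analytics-logs')

# Download every log object for the day into a local scratch directory.
for key in bucket.list(prefix='raw/2011-10-16/'):
    key.get_contents_to_filename(os.path.join('/tmp', os.path.basename(key.name)))

# Once a Hadoop job has produced a summary, publish it back to S3.
report = bucket.new_key('reports/2011-10-16/summary.tsv')
report.set_contents_from_filename('/tmp/summary.tsv')
}}}

Jobs running on Elastic MapReduce can also read s3n:// paths directly, as in the job-flow sketch above, skipping the local staging step.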
 * [[http://www.weblab.infosci.cornell.edu/|Cornell University Web Lab]]
  * Generating web graphs on 100 nodes (dual 2.4 GHz Xeon processor, 2 GB RAM, 72 GB hard drive)

 * [[http://www.crs4.it|CRS4]]
  * [[http://dx.doi.org/10.1109/ICPPW.2009.37|Computational biology applications]]
  * [[http://www.springerlink.com/content/np5u8k1x9l6u755g|HDFS as a VM repository for virtual clusters]]

 * [[http://crowdmedia.de/|crowdmedia]]
  * crowdmedia has a 5-node Hadoop cluster for statistical analysis.
  * We use Hadoop to analyse trends on Facebook and other social networks.

= D =
 * [[http://datagraph.org/|Datagraph]]
  * We use Hadoop for batch-processing large [[http://www.w3.org/RDF/|RDF]] datasets, in particular for indexing RDF data.
  * We also use Hadoop for executing long-running offline [[http://en.wikipedia.org/wiki/SPARQL|SPARQL]] queries for clients.
  * We use Amazon S3 and Cassandra to store input RDF datasets and output files.

@@ -180, +152 @@

  * We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).

 * [[http://www.deepdyve.com|Deepdyve]]
  * Elastic cluster with 5-80 nodes
  * We use Hadoop to create our indexes of deep web content and to provide a high-availability and high-bandwidth storage service for index shards for our search cluster.

 * [[http://www.wirtschaftsdetektei-berlin.de|Detektei Berlin]]
  * We are using Hadoop in our data mining and multimedia/internet research groups.
  * 3-node cluster with 48 cores in total, 4 GB RAM and 1 TB storage each.

 * [[http://search.detik.com|Detikcom]] - Indonesia's largest news portal
  * We use Hadoop, Pig and HBase to analyze search logs, generate Most Viewed News, generate top word clouds, and analyze all of our logs.
  * Currently we use 9 nodes.

 * [[http://www.dropfire.com|DropFire]]
  * We generate Pig Latin scripts that describe structural and semantic conversions between data contexts.
  * We use Hadoop to execute these scripts for production-level deployments.
  * Eliminates the need for explicit data and schema mappings during database integration.

= E =
 * [[http://www.ebay.com|EBay]]
  * 532-node cluster (8 * 532 cores, 5.3 PB).
  * Heavy usage of Java MapReduce, Pig, Hive, HBase
  * Using it for search optimization and research.

 * [[http://www.enet.gr|Enet]], 'Eleftherotypia' newspaper, Greece
  * Experimental installation - storage for logs and digital assets
  * Currently a 5-node cluster
  * Using Hadoop for log analysis/data mining/machine learning

 * [[http://www.enormo.com/|Enormo]]
  * 4-node cluster (32 cores, 1 TB).
  * We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
  * We plan to use Pig very shortly to produce statistics.

 * [[http://blog.espol.edu.ec/hadoop/|ESPOL University (Escuela Superior Politécnica del Litoral) in Guayaquil, Ecuador]]
  * 4-node proof-of-concept cluster.
  * We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
  * The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to its AWS in Education program.
 * [[http://www.systems.ethz.ch/education/courses/hs08/map-reduce/|ETH Zurich Systems Group]]
  * We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use cases from biological data analysis.
  * Cluster hardware: 16 x (quad-core Intel Xeon, 8 GB RAM, 1.5 TB hard disk)

 * [[http://www.eyealike.com/|Eyealike]] - Visual Media Search Platform
  * Facial similarity and recognition across large datasets.
  * Image-content-based advertising and auto-tagging for social media.
  * Image-based video copyright protection.

= F =
 * [[http://www.facebook.com/|Facebook]]
  * We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
  * Currently we have 2 major clusters:
   * A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

@@ -247, +208 @@

  * We are heavy users of both streaming and the Java APIs. We have built a higher-level data warehousing framework using these features, called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.

 * [[http://www.foxaudiencenetwork.com|FOX Audience Network]]
  * 40-machine cluster (8 cores/machine, 2 TB/machine storage)
  * 70-machine cluster (8 cores/machine, 3 TB/machine storage)
  * 30-machine cluster (8 cores/machine, 4 TB/machine storage)
  * Used for log analysis, data mining and machine learning

 * [[http://www.forward3d.co.uk|Forward3D]]
  * 5-machine cluster (8 cores/machine, 5 TB/machine storage)
  * Existing 19-virtual-machine cluster (2 cores/machine, 30 TB storage)
  * Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) using [[http://github.com/trafficbroker/mandy|our Ruby library]], or see the [[http://oobaloo.co.uk/articles/2010/1/12/mapreduce-with-hadoop-and-ruby.html|canonical WordCount example]].

@@ -264, +223 @@

  * Machine learning

 * [[http://freestylers.jp/|Freestylers]] - Image retrieval engine
  * We are a Japanese company using Hadoop to build the image-processing environment for our image-based product recommendation system, mainly on Amazon EC2, since April 2009.
  * Our Hadoop environment produces the original database for fast access from our web application.
  * We also use Hadoop to analyze similarities in users' behavior.

= G =
 * [[http://www.gis.tw/en|GIS.FCU]]
  * Feng Chia University
  * 3-machine cluster (4 cores, 1 TB/machine)
  * Storage for sensor data

 * [[http://www.google.com|Google]]
  * [[http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html|University Initiative to Address Internet-Scale Computing Challenges]]

 * [[http://www.gruter.com|Gruter. Corp.]]
  * 30-machine cluster (4 cores, 1-2 TB/machine storage)
  * Storage for blog data and web documents
  * Used for data indexing by MapReduce
  * Link analysis and machine learning by MapReduce

 * [[http://gumgum.com|GumGum]]
  * 9-node cluster (Amazon EC2 c1.xlarge)
  * Nightly MapReduce jobs on [[http://aws.amazon.com/elasticmapreduce/|Amazon Elastic MapReduce]] process data stored in S3
  * MapReduce jobs written in [[http://groovy.codehaus.org/|Groovy]] use Hadoop Java APIs

@@ -295, +249 @@

= H =
 * [[http://www.hadoop.co.kr/|Hadoop Korean User Group]], a Korean local community team page.
  * 50-node cluster in the Korea university network environment.
  * Pentium 4 PCs, HDFS 4 TB storage
  * Used for development projects

@@ -303, +256 @@

  * Latent Semantic Analysis, Collaborative Filtering

 * [[http://www.hotelsandaccommodation.com.au/|Hotels & Accommodation]]
  * 3-machine cluster (4 cores/machine, 2 TB/machine)
  * Hadoop for data search and aggregation
  * HBase hosting

 * [[http://www.hulu.com|Hulu]]
  * 13-machine cluster (8 cores/machine, 4 TB/machine)
  * Log storage and analysis
  * HBase hosting (see the HBase client sketch below)

 * [[http://www.hundeshagen.de|Hundeshagen]]
  * 6-node cluster (each node has: 4 dual-core CPUs, 1.5 TB storage, 4 GB RAM, Red Hat OS)
  * Using Hadoop for our high-speed data mining applications in cooperation with [[http://www.ehescheidung-jetzt.de|Online Scheidung]]

 * [[http://www.hadoop.tw/|Hadoop Taiwan User Group]]

 * [[http://net-ngo.com|Hipotecas y euribor]]
  * Evolution of the Euribor and its current value
  * Mortgage simulator for times of economic crisis

 * [[http://www.hostinghabitat.com/|Hosting Habitat]]
  * We use a customised version of Hadoop and Nutch in a currently experimental 6-node/dual-core cluster environment.
  * We crawl our clients' websites and, from the information we gather, fingerprint old and outdated software packages in that shared hosting environment. We can then inform our clients that they are running old or outdated software after matching a signature against a database. With that information we know which sites require patching, which we offer as a free courtesy service to protect the majority of users. Without the technologies of Nutch and Hadoop this would be a far harder task to accomplish.

= I =
 * [[http://www.ibm.com|IBM]]
  * [[http://www-03.ibm.com/press/us/en/pressrelease/22613.wss|Blue Cloud Computing Clusters]]
  * [[http://www-03.ibm.com/press/us/en/pressrelease/22414.wss|University Initiative to Address Internet-Scale Computing Challenges]]

 * [[http://www.iccs.informatics.ed.ac.uk/|ICCS]]
  * We are using Hadoop and Nutch to crawl blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.

 * [[http://search.iiit.ac.in/|IIIT, Hyderabad]]
  * We use Hadoop for information retrieval and extraction research projects. We are also working on MapReduce scheduling research for multi-job environments.
  * Our cluster sizes vary from 10 to 30 nodes, depending on the jobs. Heterogeneous nodes, with most being Quad 6600s with 4 GB RAM and 1 TB disk per node, plus some nodes with dual-core and single-core configurations.

 * [[http://www.imageshack.us/|ImageShack]]
  * From [[http://www.techcrunch.com/2008/05/20/update-imageshack-ceo-hints-at-his-grander-ambitions/|TechCrunch]]:
   . Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targeting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.
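A number of entries pair Hadoop with HBase for storing logs or serving precomputed data (Hotels & Accommodation and Hulu above; StumbleUpon and WorldLingo later on). As a hedged illustration of light-weight HBase access from a script, the sketch below uses the Thrift-based happybase client; the host, table, row-key scheme and column names are invented, and it assumes an HBase Thrift server is running.

{{{#!python
#!/usr/bin/env python
# Sketch: write one log event to HBase and scan a user's events for a day.
# Table layout and names are illustrative only.
import happybase

connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('view_logs')

# Store one event, keyed "<user>-<timestamp>" so per-user scans stay cheap.
table.put('user123-20111016T064758', {'log:url': 'http://example.com/a',
                                      'log:status': '200'})

# Scan everything recorded for user123 on 2011-10-16.
for row_key, data in table.scan(row_start='user123-20111016',
                                row_stop='user123-20111017'):
    print row_key, data
}}}

Row keys in HBase are stored sorted, so choosing a key that starts with the natural scan prefix (here the user id) is what makes such range scans efficient.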
 * [[http://www.imvu.com/|IMVU]]
  * We use Hadoop to analyze our virtual economy.
  * We also use Hive to access our trove of operational data to inform product development decisions around improving user experience and retention as well as meeting revenue targets.
  * Our data is stored in S3 and pulled into our clusters of up to 4 m1.large EC2 instances. Our total data volume is on the order of 5 TB.

 * [[http://www.infolinks.com/|Infolinks]]
  * We use Hadoop to analyze production logs and to provide various statistics on our in-text advertising network.
  * We also use Hadoop/HBase to process user interactions with advertisements and to optimize ad selection.

 * [[http://www.isi.edu/|Information Sciences Institute (ISI)]]
  * Used Hadoop and 18 nodes/52 cores to [[http://www.isi.edu/ant/address/whole_internet/|plot the entire internet]].

 * [[http://infochimps.org|Infochimps]]
  * 30-node AWS EC2 cluster (varying instance size, currently EBS-backed) managed by Chef & Poolparty, running Hadoop 0.20.2+228, Pig 0.5.0+30, Azkaban 0.04, [[http://github.com/infochimps/wukong|Wukong]]
  * Used for ETL & data analysis on terascale datasets, especially social network data (on [[http://api.infochimps.com|api.infochimps.com]])

 * [[http://www.iterend.com/|Iterend]]
  * Using a 10-node HDFS cluster to store and process retrieved data.

= J =
 * [[http://joost.com|Joost]]
  * Session analysis and report generation

 * [[http://www.journeydynamics.com|Journey Dynamics]]
  * Using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product.

= K =
 * [[http://www.kalooga.com/|Kalooga]] - Kalooga is a discovery service for image galleries.
  * Uses Hadoop, HBase, Chukwa and Pig on a 20-node cluster for crawling, analysis and events processing.

 * [[http://www.arabaoyunlarimiz.gen.tr/araba-oyunlari/|Araba oyunları]] - Car games site.

 * [[http://katta.wiki.sourceforge.net/|Katta]] - Katta serves large Lucene indexes in a grid environment.
  * Uses Hadoop FileSystem, RPC and IO

 * [[http://www.koubei.com/|Koubei.com]] - Large local community and local search in China.
  . Using Hadoop to process Apache logs, analyzing users' actions and click flow, the links clicked from any specified page in the site, and more. Also using Hadoop to process the complete price data that users input, with MapReduce.

 * [[http://krugle.com/|Krugle]]
  * Source code search engine; uses Hadoop and Nutch.

= L =
 * [[http://clic.cimec.unitn.it/|Language, Interaction and Computation Laboratory (Clic - CIMeC)]]
  * Hardware: 10 nodes, each node has 8 cores and 8 GB of RAM
  * Studying verbal and non-verbal communication.

 * [[http://www.last.fm|Last.fm]]
  * 44 nodes
  * Dual quad-core Xeon L5520 (Nehalem) @ 2.27 GHz, 16 GB RAM, 4 TB/node storage.
  * Used for chart calculation, log analysis, A/B testing

@@ -421, +351 @@

  * Some Hive, but mainly automated Java MapReduce jobs that process ~150MM new events/day.

 * [[https://lbg.unc.edu|Lineberger Comprehensive Cancer Center - Bioinformatics Group]] This is the cancer center at UNC Chapel Hill. We are using Hadoop/HBase for databasing and analyzing Next Generation Sequencing (NGS) data produced for the [[http://cancergenome.nih.gov/|Cancer Genome Atlas]] (TCGA) project and other groups.
This development is based on the [[http://seqware.sf.net|SeqWare]] open source project, which includes SeqWare Query Engine, a database and web service built on top of HBase that stores sequence data types. Our prototype cluster includes:
  * 8 dual quad-core nodes running CentOS
  * A total of 48 TB of HDFS storage
  * HBase & Hadoop version 0.20

@@ -429, +358 @@

 * [[http://www.legolas-media.com|Legolas Media]]

 * [[http://www.linkedin.com|LinkedIn]]
  * We have multiple grids divided up based upon purpose.
  * Hardware:
   * 120 Nehalem-based Sun x4275, with 2x4 cores, 24 GB RAM, 8x1 TB SATA

@@ -445, +373 @@

  * We use these things for discovering People You May Know and [[http://www.linkedin.com/careerexplorer/dashboard|other]] [[http://inmaps.linkedinlabs.com/|fun]] [[http://www.linkedin.com/skills/|facts]].

 * [[http://www.lookery.com|Lookery]]
  * We use Hadoop to process clickstream and demographic data in order to create web analytics reports.
  * Our cluster runs across Amazon's EC2 web service and makes use of the streaming module to use Python for most operations.

 * [[http://www.lotame.com|Lotame]]
  * Using Hadoop and HBase for storage, log analysis, and pattern discovery/analysis.

= M =
 * [[http://www.markt24.de/|Markt24]]
  * We use Hadoop to filter user behaviour, recommendations and trends from external sites.
  * Using zkpython
  * Used EC2, now using many small machines (8 GB RAM, 4 cores, 1 TB)

 * [[http://www.crmcs.com//|MicroCode]]
  * 18-node cluster (Quad-Core Intel Xeon, 1 TB/node storage)
  * Financial data for search and aggregation
  * Customer Relation Management data for search and aggregation

 * [[http://www.media6degrees.com//|Media 6 Degrees]]
  * 20-node cluster (dual quad cores, 16 GB, 6 TB)
  * Used for log processing, data analysis and machine learning.
  * Focus is on social graph analysis and ad optimization.
  * Use a mix of Java, Pig and Hive.

 * [[http://www.mercadolibre.com//|Mercadolibre.com]]
  * 20-node cluster (12 * 20 cores, 32 GB, 53.3 TB)
  * Customer logs from on-line apps
  * Operations log processing
  * Use Java, Pig, Hive, Oozie

 * [[http://www.mobileanalytics.tv//|MobileAnalytic.TV]]
  * We use Hadoop to develop MapReduce algorithms:
   * Information retrieval and analytics
   * Machine-generated content - documents, text, audio & video
   * Natural language processing

@@ -497, +417 @@

  * 2-node cluster (Windows Vista/CYGWIN & CentOS) for developing MapReduce programs.

 * [[http://www.mylife.com/|MyLife]]
  * 18-node cluster (Quad-Core AMD Opteron 2347, 1 TB/node storage)
  * Powers data for search and aggregation

@@ -505, +424 @@

= N =
 * [[http://www.navteqmedia.com|NAVTEQ Media Solutions]]
  * We use Hadoop/Mahout to process user interactions with advertisements to optimize ad selection.

 * [[http://www.openneptune.com|Neptune]]
  * Another Bigtable cloning project using Hadoop to store large structured data sets.
  * 200 nodes (each node has: 2 dual-core CPUs, 2 TB storage, 4 GB RAM)

 * [[http://www.netseer.com|NetSeer]]
  * Up to 1000 instances on [[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon EC2]]
  * Data storage in [[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon S3]]
  * 50-node cluster in colo
  * Used for crawling, processing, serving and log analysis

 * [[http://nytimes.com|The New York Times]]
  * [[http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/|Large-scale image conversions]]
  * Used EC2 to run Hadoop on a large virtual cluster

 * [[http://www.ning.com|Ning]]
  * We use Hadoop to store and process our log files.
  * We rely on Apache Pig for reporting and analytics, Cascading for machine learning, and a proprietary JavaScript API for ad-hoc queries.
  * We use commodity hardware, with 8 cores and 16 GB of RAM per machine.

= O =
 * [[http://www.optivo.com|optivo]] - Email marketing software
  * We use Hadoop to aggregate and analyse email campaigns and user interactions.
  * Development is based on the GitHub repository.

= P =
 * [[http://papertrailapp.com/|Papertrail]] - Hosted syslog and app log management
  * The hosted syslog and app log service can feed customer logs into Hadoop for their analysis (usually with [[help.papertrailapp.com/kb/analytics/log-analytics-with-hadoop-and-hive|Hive]]).
  * Most customers load gzipped TSVs from S3 (which are uploaded nightly) into Amazon Elastic MapReduce.

 * [[http://parc.com|PARC]] - Used Hadoop to analyze Wikipedia conflicts ([[http://asc.parc.googlepages.com/2007-10-28-VAST2007-RevertGraph-Wiki.pdf|paper]]).

 * [[http://www.performable.com/|Performable]] - Web analytics software
  * We use Hadoop to process web clickstream, marketing, CRM & email data in order to create multi-channel analytic reports.
  * Our cluster runs on Amazon's EC2 web service and makes use of Python for most of our codebase.

 * [[http://pharm2phork.org|Pharm2Phork Project]] - Agricultural traceability
  * Using Hadoop on EC2 to process observation messages generated by RFID/barcode readers as items move through the supply chain.
  * Analysis of BPEL-generated log files for monitoring and tuning of workflow processes.

 * [[http://www.powerset.com|Powerset / Microsoft]] - Natural Language Search
  * Up to 400 instances on [[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon EC2]]
  * Data storage in [[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon S3]]
  * Microsoft is now contributing to HBase, a Hadoop subproject ([[http://port25.technet.com/archive/2008/10/14/microsoft-s-powerset-team-resumes-hbase-contributions.aspx|announcement]]).

 * [[http://pressflip.com|Pressflip]] - Personalized persistent search
  * Using Hadoop on EC2 to process documents from a continuous web crawl and for distributed training of support vector machines
  * Using HDFS for large archival data storage

 * [[http://www.pronux.ch|Pronux]]
  * 4-node cluster (32 cores, 1 TB).
  * We use Hadoop for searching and analysis of millions of bookkeeping postings.
  * Also used as a proof-of-concept cluster for a cloud-based ERP system.

 * [[http://www.pokertablestats.com/|PokerTableStats]]
  * 2-node cluster (16 cores, 500 GB).
  * We use Hadoop for analyzing poker players' game histories and generating gameplay-related player statistics.

 * [[http://www.portabilite.info|Portabilité]]
  * 50-node cluster in colo.
  * Also used as a proof-of-concept cluster for a cloud-based ERP system.

 * [[http://www.psgtech.edu/|PSG Tech, Coimbatore, India]]
  * Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm, coupled with the data and compute parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. The scalable nature of Hadoop makes it apt for solving large-scale alignment problems.
  * Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad Core Rack Servers, with 2x6 MB cache and 4 x 500 GB SATA hard drives, to E7200/E7400 processors with 4 GB RAM and 160 GB HDD.

= Q =
 * [[http://www.quantcast.com/|Quantcast]]
  * 3000 cores, 3500 TB. 1 PB+ processed each day.
  * Hadoop scheduler with fully custom data path/sorter
  * Significant contributions to the KFS filesystem

= R =
 * [[http://www.rackspace.com/email_hosting/|Rackspace]]
  * 30-node cluster (dual-core, 4-8 GB RAM, 1.5 TB/node storage)
  * Parses and indexes logs from the email hosting system for search: http://blog.racklabs.com/?p=66

 * [[http://www.rakuten.co.jp/|Rakuten]] - Japan's online shopping mall
  * 69-node cluster
  * We use Hadoop to analyze logs and mine data for our recommender system and so on.

 * [[http://www.rapleaf.com/|Rapleaf]]
  * 80-node cluster (each node has: 2 quad-core CPUs, 4 TB storage, 16 GB RAM)
  * We use Hadoop to process data relating to people on the web.
  * We are also involved with Cascading to help simplify how our data flows through various processing stages.

 * [[http://www.recruit.jp/corporate/english/|Recruit]]
  * Hardware: 50 nodes (2*4 CPU, 2 TB*4 disk, 16 GB RAM each)
  * We use Hadoop (Hive) to analyze logs and mine data for recommendations (see the Hive sketch below).

 * [[http://www.reisevision.com/|reisevision]]
  * We use Hadoop for our internal search.

 * [[http://code.google.com/p/redpoll/|Redpoll]]
  * Hardware: 35 nodes (2*4 CPU, 10 TB disk, 16 GB RAM each)
  * We intend to parallelize some traditional classification and clustering algorithms, like Naive Bayes, K-Means and EM, so that they can deal with large-scale data sets.

 * [[http://resu.me/|Resu.me]]
  * Hardware: 5 nodes
  * We use Hadoop to process user resume data and run algorithms for our recommendation engine.

 * [[http://www.rightnow.com/|RightNow Technologies]] - Powering Great Experiences
  * 16-node cluster (each node has: 2 quad-core CPUs, 6 TB storage, 24 GB RAM)
  * We use Hadoop for log and usage analysis.
  * We predominantly leverage Hive and HUE for data access.

= S =
 * [[http://www.sara.nl/news/recent/20101103/Hadoop_proof-of-concept.html|SARA, Netherlands]]
  * SARA has initiated a Proof-of-Concept project to evaluate the Hadoop software stack for scientific use.

 * [[http://alpha.search.wikia.com|Search Wikia]]
  * A project to help develop open source social search tools. We run a 125-node Hadoop cluster.
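Hive turns up again and again on this page as the SQL-like layer used for log analysis (Facebook, Recruit and RightNow Technologies above; Twitter below). A minimal sketch of driving such a query from a script is shown here; the access_logs table and its columns are placeholders, and it assumes the Hive CLI is installed and on the PATH (`hive -e` executes a single HiveQL string and prints tab-separated rows).

{{{#!python
#!/usr/bin/env python
# Sketch: run a daily log aggregation through the Hive CLI and print the
# result.  Table and column names are invented for the example.
import subprocess

query = """
SELECT request_date, COUNT(*) AS views
FROM access_logs
WHERE request_date = '2011-10-16'
GROUP BY request_date
"""

output = subprocess.check_output(['hive', '-e', query])
for line in output.splitlines():
    print line
}}}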
 * [[http://wwwse.inf.tu-dresden.de/SEDNS/SEDNS_home.html|SEDNS]] - Security Enhanced DNS Group
  * We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing Hadoop DFS and MapReduce.

 * [[http://www.sematext.com/|Sematext International]]
  * We use Hadoop to store and analyze large amounts of search and performance data for our [[http://www.sematext.com/search-analytics/index.html|Search Analytics]] and [[http://www.sematext.com/spm/index.html|Scalable Performance Monitoring]] services.

 * [[http://www.slcsecurity.com/|SLC Security Services LLC]]
  * 18-node cluster (each node has: 4 dual-core CPUs, 1 TB storage, 4 GB RAM, Red Hat OS)
  * We use Hadoop for our high-speed data mining applications.

 * [[http://www.slingmedia.com/|Sling Media]]
  * We have a core analytics group that is using a 10-node cluster running Red Hat OS.
  * Hadoop is used as an infrastructure to run MapReduce (MR) algorithms on a number of raw data sets.
  * Raw data ingest happens hourly. Raw data comes from hardware and software systems out in the field.

@@ -667, +556 @@

  * Plan to implement Mahout to build a recommendation engine.

 * [[http://www.socialmedia.com/|Socialmedia.com]]
  * 14-node cluster (each node has: 2 dual-core CPUs, 2 TB storage, 8 GB RAM)
  * We use Hadoop to process log data and perform on-demand analytics.

 * [[http://www.spadac.com/|Spadac.com]]
  * We are developing the MrGeo (MapReduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.
  * We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.
  * MrGeo is soon to be open sourced as well.

 * [[http://www.specificmedia.com|Specific Media]]
  * We use Hadoop for log aggregation, reporting and analysis.
  * Two Hadoop clusters, all nodes 16 cores, 32 GB RAM
  * Cluster 1: 27 nodes (total 432 cores, 544 GB RAM, 280 TB storage)

@@ -686, +572 @@

  * We contribute to Hadoop and related projects where possible; see http://code.google.com/p/bigstreams/ and http://code.google.com/p/hadoop-gpl-packing/

 * [[http://stampedehost.com/|Stampede Data Solutions (Stampedehost.com)]]
  * Hosted Hadoop data warehouse solution provider

 * [[http://www.stumbleupon.com/|StumbleUpon (StumbleUpon.com)]]
  * We use HBase to store our recommendation information and to run other operations. We have HBase committers on staff.

= T =
 * [[http://www.taragana.com|Taragana]] - Web 2.0 product development and outsourcing services
  * We are using 16 consumer-grade computers to create the cluster, connected by a 100 Mbps network.
  * Used for testing ideas for blog and other data mining.

 * [[http://www.textmap.com/|The Lydia News Analysis Project]] - Stony Brook University
  * We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.

 * [[http://www.tailsweep.com/|Tailsweep]] - Ad network for blogs and social media
  * 8-node cluster (Xeon Quad Core 2.4 GHz, 8 GB RAM, 500 GB/node RAID 1 storage)
  * Used as a proof-of-concept cluster
  * Handling, e.g., data mining and blog crawling
 * [[http://www.thestocksprofit.com/|Technical analysis and Stock Research]]
  * Generating stock analysis on 23 nodes (dual 2.4 GHz Xeon, 2 GB RAM, 36 GB hard drive)

 * [[http://www.tegataiphoenix.com/|Tegatai]]
  * Collection and analysis of log, threat, and risk data and other security information on 32 nodes (8-core Opteron 6128 CPU, 32 GB RAM, 12 TB storage per node)

 * [[http://www.tid.es/about-us/research-groups/|Telefonica Research]]
  * We use Hadoop in our data mining and user modeling, multimedia, and internet research groups.
  * 6-node cluster with 96 total cores, 8 GB RAM and 2 TB storage per machine.

 * [[http://www.telenav.com/|Telenav]]
  * 60-node cluster for our location-based content processing, including machine learning algorithms for statistical categorization, deduping, aggregation & curation (hardware: 2.5 GHz quad-core Xeon, 4 GB RAM, 13 TB HDFS storage).
  * Private cloud for rapid server-farm setup for stage and test environments (using an elastic N-node cluster).
  * Public cloud for exploratory projects that require rapid servers for scalability and computing surges (using an elastic N-node cluster).

 * [[http://www.tianya.cn/|Tianya]]
  * We use Hadoop for log analysis.

+ * [[http://www.tubemogul.com|TubeMogul]]
+  * We use Hadoop HDFS, MapReduce, Hive and HBase.
+  * We manage over 300 TB of HDFS data across four Amazon EC2 Availability Zones.

 * [[http://www.tufee.de/|tufee]]
  * We use Hadoop for searching and indexing.

 * [[http://www.twitter.com|Twitter]]
  * We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.
  * We use both Scala and Java to access Hadoop's MapReduce APIs.
  * We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.

@@ -745, +624 @@

  * For more on our use of Hadoop, see the following presentations: [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|Hadoop and Pig at Twitter]] and [[http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter|Protocol Buffers and Hadoop at Twitter]]

 * [[http://tynt.com|Tynt]]
  * We use Hadoop to assemble web publishers' summaries of what users are copying from their websites, and to analyze user engagement on the web.
  * We use Pig and custom Java MapReduce code, as well as Chukwa.
  * We have 94 nodes (752 cores) in our clusters, as of July 2010, but the number grows regularly.

= U =
 * [[http://glud.udistrital.edu.co|Universidad Distrital Francisco Jose de Caldas (Grupo GICOGE/Grupo Linux UD GLUD/Grupo GIGA)]]
  . 5-node low-profile cluster. We use Hadoop to support the research project: Territorial Intelligence System of Bogota City.

 * [[http://ir.dcs.gla.ac.uk/terrier/|University of Glasgow - Terrier Team]]
  * 30-node cluster (Xeon Quad Core 2.4 GHz, 4 GB RAM, 1 TB/node storage). We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of [[http://ir.dcs.gla.ac.uk/terrier/|Terrier]] includes large-scale distributed indexing using Hadoop MapReduce.

 * [[http://www.umiacs.umd.edu/~jimmylin/cloud-computing/index.html|University of Maryland]]
  . We are one of six universities participating in IBM/Google's academic cloud computing initiative.
Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.

 * [[http://hcc.unl.edu|University of Nebraska Lincoln, Holland Computing Center]]
  . We currently run one medium-sized Hadoop cluster (1.6 PB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.

 * [[http://dbis.informatik.uni-freiburg.de/index.php?project=DiPoS|University of Freiburg - Databases and Information Systems]]
  . 10-node cluster (Dell PowerEdge R200 with Xeon Dual Core 3.16 GHz, 4 GB RAM, 3 TB/node storage).
  . Our goal is to develop techniques for the Semantic Web that take advantage of MapReduce (Hadoop) and its scaling behavior to keep up with the growing proliferation of semantic data.
   * [[http://dbis.informatik.uni-freiburg.de/?project=DiPoS/RDFPath.html|RDFPath]] is an expressive RDF path language for querying large RDF graphs with MapReduce.

@@ -777, +650 @@

= V =
 * [[http://www.veoh.com|Veoh]]
  * We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.

 * [[http://www.vibyggerhus.se/|Bygga hus]]
  * We use a Hadoop cluster for search and indexing for our projects.

 * [[http://www.visiblemeasures.com|Visible Measures Corporation]] uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers !VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.

 * [[http://www.vksolutions.com/|VK Solutions]]
  * We use a small Hadoop cluster in the scope of our general research activities at [[http://www.vklabs.com|VK Labs]] to get faster data access from web applications.
  * We also use Hadoop for filtering and indexing listings, processing log analysis, and for recommendation data.

= W =
 * [[http://www.web-alliance.fr|Web Alliance]]
  * We use Hadoop for our internal search engine optimization (SEO) tools. It allows us to store, index, and search data in a much faster way.
  * We also use it for log analysis and trend prediction.

 * [[http://www.worldlingo.com/|WorldLingo]]
  * Hardware: 44 servers (each server has: 2 dual-core CPUs, 2 TB storage, 8 GB RAM)
  * Each server runs Xen with one Hadoop/HBase instance and another instance with web or application servers, giving us 88 usable virtual machines.
  * We run two separate Hadoop/HBase clusters with 22 nodes each.

@@ -808, +676 @@

= X =

= Y =
 * [[http://www.yahoo.com/|Yahoo!]]
  * More than 100,000 CPUs in >40,000 computers running Hadoop
  * Our biggest cluster: 4500 nodes (2*4-CPU boxes with 4*1 TB disk & 16 GB RAM)
   * Used to support research for ad systems and web search
   * Also used to do scaling tests to support development of Hadoop on larger clusters
  * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about how we use Hadoop.
@@ -819, +685 @@

= Z =
 * [[http://www.zvents.com/|Zvents]]
  * 10-node cluster (Dual-Core AMD Opteron 2210, 4 GB RAM, 1 TB/node storage)
  * Run Naive Bayes classifiers in parallel over crawl data to discover event information