From: Apache Wiki
To: Apache Wiki
Date: Sun, 16 Oct 2011 06:47:58 -0000
Subject: [Hadoop Wiki] Update of "PoweredBy" by nicolas.brousse

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "PoweredBy" page has been changed by nicolas.brousse:
http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev1=355&rev2=356

Comment:
Add TubeMogul, Inc.

= A =
 * [[http://a9.com/|A9.com]] - Amazon*
  * We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools (see the streaming sketch below).
  * We process millions of sessions daily for analytics, using both the Java and streaming APIs.
  * Our clusters vary from 1 to 100 nodes.

 * [[http://www.accelacommunications.com|Accela Communications]]
  * We use a Hadoop cluster to roll up registration and view data each night.
  * Our cluster has 10 1U servers, with 4 cores, 4 GB RAM and 3 drives.
  * Each night, we run 112 Hadoop jobs.
  * It is roughly 4x faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups, and then import them back into the databases than to perform the same rollups directly in the database.

 * [[http://www.adobe.com|Adobe]]
  * We use Hadoop and HBase in several areas, from social services to structured data storage and processing for internal use.
  * We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, in both production and development. We plan a deployment on an 80-node cluster.
  * We constantly write data to HBase and run MapReduce jobs to process it, then store it back to HBase or external systems.
  * Our production cluster has been running since October 2008.
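Many entries on this page, starting with A9.com above, drive Hadoop through the streaming API with scripting-language tools. The sketch below is a generic illustration of that pattern, not any listed company's actual code: the tab-separated input layout, the HDFS paths and the session-counting task are all invented for the example.

{{{#!python
#!/usr/bin/env python
# Minimal Hadoop Streaming job: count occurrences of a key (here, a session
# id assumed to sit in the first tab-separated field of each log line).
# A submission command would look roughly like:
#   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
#     -input /logs/2011-10-16 -output /reports/2011-10-16 \
#     -mapper 'sessions.py map' -reducer 'sessions.py reduce' -file sessions.py
import sys

def mapper():
    # Emit "<session_id>\t1" for every input record.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print "%s\t1" % fields[0]

def reducer():
    # Streaming sorts mapper output by key, so identical keys arrive
    # together; sum the counts for each run of equal keys.
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print "%s\t%d" % (current, total)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
}}}

Streaming keeps job orchestration in Hadoop while the per-record work stays in whatever language a team already uses, which is why it appears so often in the entries on this page.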
 * [[http://www.adyard.de|adyard]]
  * We use Flume, Hadoop and Pig for log storage and report generation as well as ad targeting.
  * We currently have 12 nodes running HDFS and Pig and plan to add more from time to time.
  * 50% of our recommender system is pure Pig because of its ease of use.
  * Some of our more deeply integrated tasks use the streaming API and Ruby, as well as the excellent Wukong library.

 * [[http://www.ablegrape.com/|Able Grape]] - Vertical search engine for trustworthy wine information
  * We have one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node).
  * Hadoop and Nutch are used to analyze and index textual information.

 * [[http://adknowledge.com/|Adknowledge]] - Ad network
  * Hadoop is used to build the recommender system for behavioral targeting, plus other clickstream analytics.
  * We handle 500MM clickstream events per day.
  * Our clusters vary from 50 to 200 nodes, mostly on EC2.
  * Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.

 * [[http://www.aguja.de|Aguja]] - E-commerce data analysis
  * We use Hadoop, Pig and HBase to analyze search logs and product view data, and to analyze all of our logs.
  * 3-node cluster with 48 cores in total, 4 GB RAM and 1 TB storage each.

 * [[http://china.alibaba.com/|Alibaba]]
  * A 15-node cluster dedicated to processing sorts of business data dumped out of databases and joining them together. These data are then fed into iSearch, our vertical search engine.
  * Each node has 8 cores, 16 GB RAM and 1.4 TB storage.

 * [[http://aol.com/|AOL]]
  * We use Hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
  * The cluster that we use mainly for behavioral analysis and targeting has 150 machines: Intel Xeon, dual-processor, dual-core, each with 16 GB RAM and an 800 GB hard disk.

 * [[http://www.ara.com.tr/|ARA.COM.TR]] - Ara Com Tr - Turkey's first and only search engine
  * We build the Ara.com.tr search engine using the Python tools.
  * We use Hadoop for analytics.
  * We handle about 400 TB per month.
  * Our clusters vary from 10 to 100 nodes.

 * [[http://atbrox.com/|Atbrox]]
  * We use Hadoop for information extraction & search, and data analysis consulting.
  * Cluster: we primarily use Amazon's Elastic MapReduce (see the job-flow sketch below).

 * [[http://www.ABC-Online-Shops.de/|ABC Online Shops]]
  * Shop the Internet search engine

 * [[http://www.aflam-online.com/|افلام اون لاين]]

@@ -88, +76 @@

= B =
 * [[http://www.babacar.org/|BabaCar]]
  * 4-node cluster (32 cores, 1 TB).
  * We use Hadoop for searching and analysis of millions of rental bookings.
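Several of the entries above (Adknowledge, Atbrox) run their Hadoop jobs on Amazon EC2 or Elastic MapReduce rather than on their own hardware. Below is a hedged sketch of submitting a single streaming step to Elastic MapReduce with the boto library; the bucket names, script locations and instance sizing are placeholders, and AWS credentials are assumed to be available from the environment or the boto configuration file.

{{{#!python
#!/usr/bin/env python
# Sketch: launch a one-step streaming job flow on Amazon Elastic MapReduce.
# All S3 paths and sizing below are illustrative, not taken from this page.
import boto
from boto.emr.step import StreamingStep

conn = boto.connect_emr()  # credentials come from the environment/boto config

step = StreamingStep(
    name='nightly session counts',
    mapper='s3n://example-bucket/code/sessions.py map',
    reducer='s3n://example-bucket/code/sessions.py reduce',
    input='s3n://example-bucket/logs/2011-10-16/',
    output='s3n://example-bucket/reports/2011-10-16/')

jobflow_id = conn.run_jobflow(
    name='nightly-report',
    log_uri='s3n://example-bucket/emr-logs/',
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
    num_instances=4,
    steps=[step])

print 'Started job flow', jobflow_id
}}}

The appeal, as the EC2-based entries here suggest, is that the cluster typically only exists (and is only billed) while the job flow runs.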
 * [[http://www.backdocsearch.com|backdocsearch.com]] - search engine for chiropractic information, local chiropractors, products and schools

 * [[http://www.baidu.cn|Baidu]] - the leading Chinese-language search engine
  * Hadoop is used to analyze search logs and do some mining work on the web page database.
  * We handle about 3000 TB per week.
  * Our clusters vary from 10 to 500 nodes.
  * Hypertable is also supported by Baidu.

 * [[http://www.beebler.com|Beebler]]
  * 14-node cluster (each node has: 2 dual-core CPUs, 2 TB storage, 8 GB RAM)
  * We use Hadoop for matching dating profiles.

 * [[http://www.benipaltechnologies.com|Benipal Technologies]] - Outsourcing, Consulting, Innovation
  * 35-node cluster (Core2Quad Q9400 processor, 4-8 GB RAM, 500 GB HDD)
  * Largest data node with 2x Xeon E5420 processors, 64 GB RAM, 3.5 TB HDD
  * Total cluster capacity of around 20 TB on a gigabit network with failover and redundancy
  * Hadoop is used for internal data crunching, application development, testing and getting around I/O limitations.

 * [[http://bixolabs.com/|Bixo Labs]] - Elastic web mining
  * The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
  * We're doing a 200M-page/5 TB crawl as part of the [[http://bixolabs.com/datasets/public-terabyte-dataset-project/|public terabyte dataset project]].
  * This runs as a 20-machine [[http://aws.amazon.com/elasticmapreduce/|Elastic MapReduce]] cluster.

 * [[http://www.brainpad.co.jp|BrainPad]] - Data mining and analysis
  * We use Hadoop to summarize users' tracking data.
  * We also use it for analysis.

= C =
 * [[http://caree.rs/|Caree.rs]]
  * Hardware: 15 nodes
  * We use Hadoop to process company and job data and run machine learning algorithms for our recommendation engine.

 * [[http://www.cdunow.de/|CDU now!]]
  * We use Hadoop for our internal searching, filtering and indexing.

 * [[http://www.charlestontraveler.com/|Charleston]]
  * Hardware: 15 nodes
  * We use Hadoop to process company and job data and run machine learning algorithms for our recommendation engine.

 * [[http://www.cloudspace.com/|Cloudspace]]
  * Used on client projects and internal log reporting/parsing systems designed to scale to infinity and beyond.
  * Client project: Amazon S3-backed, web-wide analytics platform (see the S3 sketch below).
  * Internal: cross-architecture event log aggregation & processing

 * [[http://www.contextweb.com/|Contextweb]] - Ad exchange
  * We use Hadoop to store ad serving logs and use it as a source for ad optimizations, analytics, reporting and machine learning.
  * Currently we have a 50-machine cluster with 400 cores and about 140 TB raw storage. Each (commodity) node has 8 cores and 16 GB of RAM.

 * [[http://www.cooliris.com|Cooliris]] - Cooliris transforms your browser into a lightning-fast, cinematic way to browse photos and videos, both online and on your hard drive.
  * We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB RAM, and 3-4 TB of storage.
  * We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
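Amazon S3 appears throughout this list as the durable store that feeds Hadoop and receives its results (Cloudspace above; IMVU, NetSeer, Powerset and Papertrail further down). The snippet below is a minimal sketch of that staging pattern using boto; the bucket and key names are invented for illustration, and credentials are again assumed to come from the environment or boto configuration.

{{{#!python
#!/usr/bin/env python
# Sketch: pull one day of gzipped logs out of S3 for processing, then push
# the finished report back.  Bucket and key names are placeholders.
import os
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('example-analytics-logs')

# Download every log object for the day into a local scratch directory.
for key in bucket.list(prefix='raw/2011-10-16/'):
    key.get_contents_to_filename(os.path.join('/tmp', os.path.basename(key.name)))

# Once a Hadoop job has produced a summary, publish it back to S3.
report = bucket.new_key('reports/2011-10-16/summary.tsv')
report.set_contents_from_filename('/tmp/summary.tsv')
}}}

Jobs running on Elastic MapReduce can also read s3n:// paths directly, as in the job-flow sketch above, skipping the local staging step.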
 * [[http://www.weblab.infosci.cornell.edu/|Cornell University Web Lab]]
  * Generating web graphs on 100 nodes (dual 2.4 GHz Xeon processor, 2 GB RAM, 72 GB hard drive)

 * [[http://www.crs4.it|CRS4]]
  * [[http://dx.doi.org/10.1109/ICPPW.2009.37|Computational biology applications]]
  * [[http://www.springerlink.com/content/np5u8k1x9l6u755g|HDFS as a VM repository for virtual clusters]]

 * [[http://crowdmedia.de/|crowdmedia]]
  * crowdmedia has a 5-node Hadoop cluster for statistical analysis.
  * We use Hadoop to analyse trends on Facebook and other social networks.

= D =
 * [[http://datagraph.org/|Datagraph]]
  * We use Hadoop for batch-processing large [[http://www.w3.org/RDF/|RDF]] datasets, in particular for indexing RDF data.
  * We also use Hadoop for executing long-running offline [[http://en.wikipedia.org/wiki/SPARQL|SPARQL]] queries for clients.
  * We use Amazon S3 and Cassandra to store input RDF datasets and output files.

@@ -180, +152 @@

  * We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).

 * [[http://www.deepdyve.com|Deepdyve]]
  * Elastic cluster with 5-80 nodes
  * We use Hadoop to create our indexes of deep web content and to provide a high-availability and high-bandwidth storage service for index shards for our search cluster.

 * [[http://www.wirtschaftsdetektei-berlin.de|Detektei Berlin]]
  * We are using Hadoop in our data mining and multimedia/internet research groups.
  * 3-node cluster with 48 cores in total, 4 GB RAM and 1 TB storage each.

 * [[http://search.detik.com|Detikcom]] - Indonesia's largest news portal
  * We use Hadoop, Pig and HBase to analyze search logs, generate Most Viewed News, generate top word clouds, and analyze all of our logs.
  * Currently we use 9 nodes.

 * [[http://www.dropfire.com|DropFire]]
  * We generate Pig Latin scripts that describe structural and semantic conversions between data contexts.
  * We use Hadoop to execute these scripts for production-level deployments.
  * Eliminates the need for explicit data and schema mappings during database integration.

= E =
 * [[http://www.ebay.com|EBay]]
  * 532-node cluster (8 * 532 cores, 5.3 PB).
  * Heavy usage of Java MapReduce, Pig, Hive, HBase
  * Using it for search optimization and research.

 * [[http://www.enet.gr|Enet]], 'Eleftherotypia' newspaper, Greece
  * Experimental installation - storage for logs and digital assets
  * Currently a 5-node cluster
  * Using Hadoop for log analysis/data mining/machine learning

 * [[http://www.enormo.com/|Enormo]]
  * 4-node cluster (32 cores, 1 TB).
  * We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
  * We plan to use Pig very shortly to produce statistics.

 * [[http://blog.espol.edu.ec/hadoop/|ESPOL University (Escuela Superior Politécnica del Litoral) in Guayaquil, Ecuador]]
  * 4-node proof-of-concept cluster.
  * We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
  * The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to its AWS in Education program.
 * [[http://www.systems.ethz.ch/education/courses/hs08/map-reduce/|ETH Zurich Systems Group]]
  * We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use cases from biological data analysis.
  * Cluster hardware: 16 x (quad-core Intel Xeon, 8 GB RAM, 1.5 TB hard disk)

 * [[http://www.eyealike.com/|Eyealike]] - Visual Media Search Platform
  * Facial similarity and recognition across large datasets.
  * Image-content-based advertising and auto-tagging for social media.
  * Image-based video copyright protection.

= F =
 * [[http://www.facebook.com/|Facebook]]
  * We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
  * Currently we have 2 major clusters:
   * A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

@@ -247, +208 @@

  * We are heavy users of both streaming and the Java APIs. We have built a higher-level data warehousing framework using these features, called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.

 * [[http://www.foxaudiencenetwork.com|FOX Audience Network]]
  * 40-machine cluster (8 cores/machine, 2 TB/machine storage)
  * 70-machine cluster (8 cores/machine, 3 TB/machine storage)
  * 30-machine cluster (8 cores/machine, 4 TB/machine storage)
  * Used for log analysis, data mining and machine learning

 * [[http://www.forward3d.co.uk|Forward3D]]
  * 5-machine cluster (8 cores/machine, 5 TB/machine storage)
  * Existing 19-virtual-machine cluster (2 cores/machine, 30 TB storage)
  * Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) using [[http://github.com/trafficbroker/mandy|our Ruby library]], or see the [[http://oobaloo.co.uk/articles/2010/1/12/mapreduce-with-hadoop-and-ruby.html|canonical WordCount example]].

@@ -264, +223 @@

  * Machine learning

 * [[http://freestylers.jp/|Freestylers]] - Image retrieval engine
  * We are a Japanese company using Hadoop to build the image-processing environment for our image-based product recommendation system, mainly on Amazon EC2, since April 2009.
  * Our Hadoop environment produces the original database for fast access from our web application.
  * We also use Hadoop to analyze similarities in users' behavior.

= G =
 * [[http://www.gis.tw/en|GIS.FCU]]
  * Feng Chia University
  * 3-machine cluster (4 cores, 1 TB/machine)
  * Storage for sensor data

 * [[http://www.google.com|Google]]
  * [[http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html|University Initiative to Address Internet-Scale Computing Challenges]]

 * [[http://www.gruter.com|Gruter. Corp.]]
  * 30-machine cluster (4 cores, 1-2 TB/machine storage)
  * Storage for blog data and web documents
  * Used for data indexing by MapReduce
  * Link analysis and machine learning by MapReduce

 * [[http://gumgum.com|GumGum]]
  * 9-node cluster (Amazon EC2 c1.xlarge)
  * Nightly MapReduce jobs on [[http://aws.amazon.com/elasticmapreduce/|Amazon Elastic MapReduce]] process data stored in S3
  * MapReduce jobs written in [[http://groovy.codehaus.org/|Groovy]] use Hadoop Java APIs

@@ -295, +249 @@

= H =
 * [[http://www.hadoop.co.kr/|Hadoop Korean User Group]], a Korean local community team page.
  * 50-node cluster in the Korea university network environment.
  * Pentium 4 PCs, HDFS 4 TB storage
  * Used for development projects

@@ -303, +256 @@

  * Latent Semantic Analysis, Collaborative Filtering

 * [[http://www.hotelsandaccommodation.com.au/|Hotels & Accommodation]]
  * 3-machine cluster (4 cores/machine, 2 TB/machine)
  * Hadoop for data search and aggregation
  * HBase hosting

 * [[http://www.hulu.com|Hulu]]
  * 13-machine cluster (8 cores/machine, 4 TB/machine)
  * Log storage and analysis
  * HBase hosting (see the HBase client sketch below)

 * [[http://www.hundeshagen.de|Hundeshagen]]
  * 6-node cluster (each node has: 4 dual-core CPUs, 1.5 TB storage, 4 GB RAM, Red Hat OS)
  * Using Hadoop for our high-speed data mining applications in cooperation with [[http://www.ehescheidung-jetzt.de|Online Scheidung]]

 * [[http://www.hadoop.tw/|Hadoop Taiwan User Group]]

 * [[http://net-ngo.com|Hipotecas y euribor]]
  * Evolution of the Euribor and its current value
  * Mortgage simulator for times of economic crisis

 * [[http://www.hostinghabitat.com/|Hosting Habitat]]
  * We use a customised version of Hadoop and Nutch in a currently experimental 6-node/dual-core cluster environment.
  * We crawl our clients' websites and, from the information we gather, fingerprint old and outdated software packages in that shared hosting environment. We can then inform our clients that they are running old or outdated software after matching a signature against a database. With that information we know which sites require patching, which we offer as a free courtesy service to protect the majority of users. Without the technologies of Nutch and Hadoop this would be a far harder task to accomplish.

= I =
 * [[http://www.ibm.com|IBM]]
  * [[http://www-03.ibm.com/press/us/en/pressrelease/22613.wss|Blue Cloud Computing Clusters]]
  * [[http://www-03.ibm.com/press/us/en/pressrelease/22414.wss|University Initiative to Address Internet-Scale Computing Challenges]]

 * [[http://www.iccs.informatics.ed.ac.uk/|ICCS]]
  * We are using Hadoop and Nutch to crawl blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.

 * [[http://search.iiit.ac.in/|IIIT, Hyderabad]]
  * We use Hadoop for information retrieval and extraction research projects. We are also working on MapReduce scheduling research for multi-job environments.
  * Our cluster sizes vary from 10 to 30 nodes, depending on the jobs. Heterogeneous nodes, with most being Quad 6600s with 4 GB RAM and 1 TB disk per node, plus some nodes with dual-core and single-core configurations.

 * [[http://www.imageshack.us/|ImageShack]]
  * From [[http://www.techcrunch.com/2008/05/20/update-imageshack-ceo-hints-at-his-grander-ambitions/|TechCrunch]]:
   . Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targeting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.
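A number of entries pair Hadoop with HBase for storing logs or serving precomputed data (Hotels & Accommodation and Hulu above; StumbleUpon and WorldLingo later on). As a hedged illustration of light-weight HBase access from a script, the sketch below uses the Thrift-based happybase client; the host, table, row-key scheme and column names are invented, and it assumes an HBase Thrift server is running.

{{{#!python
#!/usr/bin/env python
# Sketch: write one log event to HBase and scan a user's events for a day.
# Table layout and names are illustrative only.
import happybase

connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('view_logs')

# Store one event, keyed "<user>-<timestamp>" so per-user scans stay cheap.
table.put('user123-20111016T064758', {'log:url': 'http://example.com/a',
                                      'log:status': '200'})

# Scan everything recorded for user123 on 2011-10-16.
for row_key, data in table.scan(row_start='user123-20111016',
                                row_stop='user123-20111017'):
    print row_key, data
}}}

Row keys in HBase are stored sorted, so choosing a key that starts with the natural scan prefix (here the user id) is what makes such range scans efficient.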
 * [[http://www.imvu.com/|IMVU]]
  * We use Hadoop to analyze our virtual economy.
  * We also use Hive to access our trove of operational data to inform product development decisions around improving user experience and retention as well as meeting revenue targets.
  * Our data is stored in S3 and pulled into our clusters of up to 4 m1.large EC2 instances. Our total data volume is on the order of 5 TB.

 * [[http://www.infolinks.com/|Infolinks]]
  * We use Hadoop to analyze production logs and to provide various statistics on our in-text advertising network.
  * We also use Hadoop/HBase to process user interactions with advertisements and to optimize ad selection.

 * [[http://www.isi.edu/|Information Sciences Institute (ISI)]]
  * Used Hadoop and 18 nodes/52 cores to [[http://www.isi.edu/ant/address/whole_internet/|plot the entire internet]].

 * [[http://infochimps.org|Infochimps]]
  * 30-node AWS EC2 cluster (varying instance size, currently EBS-backed) managed by Chef & Poolparty, running Hadoop 0.20.2+228, Pig 0.5.0+30, Azkaban 0.04, [[http://github.com/infochimps/wukong|Wukong]]
  * Used for ETL & data analysis on terascale datasets, especially social network data (on [[http://api.infochimps.com|api.infochimps.com]])

 * [[http://www.iterend.com/|Iterend]]
  * Using a 10-node HDFS cluster to store and process retrieved data.

= J =
 * [[http://joost.com|Joost]]
  * Session analysis and report generation

 * [[http://www.journeydynamics.com|Journey Dynamics]]
  * Using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product.

= K =
 * [[http://www.kalooga.com/|Kalooga]] - Kalooga is a discovery service for image galleries.
  * Uses Hadoop, HBase, Chukwa and Pig on a 20-node cluster for crawling, analysis and events processing.

 * [[http://www.arabaoyunlarimiz.gen.tr/araba-oyunlari/|Araba oyunları]] - Car games site.

 * [[http://katta.wiki.sourceforge.net/|Katta]] - Katta serves large Lucene indexes in a grid environment.
  * Uses Hadoop FileSystem, RPC and IO

 * [[http://www.koubei.com/|Koubei.com]] - Large local community and local search in China.
  . Using Hadoop to process Apache logs, analyzing users' actions and click flow, the links clicked from any specified page in the site, and more. Also using Hadoop to process the complete price data that users input, with MapReduce.

 * [[http://krugle.com/|Krugle]]
  * Source code search engine; uses Hadoop and Nutch.

= L =
 * [[http://clic.cimec.unitn.it/|Language, Interaction and Computation Laboratory (Clic - CIMeC)]]
  * Hardware: 10 nodes, each node has 8 cores and 8 GB of RAM
  * Studying verbal and non-verbal communication.

 * [[http://www.last.fm|Last.fm]]
  * 44 nodes
  * Dual quad-core Xeon L5520 (Nehalem) @ 2.27 GHz, 16 GB RAM, 4 TB/node storage.
  * Used for chart calculation, log analysis, A/B testing

@@ -421, +351 @@

  * Some Hive, but mainly automated Java MapReduce jobs that process ~150MM new events/day.

 * [[https://lbg.unc.edu|Lineberger Comprehensive Cancer Center - Bioinformatics Group]] This is the cancer center at UNC Chapel Hill. We are using Hadoop/HBase for databasing and analyzing Next Generation Sequencing (NGS) data produced for the [[http://cancergenome.nih.gov/|Cancer Genome Atlas]] (TCGA) project and other groups.
This development is based on the [[http://seqware.sf.net|SeqWare]] open source project, which includes SeqWare Query Engine, a database and web service built on top of HBase that stores sequence data types. Our prototype cluster includes:
  * 8 dual quad-core nodes running CentOS
  * A total of 48 TB of HDFS storage
  * HBase & Hadoop version 0.20

@@ -429, +358 @@

 * [[http://www.legolas-media.com|Legolas Media]]

 * [[http://www.linkedin.com|LinkedIn]]
  * We have multiple grids divided up based upon purpose.
  * Hardware:
   * 120 Nehalem-based Sun x4275, with 2x4 cores, 24 GB RAM, 8x1 TB SATA

@@ -445, +373 @@

  * We use these things for discovering People You May Know and [[http://www.linkedin.com/careerexplorer/dashboard|other]] [[http://inmaps.linkedinlabs.com/|fun]] [[http://www.linkedin.com/skills/|facts]].

 * [[http://www.lookery.com|Lookery]]
  * We use Hadoop to process clickstream and demographic data in order to create web analytics reports.
  * Our cluster runs across Amazon's EC2 web service and makes use of the streaming module to use Python for most operations.

 * [[http://www.lotame.com|Lotame]]
  * Using Hadoop and HBase for storage, log analysis, and pattern discovery/analysis.

= M =
 * [[http://www.markt24.de/|Markt24]]
  * We use Hadoop to filter user behaviour, recommendations and trends from external sites.
  * Using zkpython
  * Used EC2, now using many small machines (8 GB RAM, 4 cores, 1 TB)

 * [[http://www.crmcs.com//|MicroCode]]
  * 18-node cluster (Quad-Core Intel Xeon, 1 TB/node storage)
  * Financial data for search and aggregation
  * Customer Relation Management data for search and aggregation

 * [[http://www.media6degrees.com//|Media 6 Degrees]]
  * 20-node cluster (dual quad cores, 16 GB, 6 TB)
  * Used for log processing, data analysis and machine learning.
  * Focus is on social graph analysis and ad optimization.
  * Use a mix of Java, Pig and Hive.

 * [[http://www.mercadolibre.com//|Mercadolibre.com]]
  * 20-node cluster (12 * 20 cores, 32 GB, 53.3 TB)
  * Customer logs from on-line apps
  * Operations log processing
  * Use Java, Pig, Hive, Oozie

 * [[http://www.mobileanalytics.tv//|MobileAnalytic.TV]]
  * We use Hadoop to develop MapReduce algorithms:
   * Information retrieval and analytics
   * Machine-generated content - documents, text, audio & video
   * Natural language processing

@@ -497, +417 @@

  * 2-node cluster (Windows Vista/CYGWIN & CentOS) for developing MapReduce programs.

 * [[http://www.mylife.com/|MyLife]]
  * 18-node cluster (Quad-Core AMD Opteron 2347, 1 TB/node storage)
  * Powers data for search and aggregation

@@ -505, +424 @@

= N =
 * [[http://www.navteqmedia.com|NAVTEQ Media Solutions]]
  * We use Hadoop/Mahout to process user interactions with advertisements to optimize ad selection.

 * [[http://www.openneptune.com|Neptune]]
  * Another Bigtable cloning project using Hadoop to store large structured data sets.
  * 200 nodes (each node has: 2 dual-core CPUs, 2 TB storage, 4 GB RAM)

 * [[http://www.netseer.com|NetSeer]]
  * Up to 1000 instances on [[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon EC2]]
  * Data storage in [[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon S3]]
  * 50-node cluster in colo
  * Used for crawling, processing, serving and log analysis

 * [[http://nytimes.com|The New York Times]]
  * [[http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/|Large-scale image conversions]]
  * Used EC2 to run Hadoop on a large virtual cluster

 * [[http://www.ning.com|Ning]]
  * We use Hadoop to store and process our log files.
  * We rely on Apache Pig for reporting and analytics, Cascading for machine learning, and a proprietary JavaScript API for ad-hoc queries.
  * We use commodity hardware, with 8 cores and 16 GB of RAM per machine.

= O =
 * [[http://www.optivo.com|optivo]] - Email marketing software
  * We use Hadoop to aggregate and analyse email campaigns and user interactions.
  * Development is based on the GitHub repository.

= P =
 * [[http://papertrailapp.com/|Papertrail]] - Hosted syslog and app log management
  * The hosted syslog and app log service can feed customer logs into Hadoop for their analysis (usually with [[help.papertrailapp.com/kb/analytics/log-analytics-with-hadoop-and-hive|Hive]]).
  * Most customers load gzipped TSVs from S3 (which are uploaded nightly) into Amazon Elastic MapReduce.

 * [[http://parc.com|PARC]] - Used Hadoop to analyze Wikipedia conflicts ([[http://asc.parc.googlepages.com/2007-10-28-VAST2007-RevertGraph-Wiki.pdf|paper]]).

 * [[http://www.performable.com/|Performable]] - Web analytics software
  * We use Hadoop to process web clickstream, marketing, CRM & email data in order to create multi-channel analytic reports.
  * Our cluster runs on Amazon's EC2 web service and makes use of Python for most of our codebase.

 * [[http://pharm2phork.org|Pharm2Phork Project]] - Agricultural traceability
  * Using Hadoop on EC2 to process observation messages generated by RFID/barcode readers as items move through the supply chain.
  * Analysis of BPEL-generated log files for monitoring and tuning of workflow processes.

 * [[http://www.powerset.com|Powerset / Microsoft]] - Natural Language Search
  * Up to 400 instances on [[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon EC2]]
  * Data storage in [[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon S3]]
  * Microsoft is now contributing to HBase, a Hadoop subproject ([[http://port25.technet.com/archive/2008/10/14/microsoft-s-powerset-team-resumes-hbase-contributions.aspx|announcement]]).

 * [[http://pressflip.com|Pressflip]] - Personalized persistent search
  * Using Hadoop on EC2 to process documents from a continuous web crawl and for distributed training of support vector machines
  * Using HDFS for large archival data storage

 * [[http://www.pronux.ch|Pronux]]
  * 4-node cluster (32 cores, 1 TB).
  * We use Hadoop for searching and analysis of millions of bookkeeping postings.
  * Also used as a proof-of-concept cluster for a cloud-based ERP system.

 * [[http://www.pokertablestats.com/|PokerTableStats]]
  * 2-node cluster (16 cores, 500 GB).
  * We use Hadoop for analyzing poker players' game histories and generating gameplay-related player statistics.

 * [[http://www.portabilite.info|Portabilité]]
  * 50-node cluster in colo.
  * Also used as a proof-of-concept cluster for a cloud-based ERP system.

 * [[http://www.psgtech.edu/|PSG Tech, Coimbatore, India]]
  * Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm, coupled with the data and compute parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. The scalable nature of Hadoop makes it apt for solving large-scale alignment problems.
  * Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad Core Rack Servers, with 2x6 MB cache and 4 x 500 GB SATA hard drives, to E7200/E7400 processors with 4 GB RAM and 160 GB HDD.

= Q =
 * [[http://www.quantcast.com/|Quantcast]]
  * 3000 cores, 3500 TB. 1 PB+ processed each day.
  * Hadoop scheduler with fully custom data path/sorter
  * Significant contributions to the KFS filesystem

= R =
 * [[http://www.rackspace.com/email_hosting/|Rackspace]]
  * 30-node cluster (dual-core, 4-8 GB RAM, 1.5 TB/node storage)
  * Parses and indexes logs from the email hosting system for search: http://blog.racklabs.com/?p=66

 * [[http://www.rakuten.co.jp/|Rakuten]] - Japan's online shopping mall
  * 69-node cluster
  * We use Hadoop to analyze logs and mine data for our recommender system and so on.

 * [[http://www.rapleaf.com/|Rapleaf]]
  * 80-node cluster (each node has: 2 quad-core CPUs, 4 TB storage, 16 GB RAM)
  * We use Hadoop to process data relating to people on the web.
  * We are also involved with Cascading to help simplify how our data flows through various processing stages.

 * [[http://www.recruit.jp/corporate/english/|Recruit]]
  * Hardware: 50 nodes (2*4 CPU, 2 TB*4 disk, 16 GB RAM each)
  * We use Hadoop (Hive) to analyze logs and mine data for recommendations (see the Hive sketch below).

 * [[http://www.reisevision.com/|reisevision]]
  * We use Hadoop for our internal search.

 * [[http://code.google.com/p/redpoll/|Redpoll]]
  * Hardware: 35 nodes (2*4 CPU, 10 TB disk, 16 GB RAM each)
  * We intend to parallelize some traditional classification and clustering algorithms, like Naive Bayes, K-Means and EM, so that they can deal with large-scale data sets.

 * [[http://resu.me/|Resu.me]]
  * Hardware: 5 nodes
  * We use Hadoop to process user resume data and run algorithms for our recommendation engine.

 * [[http://www.rightnow.com/|RightNow Technologies]] - Powering Great Experiences
  * 16-node cluster (each node has: 2 quad-core CPUs, 6 TB storage, 24 GB RAM)
  * We use Hadoop for log and usage analysis.
  * We predominantly leverage Hive and HUE for data access.

= S =
 * [[http://www.sara.nl/news/recent/20101103/Hadoop_proof-of-concept.html|SARA, Netherlands]]
  * SARA has initiated a Proof-of-Concept project to evaluate the Hadoop software stack for scientific use.

 * [[http://alpha.search.wikia.com|Search Wikia]]
  * A project to help develop open source social search tools. We run a 125-node Hadoop cluster.
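Hive turns up again and again on this page as the SQL-like layer used for log analysis (Facebook, Recruit and RightNow Technologies above; Twitter below). A minimal sketch of driving such a query from a script is shown here; the access_logs table and its columns are placeholders, and it assumes the Hive CLI is installed and on the PATH (`hive -e` executes a single HiveQL string and prints tab-separated rows).

{{{#!python
#!/usr/bin/env python
# Sketch: run a daily log aggregation through the Hive CLI and print the
# result.  Table and column names are invented for the example.
import subprocess

query = """
SELECT request_date, COUNT(*) AS views
FROM access_logs
WHERE request_date = '2011-10-16'
GROUP BY request_date
"""

output = subprocess.check_output(['hive', '-e', query])
for line in output.splitlines():
    print line
}}}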
 * [[http://wwwse.inf.tu-dresden.de/SEDNS/SEDNS_home.html|SEDNS]] - Security Enhanced DNS Group
  * We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing Hadoop DFS and MapReduce.

 * [[http://www.sematext.com/|Sematext International]]
  * We use Hadoop to store and analyze large amounts of search and performance data for our [[http://www.sematext.com/search-analytics/index.html|Search Analytics]] and [[http://www.sematext.com/spm/index.html|Scalable Performance Monitoring]] services.

 * [[http://www.slcsecurity.com/|SLC Security Services LLC]]
  * 18-node cluster (each node has: 4 dual-core CPUs, 1 TB storage, 4 GB RAM, Red Hat OS)
  * We use Hadoop for our high-speed data mining applications.

 * [[http://www.slingmedia.com/|Sling Media]]
  * We have a core analytics group that is using a 10-node cluster running Red Hat OS.
  * Hadoop is used as an infrastructure to run MapReduce (MR) algorithms on a number of raw data sets.
  * Raw data ingest happens hourly. Raw data comes from hardware and software systems out in the field.

@@ -667, +556 @@

  * Plan to implement Mahout to build a recommendation engine.

 * [[http://www.socialmedia.com/|Socialmedia.com]]
  * 14-node cluster (each node has: 2 dual-core CPUs, 2 TB storage, 8 GB RAM)
  * We use Hadoop to process log data and perform on-demand analytics.

 * [[http://www.spadac.com/|Spadac.com]]
  * We are developing the MrGeo (MapReduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.
  * We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.
  * MrGeo is soon to be open sourced as well.

 * [[http://www.specificmedia.com|Specific Media]]
  * We use Hadoop for log aggregation, reporting and analysis.
  * Two Hadoop clusters, all nodes 16 cores, 32 GB RAM
  * Cluster 1: 27 nodes (total 432 cores, 544 GB RAM, 280 TB storage)

@@ -686, +572 @@

  * We contribute to Hadoop and related projects where possible; see http://code.google.com/p/bigstreams/ and http://code.google.com/p/hadoop-gpl-packing/

 * [[http://stampedehost.com/|Stampede Data Solutions (Stampedehost.com)]]
  * Hosted Hadoop data warehouse solution provider

 * [[http://www.stumbleupon.com/|StumbleUpon (StumbleUpon.com)]]
  * We use HBase to store our recommendation information and to run other operations. We have HBase committers on staff.

= T =
 * [[http://www.taragana.com|Taragana]] - Web 2.0 product development and outsourcing services
  * We are using 16 consumer-grade computers to create the cluster, connected by a 100 Mbps network.
  * Used for testing ideas for blog and other data mining.

 * [[http://www.textmap.com/|The Lydia News Analysis Project]] - Stony Brook University
  * We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.

 * [[http://www.tailsweep.com/|Tailsweep]] - Ad network for blogs and social media
  * 8-node cluster (Xeon Quad Core 2.4 GHz, 8 GB RAM, 500 GB/node RAID 1 storage)
  * Used as a proof-of-concept cluster
  * Handling, e.g., data mining and blog crawling
 * [[http://www.thestocksprofit.com/|Technical analysis and Stock Research]]
  * Generating stock analysis on 23 nodes (dual 2.4 GHz Xeon, 2 GB RAM, 36 GB hard drive)

 * [[http://www.tegataiphoenix.com/|Tegatai]]
  * Collection and analysis of log, threat, and risk data and other security information on 32 nodes (8-core Opteron 6128 CPU, 32 GB RAM, 12 TB storage per node)

 * [[http://www.tid.es/about-us/research-groups/|Telefonica Research]]
  * We use Hadoop in our data mining and user modeling, multimedia, and internet research groups.
  * 6-node cluster with 96 total cores, 8 GB RAM and 2 TB storage per machine.

 * [[http://www.telenav.com/|Telenav]]
  * 60-node cluster for our location-based content processing, including machine learning algorithms for statistical categorization, deduping, aggregation & curation (hardware: 2.5 GHz quad-core Xeon, 4 GB RAM, 13 TB HDFS storage).
  * Private cloud for rapid server-farm setup for stage and test environments (using an elastic N-node cluster).
  * Public cloud for exploratory projects that require rapid servers for scalability and computing surges (using an elastic N-node cluster).

 * [[http://www.tianya.cn/|Tianya]]
  * We use Hadoop for log analysis.

+ * [[http://www.tubemogul.com|TubeMogul]]
+  * We use Hadoop HDFS, MapReduce, Hive and HBase.
+  * We manage over 300 TB of HDFS data across four Amazon EC2 Availability Zones.

 * [[http://www.tufee.de/|tufee]]
  * We use Hadoop for searching and indexing.

 * [[http://www.twitter.com|Twitter]]
  * We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.
  * We use both Scala and Java to access Hadoop's MapReduce APIs.
  * We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.

@@ -745, +624 @@

  * For more on our use of Hadoop, see the following presentations: [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|Hadoop and Pig at Twitter]] and [[http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter|Protocol Buffers and Hadoop at Twitter]]

 * [[http://tynt.com|Tynt]]
  * We use Hadoop to assemble web publishers' summaries of what users are copying from their websites, and to analyze user engagement on the web.
  * We use Pig and custom Java MapReduce code, as well as Chukwa.
  * We have 94 nodes (752 cores) in our clusters, as of July 2010, but the number grows regularly.

= U =
 * [[http://glud.udistrital.edu.co|Universidad Distrital Francisco Jose de Caldas (Grupo GICOGE/Grupo Linux UD GLUD/Grupo GIGA)]]
  . 5-node low-profile cluster. We use Hadoop to support the research project: Territorial Intelligence System of Bogota City.

 * [[http://ir.dcs.gla.ac.uk/terrier/|University of Glasgow - Terrier Team]]
  * 30-node cluster (Xeon Quad Core 2.4 GHz, 4 GB RAM, 1 TB/node storage). We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of [[http://ir.dcs.gla.ac.uk/terrier/|Terrier]] includes large-scale distributed indexing using Hadoop MapReduce.

 * [[http://www.umiacs.umd.edu/~jimmylin/cloud-computing/index.html|University of Maryland]]
  . We are one of six universities participating in IBM/Google's academic cloud computing initiative.
Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.

 * [[http://hcc.unl.edu|University of Nebraska Lincoln, Holland Computing Center]]
  . We currently run one medium-sized Hadoop cluster (1.6 PB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.

 * [[http://dbis.informatik.uni-freiburg.de/index.php?project=DiPoS|University of Freiburg - Databases and Information Systems]]
  . 10-node cluster (Dell PowerEdge R200 with Xeon Dual Core 3.16 GHz, 4 GB RAM, 3 TB/node storage).
  . Our goal is to develop techniques for the Semantic Web that take advantage of MapReduce (Hadoop) and its scaling behavior to keep up with the growing proliferation of semantic data.
   * [[http://dbis.informatik.uni-freiburg.de/?project=DiPoS/RDFPath.html|RDFPath]] is an expressive RDF path language for querying large RDF graphs with MapReduce.

@@ -777, +650 @@

= V =
 * [[http://www.veoh.com|Veoh]]
  * We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.

 * [[http://www.vibyggerhus.se/|Bygga hus]]
  * We use a Hadoop cluster for search and indexing for our projects.

 * [[http://www.visiblemeasures.com|Visible Measures Corporation]] uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers !VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.

 * [[http://www.vksolutions.com/|VK Solutions]]
  * We use a small Hadoop cluster in the scope of our general research activities at [[http://www.vklabs.com|VK Labs]] to get faster data access from web applications.
  * We also use Hadoop for filtering and indexing listings, processing log analysis, and for recommendation data.

= W =
 * [[http://www.web-alliance.fr|Web Alliance]]
  * We use Hadoop for our internal search engine optimization (SEO) tools. It allows us to store, index, and search data in a much faster way.
  * We also use it for log analysis and trend prediction.

 * [[http://www.worldlingo.com/|WorldLingo]]
  * Hardware: 44 servers (each server has: 2 dual-core CPUs, 2 TB storage, 8 GB RAM)
  * Each server runs Xen with one Hadoop/HBase instance and another instance with web or application servers, giving us 88 usable virtual machines.
  * We run two separate Hadoop/HBase clusters with 22 nodes each.

@@ -808, +676 @@

= X =

= Y =
 * [[http://www.yahoo.com/|Yahoo!]]
  * More than 100,000 CPUs in >40,000 computers running Hadoop
  * Our biggest cluster: 4500 nodes (2*4-CPU boxes with 4*1 TB disk & 16 GB RAM)
   * Used to support research for ad systems and web search
   * Also used to do scaling tests to support development of Hadoop on larger clusters
  * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about how we use Hadoop.
@@ -819, +685 @@

= Z =
 * [[http://www.zvents.com/|Zvents]]
  * 10-node cluster (Dual-Core AMD Opteron 2210, 4 GB RAM, 1 TB/node storage)
  * Run Naive Bayes classifiers in parallel over crawl data to discover event information