mahout-dev mailing list archives

From Robin Anil
Subject Re: anybody want to set a record with Mahout?
Date Fri, 26 Feb 2010 06:36:12 GMT
> I've written the M/R job in DistributedRowMatrix to do transpose, but our
> document matrixes produced by SparseVectorsFromSequenceFiles don't have
> integer-valued keys for the rows, so transpose doesn't yet make sense.
> Fooey.  More work to do.
I suppose it's a simple enough M/R job. See the MeanShiftCanopyCreator in
MAHOUT-307: we will M/R over the dataset, assign ids based on the map
attempt, and output int id => vector (the vector itself has the document ids
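
To make the id assignment concrete, here is a minimal sketch in plain Python rather than Hadoop code. It is not the MeanShiftCanopyCreator approach from MAHOUT-307; the striding scheme (task id plus a local counter times the number of tasks) is just one assumed way to derive ids from the map task, and the function name, the sample splits, and the dict-based vectors are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Vector = Dict[int, float]  # sparse vector: term index -> weight


def assign_row_ids(
    task_id: int, num_tasks: int, docs: Iterable[Tuple[str, Vector]]
) -> Iterator[Tuple[int, str, Vector]]:
    """Simulate one map task: emit (int row id, original doc key, vector).

    Ids are task_id, task_id + num_tasks, task_id + 2 * num_tasks, ...
    so two tasks never produce the same id and need no coordination.
    """
    for local_index, (doc_key, vector) in enumerate(docs):
        yield task_id + local_index * num_tasks, doc_key, vector


# Two hypothetical input splits, keyed by document name.
splits: List[List[Tuple[str, Vector]]] = [
    [("doc-a", {0: 1.0}), ("doc-b", {3: 2.0}), ("doc-c", {1: 0.5})],
    [("doc-d", {2: 4.0})],
]

# Run every simulated map task and collect its output rows.
rows = [
    row
    for task_id, split in enumerate(splits)
    for row in assign_row_ids(task_id, len(splits), split)
]
row_ids = [row_id for row_id, _, _ in rows]
```

The ids come out unique but not contiguous when the splits are uneven, so the row dimension of the matrix would have to be sized by the largest id rather than by the document count.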

My mind was wandering, and I was thinking of giving the record attempt a
better purpose than just creating junk ngram data (it's good enough for a record
There are a couple of datasets we can explore, like the genome dataset.
All of these are 150-200GB datasets.
Or there is the Wikipedia edits dataset (1TB+), which has all versions of all
the documents.

   - *Annotated Human Genome Data provided by ENSEMBL*
   The Ensembl project produces genome databases for human as well as almost 50
   other species, and makes this information freely available.

   - *Various US Census Databases from The US Census Bureau*
   States demographic data from the 1980, 1990, and 2000 US Censuses, summary
   information about Business and Industry, and 2003-2006 Economic Household
   Profile Data.

   - *UniGene provided by the National Center for Biotechnology Information*
   A set of transcript sequences of well-characterized genes and hundreds of
   thousands of expressed sequence tags (EST) that provide an organized view of
   the transcriptome.

   - *Freebase Data Dump from*
   A data dump of all the current facts and assertions in the Freebase system.
   Freebase is an open database of the world’s
   information, covering millions of topics in hundreds of categories. Drawing
   from large open data sets like Wikipedia, MusicBrainz, and the SEC archives,
   it contains structured information on many popular topics, including movies,
   music, people and locations – all reconciled and freely available.

Any thoughts on these?

Say, for Freebase, could we generate a topic -> item matrix?
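
As a rough illustration of what a topic -> item matrix could look like, here is a small Python sketch. It assumes the dump has already been reduced to (topic, item) string pairs, which is not the real Freebase dump format; the function name and the sample pairs are hypothetical. Each topic and item gets an integer id on first sight, and each row is kept as a sparse dict of item id -> count.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_topic_item_matrix(
    pairs: Iterable[Tuple[str, str]]
) -> Tuple[Dict[int, Dict[int, float]], Dict[str, int], Dict[str, int]]:
    """Return (sparse rows, topic name -> row id, item name -> column id)."""
    topic_ids: Dict[str, int] = {}
    item_ids: Dict[str, int] = {}
    rows: Dict[int, Dict[int, float]] = defaultdict(dict)
    for topic, item in pairs:
        # Assign the next free integer id the first time a name appears.
        row = topic_ids.setdefault(topic, len(topic_ids))
        col = item_ids.setdefault(item, len(item_ids))
        # Count how often the item was seen under this topic.
        rows[row][col] = rows[row].get(col, 0.0) + 1.0
    return dict(rows), topic_ids, item_ids


# Hypothetical (topic, item) pairs; a repeated pair increases the count.
pairs = [
    ("film", "Blade Runner"),
    ("film", "Alien"),
    ("music", "Kind of Blue"),
    ("film", "Alien"),
]
matrix, topic_ids, item_ids = build_topic_item_matrix(pairs)
```

Rows keyed by integer ids like this are the shape the transpose job needs, which is the same requirement as for the document matrices above.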

