mahout-dev mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject [GSOC] Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use *
Date Mon, 05 Apr 2010 21:51:39 GMT
+changing subject line.

Hi Necati, as I mentioned on the JIRA ticket, you need to take a look at the
current data representation format (Vectors) and how structured data (ARFF
format) is converted to vectors. You will find a basic converter in the
utils folder under trunk.

With regard to NoSQL, the Bayes classifier already interfaces with HBase to
store and access the model on an HBase server. We want to extend that to a
generic Matrix adapter which can be consumed by any algorithm in Mahout.

Take a look at these open issues:
https://issues.apache.org/jira/browse/MAHOUT-78
https://issues.apache.org/jira/browse/MAHOUT-202

You can follow what I did for the Mahout Bayes code last year here:
https://issues.apache.org/jira/browse/MAHOUT-124


Taste already has some wrappers which read from a MySQL database.

What I would like in a proposal is this: at least for the first cut, implement
a data dump tool which can dump selected fields (from SQL or NoSQL stores) and
write them to a sequence file or, better, to the Avro document format (@Drew,
you can explain more here).
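
A minimal sketch of the field-selection step of such a dump tool, under
stated assumptions: records are modeled as plain maps (in practice they would
come from a JDBC ResultSet or an HBase scan), and a tab-separated text Writer
stands in for the sequence file or Avro output. The class name FieldDumper
and its method are hypothetical and not part of Mahout.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical dump helper for illustration; not part of Mahout. */
final class FieldDumper {

    private FieldDumper() {}

    /**
     * Writes one line per record: the key, then each selected field, tab-separated.
     * Records without a key are skipped. Returns the number of lines written.
     */
    public static int dump(Iterable<? extends Map<String, ?>> records,
                           String keyField,
                           List<String> selectedFields,
                           Writer out) {
        int written = 0;
        try {
            for (Map<String, ?> record : records) {
                Object key = record.get(keyField);
                if (key == null) {
                    continue; // no usable key, nothing to index the record by
                }
                StringBuilder line = new StringBuilder(clean(key));
                for (String field : selectedFields) {
                    line.append('\t').append(clean(record.get(field)));
                }
                out.write(line.append('\n').toString());
                written++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    /** Missing values become empty strings; tabs and newlines would break the format. */
    private static String clean(Object value) {
        return value == null ? "" : value.toString().replace('\t', ' ').replace('\n', ' ');
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 42);
        row.put("title", "Intro to Mahout");
        row.put("body", "Vectors and matrices");
        row.put("internal_flag", true); // not selected, so not dumped
        StringWriter out = new StringWriter();
        dump(Arrays.asList(row), "id", Arrays.asList("title", "body"), out);
        System.out.print(out);
    }
}
```

Keeping the record source and the output behind simple interfaces means the
same selection logic could sit in front of either a SQL or a NoSQL reader.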

Similar to the ARFF to vector conversion, we need to convert this document file
to a SequenceFile of vectors with pluggable weighting strategies.
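
To make "pluggable weighting strategies" concrete, here is a small sketch,
assuming documents are already tokenized and that sparse vectors can be
modeled as index-to-weight maps. The names DocumentVectorizer and Weighting
are hypothetical and are not Mahout's actual classes; the TF and TF-IDF
formulas are the textbook ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical vectorizer for illustration; not part of Mahout. */
final class DocumentVectorizer {

    private DocumentVectorizer() {}

    /** Pluggable strategy: maps term statistics to a weight. */
    interface Weighting {
        double weight(int termFreq, int docFreq, int numDocs);
    }

    /** Raw term frequency. */
    static final Weighting TF = (tf, df, n) -> tf;

    /** Term frequency scaled by log(numDocs / docFreq). */
    static final Weighting TF_IDF = (tf, df, n) -> tf * Math.log((double) n / df);

    /** Returns one sparse vector (term index to weight) per document. */
    static List<Map<Integer, Double>> vectorize(List<List<String>> docs, Weighting weighting) {
        Map<String, Integer> dictionary = new LinkedHashMap<>(); // term -> vector index
        Map<String, Integer> docFreq = new HashMap<>();          // term -> number of docs containing it
        List<Map<String, Integer>> termCounts = new ArrayList<>();

        // First pass: build the dictionary and collect term and document frequencies.
        for (List<String> doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                dictionary.putIfAbsent(term, dictionary.size());
                counts.merge(term, 1, Integer::sum);
            }
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            termCounts.add(counts);
        }

        // Second pass: apply the chosen weighting to every term of every document.
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> counts : termCounts) {
            Map<Integer, Double> vector = new TreeMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                String term = e.getKey();
                vector.put(dictionary.get(term),
                        weighting.weight(e.getValue(), docFreq.get(term), docs.size()));
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("mahout", "vector", "mahout"),
                Arrays.asList("vector", "hbase"));
        System.out.println(vectorize(docs, TF));
        System.out.println(vectorize(docs, TF_IDF));
    }
}
```

Because the strategy is a single-method interface, a new weighting can be
supplied as a lambda without touching the conversion code.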

To cover all of this in a proposal, you will have to read a bit of what
is there in the code and decide what you think can be reused. Feel free to
post if you have any questions.


Robin

On Tue, Apr 6, 2010 at 2:58 AM, Necati Batur <necatibatur@gmail.com> wrote:

> *IDEA:Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data
> for all the algorithms to use *
>
> *Summary*
>
> First of all, I am very excited to join a program like GSoC and, most
> importantly, to work for a big open source project like Apache. I am
> looking forward to good collaboration and new challenges in software
> development. Information management issues in particular sound great to
> me. I am confident working with new technologies. I took the Data
> Structures I and II courses at university, so I am comfortable with data
> structures. Most importantly, I am interested in databases. From my
> software engineering courses, I know how to work on a project using
> iterative development and timelines.
>
> *About Me*
>
> I am a senior student in computer engineering at
> iztech <http://english.iyte.edu.tr/main_eng.jsp?pageName=main.htm> in
> Turkey. My areas of interest are information management, OOP (Object
> Oriented Programming) and currently bioinformatics. I have been working
> with an Assistant Professor (Jens Allmer <http://jens.allmer.de/>) in the
> molecular biology and genetics department for one year. First, we worked
> on a protein database, 2DB <http://www.2db.de.ms/>, and we presented the
> project at HIBIT09 <http://hibit09.ii.metu.edu.tr/>. The project was
> “Database management system independence by amending 2DB with a database
> access layer”. Currently, I am working on another project (Kerb) as my
> senior project, a general sequential task management system intended to
> reduce errors and save time in biological experiments. We will present
> this project at HIBIT2010 <http://hibit2010.ii.metu.edu.tr/> too.
>
> *My Offer for the Project*
>
> The data adapters for the higher level languages will require a good
> command of data structures and some knowledge of finite mathematics, both
> of which I am confident in. Also, the code in the svn repository seems to
> need some improvements and documentation.
>
> Briefly, I would do the following for this project:
>
>   1. Understand the underlying maths for the adapters
>   2. Determine the data structures that would be used for the adapters
>   3. Implement the necessary methods to handle creation of these
>   structures
>   4. Write test cases to check whether our code covers all the issues
>   raised by data retrieval operations
>   5. Iterate on the code to make the algorithms more robust
>   6. Document the overall project so that our particular project fits
>   into the overall scope
>
