mahout-dev mailing list archives

From "Oleksandr Petrov (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-522) Using different data sources for input/output
Date Mon, 04 Oct 2010 13:52:32 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917608#action_12917608 ]

Oleksandr Petrov commented on MAHOUT-522:
-----------------------------------------

@Drew
An API is an awesome idea.
I'll probably polish my stuff related to SequenceFile creation and TF, TF/IDF, and Dictionary
import. I realize the existing file format is the best thing for map/reduce usage, but still,
as you mentioned, an API would speed up the development process a lot.

There could be an input reader that provides an easy iterable interface (getNext, and maybe
getCount) for all four of the things mentioned, plus several sample adapters: a MongoDB one
and a SQL one, for instance.
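
Roughly, I picture something like this. A minimal sketch only; every name below is
hypothetical, nothing here exists in the codebase yet:

    // InputReader.java -- hypothetical name, not an existing Mahout class.
    import java.io.Closeable;
    import java.io.IOException;

    /** Iterable-style reader over any backing store (FS, MongoDB, SQL, ...). */
    public interface InputReader<T> extends Closeable {
      /** Returns the next item, or null once the source is exhausted. */
      T getNext() throws IOException;

      /** Total item count, or -1 if the source can't tell up front. */
      long getCount();
    }

    // JdbcDocumentReader.java -- sample SQL adapter; the query is made up.
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class JdbcDocumentReader implements InputReader<String> {
      private final ResultSet rs;

      public JdbcDocumentReader(Connection conn) throws SQLException {
        this.rs = conn.createStatement().executeQuery("SELECT body FROM documents");
      }

      public String getNext() throws IOException {
        try {
          return rs.next() ? rs.getString(1) : null;
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }

      public long getCount() {
        return -1; // unknown without an extra COUNT(*) query
      }

      public void close() throws IOException {
        try {
          rs.close();
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }
    }

A MongoDB adapter would look much the same, just iterating a Mongo cursor instead of a
ResultSet.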

I do agree that there's no such thing as a standard DB schema for these things; everyone
uses their own. So, moving from easy to difficult:
a) allow people to use / reuse / provide their own readers
b) allow a single configuration point (which is questionable, since people may want to handle
it all themselves)
c) implement a single writer interface that accepts a reader and drains every available item
out of it (see the sketch below)
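
For (c), a sketch of the writer side that pairs with the reader above (again, all names are
hypothetical, shown as separate files):

    // OutputWriter.java -- hypothetical counterpart to InputReader.
    import java.io.Closeable;
    import java.io.IOException;

    public interface OutputWriter<T> extends Closeable {
      void write(T item) throws IOException;
    }

    // Copier.java -- drains any reader into any writer.
    import java.io.IOException;

    public final class Copier {
      private Copier() {}

      public static <T> void copy(InputReader<T> in, OutputWriter<T> out)
          throws IOException {
        T item;
        while ((item = in.getNext()) != null) {
          out.write(item);
        }
      }
    }

With that in place, any source/sink pair can be combined without either side knowing about
the other.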

We have bits of that covered already, but it surely needs to become a bit more generic to
allow reuse. I'll be working on it over the next week or so.

> Using different data sources for input/output
> ---------------------------------------------
>
>                 Key: MAHOUT-522
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-522
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Oleksandr Petrov
>
> Hi,
> Mahout is currently bound to the file system, at least in my experience. Most of the data
structures I'm working with aren't located on the file system, and the output isn't bound to
the FS either; most of the time I'm forced to export my datasets from a DB to the FS, and then
load them back into the DB afterwards.
> Most likely it's not very interesting for the core developers, who are working on the
algorithm implementations, to start writing adapters to DBs or anything like that.
> For instance, SequenceFilesFromDirectory is a simple way to get your files from a directory
and convert them all to SequenceFiles. Some people would be extremely grateful if there were
an interface they could implement to throw their files from a DB straight into a SequenceFile,
without the file system as a medium. If anyone's interested, I can provide a patch; a rough
sketch follows.
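> A minimal sketch of what I mean, written against the plain Hadoop SequenceFile API (the
connection, table, and column names are made up for the example):
>
>     import java.sql.Connection;
>     import java.sql.ResultSet;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>
>     /** Writes DB rows straight into a SequenceFile, no intermediate plain files. */
>     public class DbToSequenceFile {
>       public static void export(Connection conn, Path out) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         SequenceFile.Writer writer =
>             SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
>         try {
>           ResultSet rs = conn.createStatement()
>               .executeQuery("SELECT id, body FROM documents");
>           while (rs.next()) {
>             writer.append(new Text(rs.getString("id")),
>                           new Text(rs.getString("body")));
>           }
>         } finally {
>           writer.close();
>         }
>       }
>     }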
> The second issue is related to the workflow process itself. For instance, what if I already
have a Dictionary, TF-IDF, and TF in some particular format that was created by other parts
of my infrastructure? Again, I need to convert those to the Mahout data structures. Can't we
just allow other jobs to accept more generic types (or interfaces, for instance) when working
with TF-IDF, TF, and Dictionaries, without binding those to the Hadoop FS?
> I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but it's also an
independent project, so it may benefit and gain wider adoption if it allows working with any
format. I have an idea of how to implement this, and have partially implemented it for our
own infrastructure needs, but I really want to hear some feedback from users and Hadoop
developers on whether it's suitable and whether anyone may benefit from it.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

