mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: [jira] [Commented] (MAHOUT-1518) Preprocessing for collaborative filtering with the Scala DSL
Date Mon, 28 Apr 2014 20:01:36 GMT
[~ssc] makes sense. Is this still thought to be a stop-gap?


On Mon, Apr 28, 2014 at 12:50 PM, Sebastian Schelter (JIRA) <jira@apache.org> wrote:

>
>     [https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983463#comment-13983463]
>
> Sebastian Schelter commented on MAHOUT-1518:
> --------------------------------------------
>
> I thought about this issue and I think a generic solution could work as
> follows:
>
> # We have a generic dataframe that allows you to load your CSV file and
> specify a schema for that: first column has name "timestamp" and type long,
> second column has name "userid" and type string, third has name "itemid"
> and type string, fourth column has name "interaction" and type "string" or
> some enumeration type.
> # the dataframe can be filtered by column values, so we could for example
> create a new dataframe with all rows where interaction equals "view"
> # we can extract a DRM from the dataframe, e.g. by specifying a
> dataframe-column to use as matrix row index and a dataframe-column to use
> as matrix column index, this would give us something similar to the
> IndexedDataset: a DRM plus two bidirectional dictionaries
> # we feed the DRM into the cooccurrence code and retrieve the result as DRM
> # we have another method that converts the result DRM back to a generic
> dataframe using the bidirectional dictionary
>
> Does that make sense?
>
> > Preprocessing for collaborative filtering with the Scala DSL
> > ------------------------------------------------------------
> >
> >                 Key: MAHOUT-1518
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Collaborative Filtering
> >            Reporter: Sebastian Schelter
> >            Assignee: Sebastian Schelter
> >             Fix For: 1.0
> >
> >         Attachments: MAHOUT-1518.patch
> >
> >
> > The aim here is to provide some easy-to-use machinery to enable the
> > usage of the new Cooccurrence Analysis code from MAHOUT-1464 with datasets
> > represented as follows in a CSV file with the schema _timestamp, userId,
> > itemId, action_, e.g.
> > {code}
> > timestamp1, userIdString1, itemIdString1, "view"
> > timestamp2, userIdString2, itemIdString1, "like"
> > {code}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
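
For illustration, the five steps proposed in the quoted comment could be sketched roughly as below. This is a minimal sketch in plain Scala collections, not the Mahout API and not the attached patch: Interaction, Dictionary, IndexedMatrix and all method names are hypothetical, the "dataframe" is just a Seq of case classes, the "DRM" is an in-memory set of (row, column) cells, and the cooccurrence step is a naive pair count standing in for the MAHOUT-1464 code.

```scala
// Hypothetical sketch: none of these names are Mahout APIs.
object CooccurrenceSketch {

  // One row of the CSV schema from the issue: timestamp, userId, itemId, action.
  final case class Interaction(timestamp: Long, userId: String, itemId: String, action: String)

  // Bidirectional dictionary between external string IDs and matrix indices.
  final case class Dictionary(toIndex: Map[String, Int]) {
    val toId: Map[Int, String] = toIndex.map(_.swap)
  }

  object Dictionary {
    // Assigns indices in order of first appearance.
    def of(ids: Seq[String]): Dictionary = Dictionary(ids.distinct.zipWithIndex.toMap)
  }

  // Sparse user-by-item matrix as a set of (row, column) cells, plus both dictionaries.
  final case class IndexedMatrix(cells: Set[(Int, Int)], rows: Dictionary, cols: Dictionary)

  // Step 1: parse CSV lines against the fixed schema; lines without four fields are skipped.
  def parse(lines: Seq[String]): Seq[Interaction] =
    lines.map(_.split(",").map(_.trim)).collect {
      case Array(ts, user, item, action) => Interaction(ts.toLong, user, item, action)
    }

  // Step 3: use userId as the row index and itemId as the column index.
  def toMatrix(data: Seq[Interaction]): IndexedMatrix = {
    val rows = Dictionary.of(data.map(_.userId))
    val cols = Dictionary.of(data.map(_.itemId))
    val cells = data.map(d => (rows.toIndex(d.userId), cols.toIndex(d.itemId))).toSet
    IndexedMatrix(cells, rows, cols)
  }

  // Step 4: count, for each ordered pair of distinct items, the users who touched both.
  def cooccurrences(m: IndexedMatrix): Map[(Int, Int), Int] = {
    val itemsPerUser = m.cells.groupBy(_._1).values.map(_.map(_._2))
    val pairs = for {
      items <- itemsPerUser.toSeq
      a <- items.toSeq
      b <- items.toSeq
      if a != b
    } yield (a, b)
    pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
  }

  // Step 5: translate matrix indices back to the original item IDs.
  def toItemIds(counts: Map[(Int, Int), Int], items: Dictionary): Map[(String, String), Int] =
    counts.map { case ((a, b), n) => (items.toId(a), items.toId(b)) -> n }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "1, u1, i1, view",
      "2, u1, i2, view",
      "3, u2, i1, view",
      "4, u2, i2, view",
      "5, u2, i3, like"
    )
    val views = parse(lines).filter(_.action == "view") // Step 2: filter by column value
    val matrix = toMatrix(views)
    val result = toItemIds(cooccurrences(matrix), matrix.cols)
    result.toSeq.sortBy(_._1).foreach { case ((a, b), n) => println(s"$a,$b,$n") }
  }
}
```

With the sample rows above, only the four "view" rows survive the filter, so items i1 and i2 co-occur for two users and i3 drops out. Keeping the dictionaries next to the matrix is what makes the final translation back to string IDs possible, which is the role the comment assigns to the IndexedDataset-like structure.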
