mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: drmFromHDFS rowLabelBindings question
Date Fri, 12 Sep 2014 23:53:11 GMT
>
> Note that there is no way (yet) to perform aggregate or reduce like
> operation through the DSL. Though the backends (both spark and h2o) support
> reduce-like operations, there is no DSL operator for that yet. We could
> either introduce a reduce/aggregate operator in as engine neutral/close to
>

we already discussed that. Big NO

(1) Engines differ in shuffle task capabilities and specifics. A LOT. It is
my belief that finding a common denominator here is the way into a rat hole
with no real bottom. mapBlock(), which is a translation to a map task, is the
only clean exception and actually pretty useful as well.

(2) We are for R-Like algebra, not functional programming.

(3) Mixing in non-algebraic primitives will break the laws of algebraic
optimization. (Well, mapBlock(), binds and splits kinda do today and are
de-facto checkpoints; it would really take a lot to optimize over them,
although it is definitely possible in a lot of situations.)

(4) No need. This is probably the most compelling reason.
We do expect quasi-algebraic methods to be inevitable anyway, so one can
use the `rdd` property and do whatever one's heart desires, with full engine
capabilities. Most methods do just that, happily enough. Actually, all my
methods are quasi-algebraic. Instead of trying to standardize everything, we
are saying things are going to be quasi, in which case clean component
separation (in the OOA sense, think Strategy and perhaps Visitor patterns) of
quasi things and algebraic expressions would go a long way toward easing the
porting of non-algebraic parts to specific engines. In that sense, Pat's
stuff does not adhere to these patterns so I imagine it would be pretty
difficult to port it to e.g. Flink.
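
To make the separation in (4) concrete, here is a minimal sketch of the Strategy idea. It is an illustration under stated assumptions, not Mahout code: `ColumnSumStrategy`, `InCoreColumnSums` and `columnMeans` are hypothetical names, and plain Scala collections stand in for an engine's native dataset (the thing one would reach through the `rdd` property on Spark).

```scala
// Hypothetical sketch: none of these names are Mahout API.
object QuasiAlgebraicSketch {

  // Strategy interface for the non-algebraic, reduce-like step.
  trait ColumnSumStrategy {
    // Per-column sums over a non-empty collection of equal-length rows.
    def colSums(rows: Seq[Array[Double]]): Array[Double]
  }

  // Stand-in for an engine-specific implementation. A real one would use
  // the engine's own reduce on its native distributed dataset.
  object InCoreColumnSums extends ColumnSumStrategy {
    def colSums(rows: Seq[Array[Double]]): Array[Double] =
      rows.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  }

  // The method is written against the strategy only, so porting it to
  // another engine means supplying another ColumnSumStrategy.
  def columnMeans(rows: Seq[Array[Double]],
                  strategy: ColumnSumStrategy): Array[Double] =
    strategy.colSums(rows).map(_ / rows.size)

  // Two rows, two columns: column sums are (4.0, 8.0).
  val means: Array[Double] =
    columnMeans(Seq(Array(1.0, 2.0), Array(3.0, 6.0)), InCoreColumnSums)
}
```

The point of the design is only that the engine-specific part sits behind one small interface, while everything around it stays engine neutral.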


> algebraic way as possible, or keep any kind of reduction/aggregate phase of
> operation backend specific (which kind of sucks)
>


>
> Thanks
>
>
>
> > > Subject: Re: drmFromHDFS rowLabelBindings question
> > > From: pat.ferrel@gmail.com
> > > Date: Fri, 12 Sep 2014 14:41:35 -0700
> > > To: dev@mahout.apache.org
> > >
> > > Not sure if this helps but we (Sebastian and I) created an
> > > IndexedDataset which maintains row and column HashBiMaps that use the Int
> > > key to map to/from Strings. There are Reader and Writer traits for file IO
> > > (text files for now). The flow is to read an IndexedDataset using the
> > > Reader trait. Inside the IndexedDataset you have a CheckpointedDrm and two
> > > label BiMaps for rows and columns. This method is used in the row and item
> > > similarity jobs where you do math things like B.t %*% A. After you do the
> > > math using the drm contained in the IndexedDataset you assign the correct
> > > dictionaries to the resulting IndexedDataset to maintain your labels for
> > > writing or further math. It might make sense to implement some of the math
> > > ops that would work with this simple approach but in any case you can do it
> > > explicitly as those jobs do. The idea was to support other file formats
> > > like sequence files as the need comes up.
> > >
> > > On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <ap.dev@outlook.com> wrote:
> > >
> > > It doesn't look like it has anything to do with the conversion.
> > >
> > > after:
> > >
> > >    val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
> > >
> > > rowBindings.size  is one
> > >
> > > From: ap.dev@outlook.com
> > > To: dev@mahout.apache.org
> > > Subject: RE: drmFromHDFS rowLabelBindings question
> > > Date: Fri, 12 Sep 2014 15:53:48 -0400
> > >
> > >
> > >
> > >
> > > Thanks guys, I was wondering about the java.util.Map conversion too.
> > > I'll try copying everything into a java.util.HashMap and passing that to
> > > setRowBindings. I'll play around with it and if I can't get it to work,
> > > I'll file a jira.
> > >
> > > I'm just using it in the NB implementation so it's not a pressing issue.
> > >
> > > Appreciate it.
> > >
> > > > Date: Fri, 12 Sep 2014 12:35:21 -0700
> > > > Subject: Re: drmFromHDFS rowLabelBindings question
> > > > From: avati@gluster.org
> > > > To: dev@mahout.apache.org
> > > >
> > > > On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <avati@gluster.org> wrote:
> > > >
> > > >>
> > > >>
> > > >> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <avati@gluster.org> wrote:
> > > >>
> > > >>>
> > > >>>
> > > >>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > >>>
> > > >>>> but if you are really compelled that it is something that might be
> > > >>>> needed, the best way probably would be indeed to create an optional
> > > >>>> parameter to collect (something like
> > > >>>> drmLike.collect(extractLabels:Boolean=false)) which you can flip to
> > > >>>> true if needed, and the thing does toString on keys and assigns them
> > > >>>> to the in-core matrix's row labels. (requires a patch of course)
> > > >>>>
> > > >>>>
> > > >>> As I mentioned in the other mail, this is already the case. The code
> > > >>> seems to assume .toMap internally does collect. My (somewhat wild)
> > > >>> suspicion is that this line is somehow fooling the eye:
> > > >>>
> > > >>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
> > > >>>
> > > >>>
> > > >>>
> > > >> Argh, for a moment I was thinking `d` is still an rdd. It is actually
> > > >> all in-core, as the entirety of the rdd is collected up front into
> > > >> `data`. In any case I suspect the non-int key collecting code might be
> > > >> doing something funny.
> > > >>
> > > >
> > > > One problem I see is that toMap() returns a scala.collection.Map,
> > > > whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
> > > > Since the code compiles fine there is probably an implicit conversion
> > > > happening somewhere, and I don't know if the conversion is doing the
> > > > right thing. Other than this, the rest of the code seems to look fine.
> > >
> > >
> >
> >
> >
>
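
On the Scala Map versus java.util.Map point raised in the quoted thread, here is a small sketch of the explicit copy into a java.util.HashMap that was proposed there. The keys and indices are made up, and nothing here calls Mahout; it only shows the standard-library behaviour involved. It also shows why a `toMap` result of size one points at the keys rather than at the conversion: `toMap` keeps one entry per distinct key, so identical stringified keys collapse into a single binding.

```scala
object RowBindingsSketch {
  // Stand-in for the collected (key, row index) pairs; values are made up.
  val collected: Seq[(String, Int)] =
    Seq(("doc-a", 0), ("doc-b", 1), ("doc-c", 2))

  // toMap keeps one entry per distinct key.
  val scalaBindings: Map[String, Int] = collected.toMap

  // Copy explicitly into a java.util.HashMap instead of relying on an
  // implicit conversion. Int.box makes the Int -> java.lang.Integer
  // step visible.
  val javaBindings = new java.util.HashMap[String, java.lang.Integer]()
  scalaBindings.foreach { case (k, v) => javaBindings.put(k, Int.box(v)) }

  // If every key stringifies to the same value, the entries collapse into
  // one and the last value wins.
  val collapsed: Map[String, Int] =
    Seq(("same", 0), ("same", 1), ("same", 2)).toMap
}
```

If the explicitly built map still ends up with a single entry, the keys coming out of the collect step are the place to look.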
