mahout-dev mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Codebase refactoring proposal
Date Wed, 04 Feb 2015 18:51:41 GMT
Just looked at Gokhan’s PR and am quoting him below (since Andrew would like the PR). We really
need to support interchangeability of data and algorithms with Spark/MLlib/Spark SQL, even if
this breaks engine-neutrality and we adopt the lower-level integration of data types. Why
can’t we address this and become a better ML engine in the process?

=================================================================================================

The status is that I need to revise the code based on reviews.

But I have some concerns, summarized below:

Here is the story.

I'm going to contribute my recent work on distributed implementation of stochastic optimization
to some open source library, and for me, the only reason that accumulating blocks matters
is that I require it for averaging-based distributed stochastic gradient descent (DSGD).

I was an advocate of having Mahout as the ML and Matrix Computations core for distributed
processing engines, and was thinking that the Matrix DSL would be sufficient for implementing
such algorithms (such as DSGD) in an engine-agnostic way.

It seems that for implementing most optimization algorithms and ML models, one requires operations
beyond the DSL, and those operations are highly engine-specific.

Repeating the aggregating operation in Mahout is duplicated work, just as MLlib has some of
Mahout's Matrix DSL capabilities duplicated in uglier ways. Plus, having an algorithm
in Mahout but not in MLlib (or vice versa) really bothers me, because the other project's users
cannot benefit.

Considering your recent codebase refactoring effort, @dlyubimov, I imagine the best way to
use the DSL is by utilizing it inside MLlib (or whatever your favorite ML library is). That
is, MLlib depends on Mahout Matrix-DSL implementation, Matrix I/O and computations are handled
in Mahout, ML algorithms are handled in MLlib and/or other libraries.

Can we just slow this down and think about what should be contributed to where, and reconsider
the ideal Mahout-Spark integration?
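
To make the gap concrete, here is a minimal sketch (not code from the PR) of what stays engine-agnostic in the Samsara DSL versus the block-averaging step an averaging DSGD needs. drmParallelize, mapBlock and the implicit DistributedContext are the assumed DSL entry points, and drmA is just a toy matrix:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // a small distributed matrix to play with (assumes an implicit DistributedContext in scope)
    val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)

    // engine-agnostic: plain distributed algebra, e.g. the Gramian A'A
    val drmAtA = (drmA.t %*% drmA).checkpoint()

    // also engine-agnostic: per-block work such as one local SGD pass per block
    val drmUpdated = drmA.mapBlock() { case (keys, block) =>
      // run a local epoch over this block and return it (details elided)
      keys -> block
    }

    // NOT expressible with these operators: averaging the per-block models into one
    // global model. That reduce/aggregate across blocks is exactly the
    // engine-specific operation (Spark, H2O, Flink) the paragraphs above are about.
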

On Feb 4, 2015, at 10:37 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

btw good seq2sparse and seqdirectory ports are the only thing that
separates us from having a bigram- and trigram-based LSA tutorial.

On Wed, Feb 4, 2015 at 10:35 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> i think they are debating the details now, not the idea. Like how "NA" is
> different from "null" in classic dataframe representation etc.
> 
> On Wed, Feb 4, 2015 at 8:18 AM, Suneel Marthi <suneel.marthi@gmail.com>
> wrote:
> 
>> I believe they are still debating renaming SchemaRDD -> DataFrame. I must
>> admit Dmitriy had suggested this to me a few months ago, reusing SchemaRDD
>> if possible. Dmitriy was right, "U told us".
>> 
>> On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <pat@occamsmachete.com>
>> wrote:
>> 
>>> This sounds like a great idea, but I wonder if we can get rid of Mahout DRM
>>> as a native format. If we have DataFrame-backed DRMs (have they actually
>>> renamed SchemaRDD?) we ideally don’t need Mahout-native DRMs or
>>> IndexedDatasets, right? This would be a huge step! If we get data
>>> interchangeability with MLlib, it's a win. If we get general row and column
>>> IDs that follow the data through math, it's a win. Need to think through how
>>> to use a DataFrame in a streaming case, probably through some checkpointing
>>> of the window DStream. Hmm.
>>> 
>>> On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap.dev@outlook.com> wrote:
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>> I'd suggest considering this: remember all this talk about
>>>> language-integrated Spark SQL being basically a dataframe-manipulation DSL?
>>>> 
>>>> So now Spark devs are noticing this generality as well and are actually
>>>> proposing to rename SchemaRDD into DataFrame and make it a mainstream data
>>>> structure. (my "told you so" moment of sorts :)
>>>> 
>>>> What I am getting at is that I'd suggest making DRM and Spark's newly renamed
>>>> DataFrame our two major structures. In particular, standardize on using
>>>> DataFrame for things that may include non-numerical data and require more
>>>> grace about column naming and manipulation. Maybe relevant to the TF-IDF work
>>>> when it deals with non-matrix content.
>>> Sounds like a worthy effort to me. We'd basically be implementing an API at
>>> the math-scala level for SchemaRDD/DataFrame data structures, correct?
>>> 
>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pat@occamsmachete.com>
>> wrote:
>>>>> Seems like seq2sparse would be really easy to replace since it takes text
>>>>> files to start with; then the whole pipeline could be kept in RDDs. The
>>>>> dictionaries and counts could be either in-memory maps or RDDs for use with
>>>>> joins. This would get rid of sequence files completely from the pipeline.
>>>>> Item similarity uses in-memory maps, but the plan is to make it more
>>>>> scalable using joins as an alternative with the same API, allowing the user
>>>>> to trade off footprint for speed.
>>> 
>>> I think you're right; it should be relatively easy. I've been looking at
>>> porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level
>>> is that we don't have a distributed data structure for strings. Seems like
>>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>>> of this problem.
>>> 
>>> The other issue I'm a little fuzzy on is the distributed collocation
>>> mapping; it's a part of the seq2sparse code that I've not spent too much
>>> time in.
>>> 
>>> I think that this would be a very worthy effort as well; I believe
>>> seq2sparse is a particularly strong Mahout feature.
>>> 
>>> I'll start another thread since we're now way off topic from the
>>> refactoring proposal.
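
A rough sketch of the RDD-only replacement Pat describes above, assuming a spark-shell or Mahout shell session where sc is the SparkContext; the names and the naive tokenizer are illustrative, not existing Mahout code:

    import org.apache.spark.SparkContext._ // pair-RDD implicits (pre-1.3 Spark)
    import org.apache.spark.rdd.RDD

    // (docId, text) pairs straight from text files, no sequence files involved
    val docs: RDD[(String, String)] = sc.wholeTextFiles("hdfs:///docs")

    // naive tokenization; seq2sparse's analyzers and n-gram handling would slot in here
    val tokenized: RDD[(String, Seq[String])] =
      docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)

    // dictionary and document-frequency counts as in-memory maps
    // (or keep them as RDDs and join, for the more scalable variant)
    val dictionary: Map[String, Int] =
      tokenized.flatMap(_._2).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap().toMap

    val docFreq: Map[String, Long] =
      tokenized.flatMap { case (_, terms) => terms.distinct }.countByValue().toMap

These are the same dictionary and frequency-count maps the ported TF/TFIDF weighting discussed below would consume; swapping them for RDDs plus joins is the footprint-versus-speed trade-off mentioned above.
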
>>>>> 
>>>>> My use for TF-IDF is for row similarity and would take a DRM (actually an
>>>>> IndexedDataset) and calculate row/doc similarities. It works now, but only
>>>>> using LLR. This is OK when thinking of the items as tags or metadata, but
>>>>> for text tokens something like cosine may be better.
>>>>> 
>>>>> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot
>>>>> like how CF preferences are downsampled. This would produce a sparsified
>>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>>>>> terms before row similarity uses cosine. This is not so good for search, but
>>>>> it should produce much better similarities than Solr's "moreLikeThis" and
>>>>> does it for all pairs rather than one at a time.
>>>>> 
>>>>> In any case it can be used to create a personalized content-based
>>>>> recommender or augment a CF recommender with one more indicator type.
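
A minimal sketch of just the cosine step of that pipeline in the Samsara DSL, assuming drmWeightedDocs is an Int-keyed DRM that already holds the downsampled, TF-IDF re-weighted docs (the LLR downsampling itself is omitted):

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // normalize each row to unit length so that A %*% A.t yields cosine similarities
    val drmNormalized = drmWeightedDocs.mapBlock() { case (keys, block) =>
      for (r <- 0 until block.nrow) {
        val norm = block(r, ::).norm(2)
        if (norm > 0) block(r, ::) := block(r, ::) / norm
      }
      keys -> block
    }

    // all-pairs doc/doc similarity; the earlier downsampling is what keeps this tractable
    val drmCosineSims = (drmNormalized %*% drmNormalized.t).checkpoint()
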
>>>>> 
>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap.dev@outlook.com>
>> wrote:
>>>>> 
>>>>> 
>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>> Some issues WRT lower-level Spark integration:
>>>>>>> 1) Interoperability with Spark data. TF-IDF is one example I actually
>>>>>>> looked at. There may be other things we can pick up from their committers,
>>>>>>> since they have an abundance.
>>>>>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to
>>>>>>> me when someone on the Spark list asked about matrix transpose and an MLlib
>>>>>>> committer's answer was something like "why would you want to do that?".
>>>>>>> Usually you don't actually execute the transpose, but they don't even
>>>>>>> support A'A, AA', or A'B, which are core to what I work on. At present you
>>>>>>> pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
>>>>>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If the
>>>>>>> DSL could interchange datasets with MLlib, people would be pointed to the
>>>>>>> DSL for all of a bunch of "why would you want to do that?" features. MLlib
>>>>>>> seems to be algorithms, not math.
>>>>>>> 3) Integration of streaming. DStreams support most of the RDD interface.
>>>>>>> Doing a batch recalc on a moving time window would nearly fall out of
>>>>>>> DStream-backed DRMs. This isn't the same as incremental updates on
>>>>>>> streaming, but it's a start.
>>>>>>> Last year we were looking at Hadoop MapReduce vs. the faster compute
>>>>>>> engines: Spark, H2O, Flink. So we jumped. Now the need is for streaming, and
>>>>>>> especially incrementally updated streaming. Seems like we need to address
>>>>>>> this.
>>>>>>> Andrew, regardless of the above, having TF-IDF would be super helpful;
>>>>>>> row similarity for content/text would benefit greatly.
>>>>>>  I will put a PR up soon.
>>>>> Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
>>>>> and the Weight interface over from mr-legacy to math-scala. They're available
>>>>> now in spark-shell but won't be after this refactoring. These still require
>>>>> a dictionary and frequency-count maps to vectorize incoming text, so they're
>>>>> more for use with the old MR seq2sparse, and I don't think they can be used
>>>>> with Spark's HashingTF and IDF. I'll put them up soon.
>>>>> Hopefully they'll be of some use.
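
A rough sketch of how those ported classes might be used to vectorize one tokenized document, assuming the Weight interface keeps its calculate(tf, df, length, numDocs) signature after the move to math-scala; the dictionary and docFreq maps are the ones a seq2sparse-style pass produces:

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    // pre-refactoring location of Weight/TFIDF; the post-port package is assumed
    import org.apache.mahout.vectorizer.{TFIDF, Weight}

    def vectorize(terms: Seq[String],
                  dictionary: Map[String, Int],
                  docFreq: Map[String, Long],
                  numDocs: Int,
                  weight: Weight = new TFIDF()): Vector = {
      val v = new RandomAccessSparseVector(dictionary.size)
      // term frequencies within this one document
      terms.groupBy(identity).mapValues(_.size).foreach { case (term, tf) =>
        for (idx <- dictionary.get(term))
          v.setQuick(idx, weight.calculate(tf, docFreq.getOrElse(term, 1L).toInt, terms.size, numDocs))
      }
      v
    }
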
>>>>> 
>>>>> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>>>>>>> But first I need to do massive fixes and improvements to the distributed
>>>>>>> optimizer itself. Still waiting on green light for that.
>>>>>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlieu.7@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pat@occamsmachete.com>
>> wrote:
>>>>>>>>> BTW, what level of difficulty would making the DSL run on MLlib Vectors
>>>>>>>>> and RowMatrix be? Looking at using their hashing TF-IDF, but it raises an
>>>>>>>>> impedance mismatch between DRM and MLlib RowMatrix. This would further
>>>>>>>>> reduce artifact size by a bunch.
>>>>>>>> 
>>>>>>>> Short answer: if it were possible, I'd not bother with the Mahout code
>>>>>>>> base at all. The problem is it lacks sufficient flexibility, semantics and
>>>>>>>> abstraction. Breeze is infinitely better in that department, but at the
>>>>>>>> time it was sufficiently worse at abstracting interoperability of matrices
>>>>>>>> with different structures. And MLlib does not expose Breeze.
>>>>>>>> 
>>>>>>>> Looking forward toward hardware-accelerated bolt-on work, I just must say
>>>>>>>> that after reading Breeze code for some time I still have a much clearer
>>>>>>>> plan for how such back-end hybridization and cost calibration might work
>>>>>>>> with current Mahout math abstractions than with Breeze. It is also more in
>>>>>>>> line with my current work tasks.
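
For what it is worth, the data-level half of that impedance mismatch is fairly mechanical. A rough sketch (not existing Mahout or MLlib code) of copying a checkpointed Int-keyed DRM into an MLlib RowMatrix, assuming Spark 1.x MLlib and the Spark bindings' rdd accessor:

    import scala.collection.JavaConverters._
    import org.apache.mahout.math.drm.CheckpointedDrm
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.mllib.linalg.{Vectors => MLVectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def drmToRowMatrix(drmA: CheckpointedDrm[Int]): RowMatrix = {
      val rows = drmA.rdd.map { case (_, v) =>
        // copy each Mahout sparse vector into an MLlib sparse vector
        val nz = v.nonZeroes().asScala.map(e => (e.index, e.get)).toSeq
        MLVectors.sparse(v.size, nz)
      }
      new RowMatrix(rows, drmA.nrow, drmA.ncol)
    }

Going the other way would be drmWrap over a mapped RDD; what neither direction preserves is the row/column ID bookkeeping that IndexedDataset carries, which is the semantic half of the mismatch.
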
>>>>>>>> 
>>>>>>>>> Also backing something like a DRM with DStreams. Periodic model recalc
>>>>>>>>> with streams is maybe the first step towards truly streaming algos. Looking
>>>>>>>>> at DStream -> DRM conversion for A'A, A'B, and AA' in item and row
>>>>>>>>> similarity. Attach Kafka and get evergreen models, if not incrementally
>>>>>>>>> updating models.
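
A hedged sketch of that DStream-backed recalc, assuming the Spark bindings' drmWrap over an RDD of (Int, Vector) rows; the names and window sizes are illustrative rather than existing Mahout API:

    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    def periodicRecalc(rows: DStream[(Int, Vector)]): Unit = {
      // batch recalc over a sliding window: every 10 minutes, rebuild from the last hour
      rows.window(Minutes(60), Minutes(10)).foreachRDD { rdd =>
        if (rdd.take(1).nonEmpty) {
          val drmA = drmWrap(rdd)                      // this window's rows as a DRM
          val drmAtA = (drmA.t %*% drmA).checkpoint()  // e.g. A'A for item similarity
          // ... swap the freshly computed model in for the previous window's here
        }
      }
    }
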
>>>>>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>>> wrote:
>>>>>>>>> Bottom line, compile-time dependencies are satisfied with no extra stuff
>>>>>>>>> from mr-legacy or its transitives. This is proven by virtue of successful
>>>>>>>>> compilation with no dependency on mr-legacy in the tree.
>>>>>>>>> 
>>>>>>>>> Runtime sufficiency with no extra dependency is proven via running the
>>>>>>>>> shell or embedded tests (unit tests), which are successful too. This
>>>>>>>>> implies the embedding and shell APIs.
>>>>>>>>> 
>>>>>>>>> The issue with Guava is a typical one: if it were an issue, I wouldn't be
>>>>>>>>> able to compile and/or run stuff. Now, the question is what we do if
>>>>>>>>> drivers want extra stuff that is not found in Spark.
>>>>>>>>> 
>>>>>>>>> Now, it is so nice not to depend on anything extra that I am hesitant to
>>>>>>>>> offer anything here. Either shading or a lib with an opt-in dependency
>>>>>>>>> policy would suffice though, since it doesn't look like we'd have to have
>>>>>>>>> tons of extra for drivers.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <
>> pat@occamsmachete.com
>>>> 
>>>>>>>> wrote:
>>>>>>>>>> I vaguely remember there being a Guava version problem where the version
>>>>>>>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>>>>>>>> IndexedDataset shouldn't care about the version.
>>>>>>>>>> 
>>>>>>>>>> BTW it seems pretty easy to take out the option parser and replace it
>>>>>>>>>> with match and tuples, especially if we can extend the Scala App class.
>>>>>>>>>> It might actually simplify things since I can then use several case
>>>>>>>>>> classes to hold options (scopt needed one object), which in turn takes
>>>>>>>>>> out all those ugly casts. I'll take a look next time I'm in there.
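
A small sketch of the scopt-free approach described above: one case class per driver's options, filled in by a recursive match over the argument list. All names here are hypothetical:

    // hypothetical option holder for an item-similarity style driver
    case class SimilarityOpts(input: String = "", output: String = "", maxSimilar: Int = 100)

    def parseArgs(args: List[String], opts: SimilarityOpts = SimilarityOpts()): SimilarityOpts =
      args match {
        case "--input" :: value :: rest      => parseArgs(rest, opts.copy(input = value))
        case "--output" :: value :: rest     => parseArgs(rest, opts.copy(output = value))
        case "--maxSimilar" :: value :: rest => parseArgs(rest, opts.copy(maxSimilar = value.toInt))
        case Nil                             => opts
        case unknown :: _                    => sys.error("Unrecognized option: " + unknown)
      }

    // e.g. parseArgs(args.toList) from a driver object extending scala.App

Because the options come back as a typed case class, the casts mentioned above go away.
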
>>>>>>>>>> 
>>>>>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> in the 'spark' module it is overwritten with the spark dependency, which
>>>>>>>>>> also comes at the same version, as it happens. So it should be fine with 1.1.x.
>>>>>>>>>> 
>>>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
>>>>>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>>>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
>>>>>>>>>> [INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
>>>>>>>>>> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
>>>>>>>>>> [INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>>>>>>> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>>>>>>> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>>>>>>> [INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>>>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>>>>>>> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
>>>>>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>>>>>>>>>> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>>>>>>>>>> [INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>>>>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>>>>>>> [INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov
<
>>> dlieu.7@gmail.com
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> looks like it is also requested by mahout-math; wonder what is using it
>>>>>>>>>>> there.
>>>>>>>>>>> 
>>>>>>>>>>> At the very least, it needs to be synchronized to the one currently used
>>>>>>>>>>> by spark.
>>>>>>>>>>> 
>>>>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
>>>>>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>>>>>>> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <
>>> pat@occamsmachete.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> Looks like Guava is in Spark.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <
>> pat@occamsmachete.com>
>>>>>>>> wrote:
>>>>>>>>>>>> IndexedDataset uses Guava. Can't tell for sure, but it sounds like this
>>>>>>>>>>>> would not be included, since I think it was taken from the mr-legacy jar.
>>>>>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov
<
>>> dlieu.7@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>> From: "Pat Ferrel" <pat@occamsmachete.com>
>>>>>>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>>>>>>> To: <dev@mahout.apache.org>
>>>>>>>>>>>> Cc:
>>>>>>>>>>>> 
>>>>>>>>>>>>> When you get a chance a PR would be good.
>>>>>>>>>>>> Yes, it would. And not just for that.
>>>>>>>>>>>> 
>>>>>>>>>>>>> As I understand it you are putting some class jars somewhere in the
>>>>>>>>>>>>> classpath. Where? How?
>>>>>>>>>>>> /bin/mahout
>>>>>>>>>>>> 
>>>>>>>>>>>> (Computes 2 different classpaths. See 'bin/mahout classpath' vs.
>>>>>>>>>>>> 'bin/mahout -spark'.)
>>>>>>>>>>>> 
>>>>>>>>>>>> If I interpret the current shell code there correctly, the legacy path
>>>>>>>>>>>> tries to use the examples assemblies if not packaged, or /lib if
>>>>>>>>>>>> packaged. The true motivation of that significantly predates 2010, and I
>>>>>>>>>>>> suspect only Benson knows the whole true intent there.
>>>>>>>>>>>> 
>>>>>>>>>>>> The spark path, which is really a quick hack of the script, tries to get
>>>>>>>>>>>> only selected mahout jars and the locally installed spark classpath,
>>>>>>>>>>>> which I guess is just the shaded spark jar in recent spark releases. It
>>>>>>>>>>>> also apparently tries to include /libs/*, which is never compiled in the
>>>>>>>>>>>> unpackaged version, and now I think it is a bug that it is included,
>>>>>>>>>>>> because /libs/* is apparently legacy packaging and shouldn't be used in
>>>>>>>>>>>> spark jobs with a wildcard. I can't believe how lazy I am; I still did
>>>>>>>>>>>> not find time to understand the mahout build in all cases.
>>>>>>>>>>>> 
>>>>>>>>>>>> I am not even sure if packaged mahout will work with spark, honestly,
>>>>>>>>>>>> because of the /lib. Never tried that, since I mostly use application
>>>>>>>>>>>> embedding techniques.
>>>>>>>>>>>> 
>>>>>>>>>>>> The same solution may apply to adding external dependencies and removing
>>>>>>>>>>>> the assembly in the Spark module. Which would leave only one major build
>>>>>>>>>>>> issue afaik.
>>>>>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy
Lyubimov <
>>> dlieu.7@gmail.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> No, no PR. Only an experiment on private. But I believe I sufficiently
>>>>>>>>>>>>> defined what I want to do in order to gauge whether we may want to
>>>>>>>>>>>>> advance it some time later. The goal is a much lighter dependency for
>>>>>>>>>>>>> spark code: eliminate everything that is not compile-time dependent
>>>>>>>>>>>>> (and a lot of it is through legacy MR code, which we of course don't use).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Can't say I understand the remaining issues you are talking about
>>>>>>>>>>>>> though. If you are talking about compiling lib or a shaded assembly, no,
>>>>>>>>>>>>> this doesn't do anything about it. Although the point is, as it stands,
>>>>>>>>>>>>> the algebra and shell don't have any external dependencies but spark and
>>>>>>>>>>>>> these 4 (5?) mahout jars, so they technically don't even need an
>>>>>>>>>>>>> assembly (as demonstrated).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As I said, it seems driver code is the only one that may need some
>>>>>>>>>>>>> external dependencies, but that's a different scenario from those I am
>>>>>>>>>>>>> talking about. But I am relatively happy with having the first two
>>>>>>>>>>>>> working nicely at this point.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat
Ferrel <
>>>>> pat@occamsmachete.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop" module. It would be
>>>>>>>>>>>>>> nice to see how you've structured that in case we can use the same
>>>>>>>>>>>>>> model to solve the two remaining refactoring issues:
>>>>>>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon
Quinn <
>> squinn@gatech.edu>
>>>>>>>> wrote:
>>>>>>>>>>>>>> Also +1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> iPhone'd
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew
Palumbo <
>> ap.dev@outlook.com
>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sent from my Verizon Wireless
4G LTE smartphone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>>>> From: Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>>>>>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>> Subject: Codebase refactoring proposal
>>>>>>>>>>>>>>> So right now mahout-spark depends
on mr-legacy.
>>>>>>>>>>>>>>> I did quick refactoring and it
turns out it only
>> _irrevocably_
>>>>>>>>>> depends
>>>>>>>>>>>> on
>>>>>>>>>>>>>>> the following classes there:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> MatrixWritable, VectorWritable,
Varint/Varlong and
>>>>> VarintWritable,
>>>>>>>>>> and
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So  I just dropped those five
classes into new a new tiny
>>>>>>>>>>>> mahout-hadoop
>>>>>>>>>>>>>>> module (to signify stuff that
is directly relevant to
>>>>> serializing
>>>>>>>>>>>> thigns
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> DFS API) and completely removed
mrlegacy and its transients
>>> from
>>>>>>>>>> spark
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> spark-shell dependencies.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So non-cli applications (shell
scripts and embedded api
>> use)
>>>>>>>> actually
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> need spark dependencies (which
come from SPARK_HOME
>> classpath,
>>>>> of
>>>>>>>>>>>> course)
>>>>>>>>>>>>>>> and mahout jars (mahout-spark,
mahout-math(-scala),
>>>>> mahout-hadoop
>>>>>>>> and
>>>>>>>>>>>>>>> optionally mahout-spark-shell
(for running shell)).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This of course still doesn't
address driver problems that
>> want
>>>>> to
>>>>>>>>>>>> throw
>>>>>>>>>>>>>>> more stuff into front-end classpath
(such as cli parser)
>> but
>>> at
>>>>>>>> least
>>>>>>>>>>>> it
>>>>>>>>>>>>>>> renders transitive luggage of
mr-legacy (and the size of
>>>>>>>>>>>> worker-shipped
>>>>>>>>>>>>>>> jars) much more tolerable.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> How does that sound?
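
For context on why those few Writable classes are the only Hadoop surface the Spark-side code needs: DRM persistence goes through sequence files of VectorWritable. A rough sketch, assuming the usual Samsara entry points (mahoutSparkContext, drmDfsRead, dfsWrite) and purely illustrative paths:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "drm-io-sketch")

    // read a DRM persisted as (Writable key, VectorWritable value) sequence files
    val drmA = drmDfsRead("hdfs:///tmp/some-input-drm")

    // do some algebra and write the result back the same way
    (drmA.t %*% drmA).checkpoint().dfsWrite("hdfs:///tmp/some-output-drm")
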
>>>>>>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 

