mahout-dev mailing list archives

From Philip Ramsey
Subject modifications to GSoC Proposal on wiki
Date Mon, 30 Mar 2009 00:25:09 GMT

I've made substantial changes to my draft proposal on the wiki at:

The following is the majority of what I have as my proposal thus far. Please
note that the timeline description is rather sparse right now; I will be
filling it in shortly, but could definitely use some guidance/advice about
whether or not I am expecting too much or too little out of myself, or if
this sounds like it would, in fact, be useful to the project.

Thanks folks,
Philip Ramsey



Currently I am engaged in research that explores word sense disambiguation
via co-reference resolution. Much of the work thus far has involved
computing syntactic similarity of words using weighted distance measures
between the probability of their occurrence in a given bi-gram. I've been
using Hadoop to implement these similarity measures, with the goal of
generating sets of syntactically similar words from which phrasal and
part-of-speech categories can be abstracted. This research has primarily
been based on the results of (Lee, 2001), with the next step looking towards
an abstraction of these similar sets for the purpose of grammar induction, as
explored by (De Pauw, 2004). My interest in exploring Mahout as a platform
for this abstraction began earlier this year, although I have not yet begun
working with it.
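
To make the similarity computation concrete, here is a minimal single-machine
sketch (not the Hadoop job itself). It assumes each word is represented by
counts of the words that follow it in bi-grams, and it uses Jensen-Shannon
divergence purely as one common choice of distance; the measure actually used
in the research is not specified here. The class and method names are
hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: compares two words by the words that follow them in bi-grams. */
final class ContextSimilarity {

    /** Turns raw bi-gram counts (next word -> count) into relative frequencies. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        double total = 0.0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> probs = new HashMap<>();
        if (total == 0.0) {
            return probs; // no observations, so no distribution
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / total);
        }
        return probs;
    }

    /** Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones. */
    static double jensenShannon(Map<String, Double> p, Map<String, Double> q) {
        Set<String> vocab = new HashSet<>(p.keySet());
        vocab.addAll(q.keySet());
        double divergence = 0.0;
        for (String w : vocab) {
            double pw = p.getOrDefault(w, 0.0);
            double qw = q.getOrDefault(w, 0.0);
            double mw = 0.5 * (pw + qw); // mixture distribution
            if (pw > 0.0) {
                divergence += 0.5 * pw * log2(pw / mw);
            }
            if (qw > 0.0) {
                divergence += 0.5 * qw * log2(qw / mw);
            }
        }
        return divergence;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```

Words whose divergence falls below some chosen threshold would then be grouped
into the same similarity set; the threshold is another assumption that would
need tuning against the data.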

Building off of the work of (De Pauw, 2004) and (Lankhorst, 1994), I propose
to implement an evolutionary algorithm framework specifically geared towards
inferring grammars over a large dataset using the Watchmaker implementations
currently available in Mahout. The project is intended to serve as a
use-case test, providing patches and fixes as needed to the
near-release-ready GA package currently available in Mahout. The benefit of
such a test case is that, because NLP requires a robust representation, it
offers a more demanding example on which to explore variable optimization
strategies, potentially surfacing previously unobserved bugs and resulting
in a more diverse implementation library.
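
For readers less familiar with evolutionary algorithms, the following is a
rough sketch of the generational loop such a framework wraps. It deliberately
avoids the Watchmaker and Mahout APIs and uses plain Java; the bit-string
genome, tournament selection, one-point crossover, and single-elite strategy
are all illustrative assumptions, not the representation the grammar work
would use.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Hypothetical, library-free sketch of a generational GA over bit strings. */
final class TinyGeneticAlgorithm {
    private final Random rng;
    private final ToDoubleFunction<boolean[]> fitness;

    TinyGeneticAlgorithm(long seed, ToDoubleFunction<boolean[]> fitness) {
        this.rng = new Random(seed);
        this.fitness = fitness;
    }

    /** Runs the loop and returns the fittest genome of the final generation. */
    boolean[] evolve(int populationSize, int genomeLength, int generations, double mutationRate) {
        boolean[][] population = new boolean[populationSize][genomeLength];
        for (boolean[] genome : population) {
            for (int i = 0; i < genomeLength; i++) {
                genome[i] = rng.nextBoolean();
            }
        }
        for (int g = 0; g < generations; g++) {
            boolean[][] next = new boolean[populationSize][];
            next[0] = best(population).clone(); // elitism: the best genome always survives
            for (int k = 1; k < populationSize; k++) {
                boolean[] child = crossover(tournament(population), tournament(population));
                mutate(child, mutationRate);
                next[k] = child;
            }
            population = next;
        }
        return best(population);
    }

    private boolean[] best(boolean[][] population) {
        boolean[] winner = population[0];
        for (boolean[] genome : population) {
            if (fitness.applyAsDouble(genome) > fitness.applyAsDouble(winner)) {
                winner = genome;
            }
        }
        return winner;
    }

    /** Binary tournament: pick two genomes at random and keep the fitter one. */
    private boolean[] tournament(boolean[][] population) {
        boolean[] a = population[rng.nextInt(population.length)];
        boolean[] b = population[rng.nextInt(population.length)];
        return fitness.applyAsDouble(a) >= fitness.applyAsDouble(b) ? a : b;
    }

    /** One-point crossover: prefix from one parent, suffix from the other. */
    private boolean[] crossover(boolean[] mother, boolean[] father) {
        int cut = rng.nextInt(mother.length + 1);
        boolean[] child = new boolean[mother.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = i < cut ? mother[i] : father[i];
        }
        return child;
    }

    /** Flips each bit independently with the given probability. */
    private void mutate(boolean[] genome, double rate) {
        for (int i = 0; i < genome.length; i++) {
            if (rng.nextDouble() < rate) {
                genome[i] = !genome[i];
            }
        }
    }
}
```

The point of the sketch is only to show the separable roles (candidate
creation, fitness evaluation, selection, variation) that a GA framework
exposes; in the proposed work the genome would encode grammar candidates
rather than bits.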

*Detailed Description:*

Using Hadoop, we have been able to generate sets of similar words, and have
begun work on linking these sets via the occurrence of a bi-gram, given that
a subset of similar words occupies a specific location within it. In this
scenario, the training/test data will be broken into parts such that a
fitness measure can be computed based on the degree to which the results of
a given generation mirror the relative frequencies observable in the data.
As such, the goal will be to generate, from the training data, word
co-occurrences that do not occur in the training data but nonetheless have
an observed probability in the testing data. Because they are built from
sets of similar words, the generated co-occurrences will work as an
inference engine for defining the rules of grammar over the dataset.
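
One possible shape for such a fitness measure is sketched below. It is an
assumption on my part rather than a settled design: candidates already seen
in training earn nothing, unseen candidates earn their relative frequency in
the held-out data, and the total is averaged over all candidates so that
proposing many unsupported bi-grams is penalized. The class name and the
"word1 word2" string encoding of a bi-gram are hypothetical.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical fitness sketch: rewards novel bi-grams that are attested in held-out data. */
final class HeldOutFitness {

    /**
     * @param candidates      bi-grams proposed by one individual, encoded as "word1 word2"
     * @param trainingBigrams bi-grams observed in the training split
     * @param testFrequencies relative frequency of each bi-gram in the testing split
     * @return mean held-out frequency per candidate, counting only candidates unseen in training
     */
    static double score(Set<String> candidates,
                        Set<String> trainingBigrams,
                        Map<String, Double> testFrequencies) {
        if (candidates.isEmpty()) {
            return 0.0;
        }
        double mass = 0.0;
        for (String bigram : candidates) {
            if (trainingBigrams.contains(bigram)) {
                continue; // already observed in training, so nothing was inferred
            }
            mass += testFrequencies.getOrDefault(bigram, 0.0);
        }
        return mass / candidates.size();
    }
}
```

For example, with candidates {"the cat", "a dog", "the dog"}, training data
containing only "the cat", and a held-out frequency of 0.2 for "a dog", the
score is 0.2 / 3: only "a dog" is both novel and attested.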

Prior to May 23rd, I intend to familiarize myself with the
currently-available GA packages in Mahout, using them as-is to develop an
implementation for this specific test case. As the coding quarter begins, I
intend to develop new classes that provide for a more robust utilization of
Watchmaker within the Mahout framework.

As is most likely clear, the details of this outline are very open to
change/revision. Currently I am not familiar with the needs of the community
with regard to the GA package. Thus, I intend to work specifically where I
may be most useful to the project as regards the Watchmaker/GA package.

*Draft Timeline:*

Weeks 1-3: Successfully implement a GA test case using current Mahout tools.

Weeks 4-6: Generate tests using the various evolution engine classes
available in Watchmaker, finding an optimal approach using their tools for
multi-threading, splitting, etc.

Weeks 7-8: Based on the results of the previous tests, rework/append to the
package.

Weeks 9-10: Debug/make modifications as needed to successfully complete the
GSoC commitments.
