Reply-To: mahout-dev@lucene.apache.org
MIME-Version: 1.0
Date: Sun, 29 Mar 2009 17:25:09 -0700
Message-ID: <6df7b01e0903291725q2c51bc06g9767fffa51f97812@mail.gmail.com>
Subject: modifications to GSoC Proposal on wiki
From: Philip Ramsey
To: mahout-dev@lucene.apache.org, Ed Ramsey, "Walter, Brian"
Content-Type: multipart/alternative; boundary=00151750ec4c70b3af04664b1dcc

--00151750ec4c70b3af04664b1dcc
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

All,

I've made substantial changes to my draft proposal on the wiki at:

http://wiki.apache.org/general/SoC2009/PhilipRamsey-Mahout-AlgorithmsProposal

The following is the majority of what I have as my proposal thus far. Please note that the timeline description is rather sparse right now; I will be filling it in shortly, but could definitely use some guidance/advice about whether I am expecting too much or too little of myself, and whether this sounds like it would, in fact, be useful to the project.

Thanks folks,
Philip Ramsey

-----------------------------------------------------------------

*Abstract:*

Currently I am engaged in research that explores word sense disambiguation via co-reference resolution. Much of the work thus far has involved computing the syntactic similarity of words using weighted distance measures between the probabilities of their occurrence in a given bi-gram. I've been using Hadoop to implement these similarity measures, with the goal of generating sets of syntactically similar words from which phrasal/part-of-speech categories can be abstracted.
This research has primarily been based on the results of (Lee, 2001), with the next step looking towards an abstraction of these similar sets for the purpose of grammar induction, as explored by (De Pauw, 2004). My interest in exploring Mahout as a platform for this abstraction began earlier this year, although I have not yet begun working with it.

Building on the work of (De Pauw, 2004) and (Lankhorst, 1994), I propose to implement an evolutionary algorithm framework specifically geared towards inferring grammars over a large dataset, using the Watchmaker implementations currently available in Mahout. The intended consequence of this proposal is to serve as a use-case test, providing patches and resolutions as needed to the near-release-ready GA package currently available in Mahout. The benefit of this test case is that, due to the robust representation needed for NLP, a more precise example for variable optimization strategies can be explored, potentially resulting in fixes for previously unobserved bugs and a more diverse implementation library.

*Detailed Description:*

Using Hadoop, we have been able to generate sets of similar words, and have begun work on linking these sets via the occurrence of a bi-gram, given that a subset of similar words occupies a specific location within it. In this scenario, the training/test data will be broken into parts such that a fitness measure can be computed based on the degree to which the results of a given generation mirror the relative frequencies observable in the data. As such, the goal will be to generate, from the training data, word co-occurrences that have an actual (nonzero) probability in the testing data but do not occur in the training data. As sets of similar words, the generated co-occurrences will work as an inference engine for defining the rules of grammar over the dataset.
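To make the fitness idea above a little more concrete, here is a minimal sketch in plain Java of one way such a measure could be computed. It is only an illustration under stated assumptions: the class BigramFitness, its method names, and the choice of scoring (the mean test-data relative frequency over generated bi-grams that are absent from the training data) are all hypothetical and are not taken from Mahout, Watchmaker, or the cited papers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Mahout or Watchmaker.
final class BigramFitness {

    private BigramFitness() {}

    // Converts raw bi-gram counts into relative frequencies that sum to 1.
    static Map<String, Double> relativeFrequencies(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> freqs = new HashMap<>();
        if (total == 0) {
            return freqs; // no data, so no frequencies
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }

    // Assumed scoring rule: average test-data relative frequency over the
    // generated bi-grams that do NOT occur in the training data. Generated
    // bi-grams missing from the test data contribute zero, which lowers the
    // average and so penalizes spurious guesses.
    static double fitness(Set<String> generated,
                          Set<String> trainingBigrams,
                          Map<String, Double> testFreqs) {
        double mass = 0.0;
        int novel = 0;
        for (String bigram : generated) {
            if (trainingBigrams.contains(bigram)) {
                continue; // already seen in training, so not an inference
            }
            novel++;
            Double f = testFreqs.get(bigram);
            if (f != null) {
                mass += f;
            }
        }
        return novel == 0 ? 0.0 : mass / novel;
    }

    public static void main(String[] args) {
        Map<String, Integer> train = new HashMap<>();
        train.put("the dog", 3);
        train.put("a cat", 1);

        Map<String, Integer> test = new HashMap<>();
        test.put("the cat", 2);
        test.put("a dog", 1);
        test.put("the dog", 1);

        // One candidate from a generation: a set of proposed bi-grams.
        Set<String> candidate = new HashSet<>(
                Arrays.asList("the cat", "a dog", "dog the", "the dog"));

        double score = fitness(candidate, train.keySet(), relativeFrequencies(test));
        System.out.println("fitness = " + score);
    }
}
```

In a Watchmaker-based setup, a function like this would presumably sit behind the framework's fitness evaluation hook, with each candidate decoded into its set of proposed bi-grams first; that wiring is left out here because it depends on the candidate representation that is eventually chosen.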

Prior to May 23rd, I intend to familiarize myself with the currently available GA packages in Mahout, using them as-is to develop an implementation for this specific test case. As the coding quarter begins, I intend to develop new classes that provide for a more robust utilization of Watchmaker within the Mahout framework.

As is most likely clear, the details of this outline are very much open to change/revision. Currently I am not familiar with the needs of the community with regard to the GA package. Thus, I intend to work specifically where I may be most useful to the project as regards the Watchmaker/GA implementation.

*Draft Timeline:*

Week 1-3: Successfully implement the GA test case using current Mahout tools as-is.
Week 4-6: Generate tests using the various evolution engine classes available in Watchmaker, finding an optimal approach using their tools for multi-threading, splitting, etc.
Week 7-8: Based on the results of the previous tests, rework/append to the org.apache.mahout.ga.watchmaker package.
Week 9-10: Debug/make modifications as needed to successfully complete the GSoC commitments.

--00151750ec4c70b3af04664b1dcc--