From: Paul Jones
To: mahout-user@lucene.apache.org
Date: Tue, 23 Jun 2009 20:05:25 -0700 (PDT)
Subject: Re: LSI, cosine and others which use vectors
Message-ID: <671780.48925.qm@web24603.mail.ird.yahoo.com>

Thanks Ted, but if it's a live system and you have 10 million documents, then isn't computing on the fly going to be a pain if you add, say, 1000 docs per hour or whatever? That is why I was assuming that it's a batch process.

Also, I think I have worked out what I meant about the relationships between the words themselves. I think I was looking to build a term-term matrix instead of a term-doc matrix, whereby I have the frequency of occurrence of each word alongside each other word in a doc. (I guess an easy way to start is that the two words can co-occur anywhere in the doc.) If done, hopefully the 'distance' between the two vectors should give me a relative relationship. I realise there are lots of problems with this approach, i.e. I don't know how the words are related... I just know that they are.

Paul

________________________________
From: Ted Dunning
To: mahout-user@lucene.apache.org
Sent: Wednesday, 24 June, 2009 1:52:41
Subject: Re: LSI, cosine and others which use vectors

There are two kinds of changes here.

The first kind is when a single document changes. That will change the
distances between that document and others, but it won't change the
distances between two other documents. Most importantly, it won't change
the distance between queries and other documents.

The second kind of change is due to the first and is relatively
unavoidable. When a document changes, almost inevitably the corpus word
frequencies will change as a result. This changes the weightings applied to
particular terms in documents. When you have many documents of which few
change, these changes will be small enough to ignore.

In practice, you don't much care about what has changed because a live
system computes all similarities or distances on the fly based on the
current state. If the similarities that you have not yet computed change,
you don't care.

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones wrote:

> Yet another question, am going through a rapid learning curve...
>
> All these vector based systems, which require you to build a term-doc etc,
> are they of any use in a system where the data is changing, i.e. let's assume
> the docs are webpages, which are being crawled, and hence updated. Surely if
> there is a vector diagram being formed, then the position of these vectors
> changes based on the changes (size, content) of the entire matrix, or am I
> missing something here.
>
> If the above is correct, then in an actual live project how is this done,
> are distances worked out on a per-day type of basis, and the indexes then
> updated?
>
> Paul

-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
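
For readers following the thread, the term-term idea in Paul's message can be sketched roughly as below. This is a minimal illustration, not anything from Mahout: the function names `cooccurrence_vectors` and `cosine`, the whitespace tokenisation, and the four toy documents are all assumptions made for the example. It counts two words as co-occurring when they appear anywhere in the same document, as the message suggests, and then compares two term vectors by cosine similarity.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_vectors(docs):
    """Map each term to a Counter of the terms it shares a document with."""
    vectors = {}
    for doc in docs:
        # Presence only: position and repeat counts inside a doc are ignored.
        terms = set(doc.lower().split())
        for a, b in combinations(sorted(terms), 2):
            vectors.setdefault(a, Counter())[b] += 1
            vectors.setdefault(b, Counter())[a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term vectors (Counters)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, invented for illustration only.
docs = [
    "apache mahout clustering",
    "apache lucene search",
    "lucene search index",
    "mahout clustering vectors",
]

vectors = cooccurrence_vectors(docs)
sim_close = cosine(vectors["mahout"], vectors["clustering"])
sim_far = cosine(vectors["mahout"], vectors["index"])
print(sim_close, sim_far)
```

In this toy corpus "mahout" and "clustering" share neighbours and score above zero, while "mahout" and "index" share none and score zero. As the message itself notes, the score only says that two words are related, not how. The raw counts are also unweighted, so frequent words would dominate on real data.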