From: Ted Dunning
Date: Wed, 27 Mar 2013 05:51:00 +0100
Subject: Re: How to improve clustering?
To: user@mahout.apache.org

Uh...

Shouldn't you be doing the IDF weighting *before* you normalize the vector
length?

On Tue, Mar 26, 2013 at 5:44 PM, Sebastian Briesemeister <sebastian.briesemeister@unister-gmbh.de> wrote:

> ...
> For each document, I set a field in the corresponding vector to 1 if it
> contains a word. Then I normalize each vector using the L2-norm.
> Finally I multiply each element (representing a word) in the vector by
> log(#documents/#documents_with_word).
>
> For clustering, I am using cosine similarity.
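
A minimal sketch of the two orderings discussed above, in plain Python rather than Mahout. It assumes binary word presence and the log(#documents/#documents_with_word) weight from the quoted message. The function names and the three toy documents are hypothetical, made up only for illustration. It shows that weighting first and normalizing last leaves each vector at unit length, while normalizing first and weighting afterwards does not.

```python
import math


def idf_weights(docs):
    """Return log(#documents / #documents_with_word) for every word."""
    n_docs = len(docs)
    doc_freq = {}
    for words in docs:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def l2_norm(vec):
    """Euclidean length of a sparse vector stored as {word: value}."""
    return math.sqrt(sum(v * v for v in vec.values()))


def l2_normalize(vec):
    """Scale a sparse vector to unit length; an all-zero vector is returned unchanged."""
    norm = l2_norm(vec)
    if norm == 0.0:
        return dict(vec)
    return {w: v / norm for w, v in vec.items()}


def weight_then_normalize(words, idf):
    """IDF weighting first, L2 normalization last (the order suggested in the reply)."""
    weighted = {w: idf[w] for w in set(words)}  # binary presence (1) times IDF
    return l2_normalize(weighted)


def normalize_then_weight(words, idf):
    """L2 normalization first, IDF weighting last (the order in the quoted message)."""
    unit = l2_normalize({w: 1.0 for w in set(words)})
    return {w: v * idf[w] for w, v in unit.items()}


# Hypothetical toy corpus, invented for this example only.
docs = [
    ["mahout", "clustering", "kmeans"],
    ["mahout", "cosine", "similarity"],
    ["clustering", "cosine", "idf"],
]
idf = idf_weights(docs)

a = weight_then_normalize(docs[0], idf)
b = normalize_then_weight(docs[0], idf)

print("weight, then normalize: norm =", l2_norm(a))  # close to 1.0
print("normalize, then weight: norm =", l2_norm(b))  # not 1.0 for this corpus
```

In the second ordering the L2 normalization is undone by the later IDF multiplication, so the vectors handed to the clustering step no longer have unit length.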