Date: Thu, 2 Jul 2009 15:16:54 -0300
Subject: Re: Clustering from DB
From: nfantone
To: mahout-user@lucene.apache.org

Thanks for the feedback, Jeff.

> The logical format of input to KMeans is as it is in sequence
> file format, but the Key is never used. To my knowledge, there is no
> requirement to assign identifiers to the input points*. Users are free to
> associate an arbitrary name field with each vector - also label mappings may
> be assigned - but these are not manipulated by KMeans or any of the other
> clustering applications. The name field is now used as a vector identifier
> by the KMeansClusterMapper - if it is non-null - in the output step only.

The keys may not be used internally, but externally they can prove pretty
useful. For me, keys are userIDs and each Vector represents that user's
historical behavior. Being able to collect the output information as is quite
neat, as it allows me to, for instance, retrieve user information using data
taken directly from a field of an HDFS file.
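
To make the point about keys concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use the KMeans job or the sequence file format; the user IDs, vectors, centroids and the `assign_clusters` helper are all made up for illustration. It only shows why carrying the key through to the output is handy: the result can be read as userID -> cluster without any extra join.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_clusters(points_by_key, centroids):
    """Map each key (here, a user ID) to the index of its nearest centroid."""
    assignments = {}
    for key, vector in points_by_key.items():
        # Pick the centroid index with the smallest Euclidean distance.
        nearest = min(range(len(centroids)),
                      key=lambda i: dist(vector, centroids[i]))
        assignments[key] = nearest
    return assignments


# Hypothetical input: key = user ID, value = behaviour vector.
behaviour_by_user = {
    "user-17": (0.9, 0.1),
    "user-42": (0.2, 0.8),
    "user-99": (1.0, 0.0),
}
# Hypothetical centroids, as if produced by an earlier clustering run.
centroids = [(1.0, 0.0), (0.0, 1.0)]

cluster_by_user = assign_clusters(behaviour_by_user, centroids)
for user_id, cluster_id in sorted(cluster_by_user.items()):
    print(f"{user_id}\t{cluster_id}")
```

Because the key travels with the vector, each output line can be used directly to look up the corresponding user.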