From: Ken Krugler
To: mahout-user@lucene.apache.org
Subject: Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date: Tue, 3 Nov 2009 07:49:22 -0800
In-Reply-To: <5E00122C-1C77-4623-922E-867502CD09AA@apache.org>

On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:

> Might be of interest to all you Mahouts out there...
> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
> Would be cool to get this converted over to our vector format so
> that we can cluster, etc.

How much additional space would be required for the vectors, in some
optimal compressed format? Say as a percentage of raw text size.

I'm asking because I have some flexibility in the processing and
associated metadata I can store as part of the dataset.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g