Date: Tue, 3 Nov 2009 10:14:11 -0800
Message-ID: <4b124c310911031014n6a352705qc937cfdc158c71cc@mail.gmail.com>
References: <5E00122C-1C77-4623-922E-867502CD09AA@apache.org>
Subject: Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
From: Jake Mannix
To: mahout-user@lucene.apache.org

Well the minimum size, for the IntDoubleVector which isn't yet in trunk
(it's on Ted's patch which hasn't worked its way in yet) would entail one
int and one double per unique term in the document, so that's 12 bytes each.

Typical documents have lots of repeat terms, but most terms are smaller
than 12 bytes as well... so the fraction is probably more than 10% and
less than 50% is my guess.

But I'm sure others around here have more experience producing large
vector sets out of the text in Mahout.

  -jake

On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler wrote:

>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>> Might be of interest to all you Mahouts out there...
>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>>
>
> How much additional space would be required for the vectors, in some
> optimal compressed format? Say as a percentage of raw text size.
>
> I'm asking because I have some flexibility in the processing and associated
> metadata I can store as part of the dataset.
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
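
The 12-bytes-per-unique-term estimate in the reply above (one 4-byte int index plus one 8-byte double weight) can be tried against sample text. What follows is only a rough Python sketch, not Mahout code: the function name `vector_size_ratio` and the regex tokenizer are hypothetical, and a real pipeline would use its own analyzer and serialization format, which could change the numbers noticeably.

```python
import re

# Assumption taken from the email: each unique term costs one int (4 bytes)
# plus one double (8 bytes) in the sparse vector.
BYTES_PER_ENTRY = 4 + 8


def vector_size_ratio(text: str) -> float:
    """Estimate sparse-vector bytes divided by raw UTF-8 text bytes."""
    raw_bytes = len(text.encode("utf-8"))
    if raw_bytes == 0:
        return 0.0
    # Hypothetical tokenizer: lowercase, then split into runs of word characters.
    unique_terms = set(re.findall(r"\w+", text.lower()))
    return len(unique_terms) * BYTES_PER_ENTRY / raw_bytes


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    print(f"vector/text size ratio: {vector_size_ratio(sample):.2f}")
```

The ratio depends strongly on document length and repetition: a very short snippet with few repeated terms can exceed 100%, while long documents with many repeats fall much lower, so the estimate is best run over a representative sample of the actual corpus.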