mahout-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Fri, 13 Nov 2009 18:36:54 GMT
Hi all,

Another issue came up, about cleaning the text.

One interested user suggested using nCleaner (see http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as a way of tossing boilerplate text that skews text frequency data.

Any thoughts on this?

Thanks,

-- Ken

On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:

> Might be of interest to all you Mahouts out there...  http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
> Would be cool to get this converted over to our vector format so  
> that we can cluster, etc.

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




