On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:

> Might be of interest to all you Mahouts out there...
> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
> Would be cool to get this converted over to our vector format so
> that we can cluster, etc.

How much additional space would be required for the vectors, in some optimal compressed format? Say as a percentage of raw text size.

I'm asking because I have some flexibility in the processing and associated metadata I can store as part of the dataset.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g