mahout-user mailing list archives

From Miles <miles...@gmail.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Fri, 13 Nov 2009 19:48:32 GMT
A very simple way to remove boilerplate (and this is trivial using
MapReduce) is to just remove all duplicate sentences. This does assume
you can extract sentences, i.e. do sentence boundary detection, etc.
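
The duplicate-sentence idea above can be sketched in a few lines. This is only a minimal single-process illustration of the map and reduce steps, not a Hadoop job. The regex sentence splitter and all function names are hypothetical stand-ins, and it assumes "duplicate" means a sentence that occurs in more than one document, with every copy dropped.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive stand-in for real sentence boundary detection (assumption):
    # split after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize(sentence):
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(sentence.lower().split())


def map_phase(docs):
    # Map: emit one (normalized sentence, doc_id) pair per sentence.
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            yield normalize(sentence), doc_id


def reduce_phase(pairs):
    # Reduce: group by sentence, keep the keys seen in more than one document.
    doc_ids_by_sentence = defaultdict(set)
    for key, doc_id in pairs:
        doc_ids_by_sentence[key].add(doc_id)
    return {key for key, ids in doc_ids_by_sentence.items() if len(ids) > 1}


def remove_duplicate_sentences(docs):
    # Second pass: drop every sentence the reduce step flagged as duplicated.
    duplicated = reduce_phase(map_phase(docs))
    cleaned = {}
    for doc_id, text in docs.items():
        kept = [s for s in split_sentences(text) if normalize(s) not in duplicated]
        cleaned[doc_id] = " ".join(kept)
    return cleaned
```

For example, given two pages that share a navigation line, `remove_duplicate_sentences` keeps only the sentences unique to each page. On a real cluster the grouping would be done by the shuffle, keyed on the normalized sentence.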

Miles
Sent from your Ipod


On 13 Nov 2009, at 19:06, Ted Dunning <ted.dunning@gmail.com> wrote:

> This looks like a very nice approach for getting rid of the goo.  I often
> advocate using words/phrases/ngrams that are highly predicted by the domain
> name as an alternative for removing boilerplate.  That has the advantage
> that it doesn't require training text.  In the case of Wikipedia, this is
> not so useful because everything is in the same domain.  The domain
> predictor trick will only work if the feature you are using for the input
> is not very content based.  Thus, this can fail for small domain-focused
> sites or if you use a content-laden URL for the task.
>
>
>
> On Fri, Nov 13, 2009 at 10:36 AM, Ken Krugler
> <kkrugler_lists@transpac.com> wrote:
>
>> Hi all,
>>
>> Another issue came up, about cleaning the text.
>>
>> One interested user suggested using nCleaner (see
>> http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as a
>> way of tossing boilerplate text that skews text frequency data.
>>
>> Any thoughts on this?
>>
>> Thanks,
>>
>> -- Ken
>>
>>
>> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>>
>>> Might be of interest to all you Mahouts out there...
>>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>>
>>> Would be cool to get this converted over to our vector format so
>>> that we can cluster, etc.
>>>
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>>
>>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
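
Ted's quoted suggestion, flagging terms that are highly predicted by the domain name, could be sketched roughly as below. The message does not say how "highly predicted" is scored. This sketch assumes a log-likelihood ratio (G^2) test on a 2x2 table of page counts, and the function names, input shape, and threshold of 10.0 are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def domain_boilerplate_terms(pages, threshold=10.0):
    # pages: iterable of (domain, terms) pairs, one per page (assumed shape).
    pages = list(pages)
    domain_pages = Counter()   # pages per domain
    term_pages = Counter()     # pages containing each term
    joint = Counter()          # pages in a domain containing a term
    for domain, terms in pages:
        domain_pages[domain] += 1
        for term in set(terms):
            term_pages[term] += 1
            joint[(domain, term)] += 1

    total = len(pages)
    flagged = defaultdict(set)
    for (domain, term), k11 in joint.items():
        k12 = domain_pages[domain] - k11   # in domain, without term
        k21 = term_pages[term] - k11       # other domains, with term
        k22 = total - k11 - k12 - k21      # other domains, without term
        # Keep only terms over-represented in this domain, not under-represented.
        over_represented = k11 * (k21 + k22) > k21 * (k11 + k12)
        if over_represented and llr(k11, k12, k21, k22) >= threshold:
            flagged[domain].add(term)
    return dict(flagged)
```

With pages from a single domain the over-representation check never passes, so nothing is flagged, which matches the point above about Wikipedia.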
