mahout-user mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Collocation clarification
Date Sat, 16 Jan 2010 02:41:00 GMT
It's not very hard to collect the abbreviations. It may be less work
than coding what's in the paper.
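
A minimal sketch of the list-based approach, assuming a tiny hand-collected exception set. The names `ABBREVIATIONS` and `split_sentences` are hypothetical and not from any existing Mahout code; a real list would be much longer and language-specific.

```python
import re

# Hypothetical, hand-collected exception list (assumption for illustration).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "st."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after ., ! or ? when followed by whitespace and a capital letter,
    unless the token ending at that mark is a listed abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in abbreviations:
            # Exception: the full stop belongs to the abbreviation,
            # so the capitalized word does not start a new sentence.
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

For example, `split_sentences("Dr. Smith arrived. He met Mr. Jones at noon.")` keeps "Dr. Smith" and "Mr. Jones" together and yields two sentences.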

On Fri, Jan 15, 2010 at 9:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The exceptions are a list of abbreviations that have a terminal full stop,
> but which are customarily followed by a capitalized word which is not the
> start of a new sentence.
>
> It looks to me like machine learning has come a long way in this regard.
> This is the best paper on the subject that I have seen in a quick search.
>
> Unsupervised Multilingual Sentence Boundary Detection
> <http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf> by
> Kiss and Strunk.
>
> It doesn't require any lexical resources and can improve performance on the
> fly by adapting to the language that it is working against.
>
> The fundamental insight is that abbreviations are situations where a stem is
> very commonly followed by a full stop and a sentence start marker is
> something that is very commonly preceded by a full stop or other sentence
> marker.  From these and a few other intuitions, they build a system that is
> pretty darned accurate.  One major component of its accuracy is due to the
> ability to adapt on the fly to the corpus in use.
>
> A deficiency in our use case would be the requirement for training text, but
> that could be solved with a few moderately sized resources that are the result
> of training on reference texts for different languages.
>
> On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <drew.farris@gmail.com> wrote:
>
>> I've found abbrevs, various identifiers, etc. are sort of a typical case
>> where these things fall flat. I'll see how it performs vs. writing
>> something from scratch and see what I can come up with.
>>
>> > Right, although just slightly ironic that we are using a rule-based
>> > system for a machine learning project.
>>
>> Heh, indeed, but it seems entirely appropriate in this case. Of
>> course, now I need to go read about statistical approaches to sentence
>> boundary detection.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
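
The intuition quoted above, that an abbreviation is a stem very commonly followed by a full stop, can be illustrated with a simple counting sketch. This is not the Kiss and Strunk method, which uses more careful statistics; it is a crude ratio test, and the function name `likely_abbreviations` and the `min_count` and `min_ratio` thresholds are assumptions made up for illustration.

```python
from collections import Counter


def likely_abbreviations(tokens, min_count=2, min_ratio=0.9):
    """Return lowercased stems that are nearly always followed by a full stop.

    A crude ratio test standing in for the paper's statistics (assumption).
    """
    with_stop = Counter()  # occurrences of the stem with a trailing "."
    total = Counter()      # all occurrences of the stem
    for token in tokens:
        stem = token.rstrip(".").lower()
        if not stem:
            continue  # token was only full stops
        total[stem] += 1
        if token.endswith("."):
            with_stop[stem] += 1
    return {
        stem
        for stem, n in total.items()
        if n >= min_count and with_stop[stem] / n >= min_ratio
    }
```

On whitespace-split text, a word such as "Dr." that never appears without its full stop passes the test, while an ordinary word that only sometimes ends a sentence does not. Because the counts come from the corpus itself, the candidate list adapts to whatever text it is run on, which is the property the quoted message highlights.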
