opennlp-users mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Document Classification
Date Thu, 26 Apr 2012 07:45:21 GMT
On 04/26/2012 03:37 AM, Lance Norskog wrote:
> Cool! Yeah, Tika has one also.
>
> Now for the annoying use case: older web sites and pre-web text in
> Southeast Asia and India/Pakistan are written in phonetic US-ASCII.
> (They only had that technology available.) Does anybody do
> classification on that kind of text?
>

I never did. It only does bag-of-words feature generation, so
to make that work you need to tokenize your input text first.
We have a learnable tokenizer (which must be trained), a
character-class tokenizer and a whitespace tokenizer.
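
To illustrate the difference between the two rule-based options, here is a
rough sketch in plain Java. It does not use the OpenNLP classes and is not
their implementation; the class name TokenizeSketch and all method names are
made up for this example, and the character classes (letter, digit, other)
are an assumption about what a character-class tokenizer might distinguish.
The last method shows the kind of bag-of-words counts that would be built
from the tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenizeSketch {

    // Whitespace tokenization: split on runs of whitespace only,
    // so punctuation stays attached to the neighbouring word.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Coarse character classes (an assumption made for this sketch):
    // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else.
    static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Character-class tokenization: start a new token whenever the
    // character class changes, and drop the whitespace runs.
    static List<String> charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                    || charClass(text.charAt(i)) != charClass(text.charAt(i - 1));
            if (boundary) {
                if (charClass(text.charAt(start)) != 0) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
        }
        return tokens;
    }

    // Bag of words: count how often each token occurs, ignoring order.
    static Map<String, Integer> bagOfWords(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Namaste, duniya! Namaste 2012";
        System.out.println(whitespaceTokens(text));
        System.out.println(charClassTokens(text));
        System.out.println(bagOfWords(charClassTokens(text)));
    }
}
```

With the whitespace variant, "Namaste," (with the comma) and "Namaste" end
up as two different features; the character-class variant splits the comma
off, so both occurrences count toward the same word. For romanized text
that matters, because the bag-of-words features are only as good as the
tokens they are built from.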

Jörn

