opennlp-issues mailing list archives

From "Jeff Zemerick (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
Date Thu, 21 Jun 2018 11:56:00 GMT

     [ https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zemerick updated OPENNLP-1182:
-----------------------------------
    Fix Version/s:     (was: 1.8.5)

> LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
> ---------------------------------------------------------------------------
>
>                 Key: OPENNLP-1182
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1182
>             Project: OpenNLP
>          Issue Type: Bug
>    Affects Versions: 1.8.4
>            Reporter: Steve Rowe
>            Priority: Major
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora collection at folder leipzig-train/ to the default Language Detector format, by creating groups of 5 sentences as documents and limiting to 10000 documents per language. Then, it shuffles the result and selects the first 100000 lines as train corpus and the last 20000 as evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
> $ head -100000 < leipzig_shuf.txt > leipzig.train
> $ tail -20000 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}
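
The shuffle-and-split step in the quoted commands (the perl, head and tail lines) can be sketched in Python as below. This is only an illustration under stated assumptions, not part of OpenNLP: the function name shuffle_and_split, its parameters, and the placeholder sample lines are hypothetical. One detail worth noting is that head and tail select from the same shuffled file, so the train and evaluation slices overlap whenever the file has fewer lines than the two counts combined; the sketch rejects that case explicitly, which is a design choice of the sketch and not something the quoted commands do.

```python
import random


def shuffle_and_split(lines, n_train, n_eval, seed=None):
    """Shuffle lines, then return (first n_train, last n_eval).

    Hypothetical helper mirroring the quoted perl/head/tail steps.
    """
    # head and tail read the same file, so they overlap when it is too short.
    if n_train + n_eval > len(lines):
        raise ValueError("train and eval slices would overlap")

    shuffled = list(lines)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility

    train = shuffled[:n_train]                          # like: head -n_train
    held_out = shuffled[len(shuffled) - n_eval:]        # like: tail -n_eval
    return train, held_out


# Placeholder documents; real input would be the lines of leipzig.txt.
samples = [f"document {i}\n" for i in range(12)]
train, held_out = shuffle_and_split(samples, n_train=8, n_eval=3, seed=0)
```

With the counts from the quoted docs this would be called with n_train=100000 and n_eval=20000, which requires at least 120000 lines in the converted file.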



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
