opennlp-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: English 300k sentences Leipzig Corpus for test
Date Thu, 14 Mar 2013 17:56:40 GMT
I just use the files with the number in front of each sentence, so just
get a 300k one and do the tokenization and POS tagging as described below.
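
For reference, a minimal command-line sketch of that workflow (assuming the
standard OpenNLP 1.5 CLI launcher and the English models en-token.bin and
en-pos-maxent.bin; the Leipzig file name is just an example):

  # optionally strip the leading sentence id (tab-separated number)
  sed 's/^[0-9]*\t//' eng_news_2010_300K-sentences.txt > sentences.txt

  # tokenize (one sentence per line in, whitespace-separated tokens out)
  opennlp TokenizerME en-token.bin < sentences.txt > tokens.txt

  # POS tag the tokenized sentences (word_TAG pairs out)
  opennlp POSTagger en-pos-maxent.bin < tokens.txt > pos.txt

  # repeat with 1.5.2 and 1.5.3 and diff the outputs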

Jörn

On 03/14/2013 03:53 PM, Jörn Kottmann wrote:
> If I remember correctly the file already has one sentence per line. I used
> the tokenizer to tokenize it and the POS Tagger to tag it. Once you have
> done that, you have input files for all the tools.
>
> You may need to remove the sentence id at the beginning of each line, e.g.
> with sed. You can also just leave it there; it doesn't really matter for
> this test.
>
> Jörn
>
> On 03/14/2013 03:45 PM, William Colen wrote:
>> Hi,
>>
>> I could not find a way to convert from Leipzig to formats other than
>> DocCat samples. Is it possible to convert from Leipzig to SentenceSample
>> using the OpenNLP tools?
>>
>> Thank you,
>> William
>>
>>
>> On Thu, Mar 14, 2013 at 9:51 AM, Jörn Kottmann <kottmann@gmail.com> 
>> wrote:
>>
>>>
>>>
>>> -------- Original Message --------
>>> Subject:        Re: English 300k sentences Leipzig Corpus for test
>>> Date:   Thu, 14 Mar 2013 09:48:21 -0300
>>> From:   William Colen <william.colen@gmail.com>
>>> To:     Jörn Kottmann <kottmann@gmail.com>
>>>
>>>
>>>
>>> Yes, you can forward.
>>>
>>> It is not clear to me how to convert it. I could only find 
>>> converters from
>>> Leipzig to DocCat.
>>>
>>>
>>> On Thu, Mar 14, 2013 at 6:09 AM, Jörn Kottmann <kottmann@gmail.com> 
>>> wrote:
>>>
>>>> Do you mind if I forward this to the dev list?
>>>>
>>>> Yes, you need to convert the data into input data. The idea
>>>> is that we process the data with 1.5.2 and 1.5.3 and see if the output
>>>> is still identical; if it's not identical, it's either a change in our
>>>> code or a bug.
>>>>
>>>> It doesn't really matter which file you download as long as it has
>>>> enough sentences; it would be nice if you could note in the test plan
>>>> which one you used.
>>>>
>>>> Hopefully I will have some time over the weekend to do the tests on the
>>>> private data I have.
>>>>
>>>> Jörn
>>>>
>>>>
>>>> On 03/13/2013 11:38 PM, William Colen wrote:
>>>>
>>>>> Hi, Jörn,
>>>>>
>>>>> I would like to start testing with the Leipzig Corpus. Do you know the
>>>>> steps to do it?
>>>>>
>>>>> I downloaded the file named eng_news_2010_300K-text.tar.gz,
>>>>> and now I would use the converter to extract documents from it.
>>>>>
>>>>> After that, I would try to use the output of one module as input to
>>>>> the next. Is that correct?
>>>>>
>>>>> Thank you,
>>>>> William
>>>>>
>>>>>
>>>>>
>>>
>>>
>

