opennlp-dev mailing list archives

From Mark G <giaconiam...@gmail.com>
Subject Re: MarkableFileInputStreamFactory
Date Tue, 20 May 2014 11:26:31 GMT
That is correct: the sentence file does not need annotations, and the other files are one name
per line.
It uses the names file to annotate the sentences, and won't annotate anything that's in the
blacklist file.
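
To make the roles of the three files concrete, here is a minimal sketch in plain Java. It is only an illustration of the behaviour described above, not the add-on's actual code: the class AnnotationSketch and the method annotate are hypothetical, it matches single whitespace-separated tokens only, and it assumes the OpenNLP name-finder markup (<START:type> ... <END>) shown later in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AnnotationSketch {

    // Hypothetical helper, not part of OpenNLP or the modelbuilder add-on.
    // Wraps every token found in knownEntities with name-finder markup,
    // unless that token also appears in the blacklist.
    static String annotate(String sentence, Set<String> knownEntities,
                           Set<String> blacklist, String type) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (knownEntities.contains(token) && !blacklist.contains(token)) {
                out.append("<START:").append(type).append("> ")
                   .append(token).append(" <END>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the names file and the blacklist file (one entry per line).
        Set<String> knownEntities =
                new HashSet<>(Arrays.asList("Archimedes", "Socrates", "Method"));
        Set<String> blacklist = new HashSet<>(Arrays.asList("Method"));

        // Stand-in for one line of the unannotated sentence file.
        String sentence = "Archimedes used the Method of exhaustion .";

        // prints: <START:person> Archimedes <END> used the Method of exhaustion .
        System.out.println(annotate(sentence, knownEntities, blacklist, "person"));
    }
}
```

The real add-on also trains a model and iterates, which this sketch leaves out entirely; it only shows why the sentence file can stay unannotated while the other two files are plain lists.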

Let me know how it goes!



> On May 20, 2014, at 4:16 AM, Carlos Scheidecker <nando.nlp@gmail.com> wrote:
> 
> I have not moved forward on it, but yes Mark, I want to use it.
> 
> I have seen one of your examples.
> 
> But I have not figured out the proper format of the files. Here's what I
> think from what I have been reading. Tell me if I am right.
> 
> From class DefaultModelBuilderUtil method generateModel
> 
> @param sentences        a file that contains one sentence per line.
>    *                    There should be at least 15K sentences
>    *                    consisting of a representative sample from
>    *                    user data
> 
> This seems to be a text file where each sentence is on one line.
> I wonder if it has to be annotated, for instance:
> 
> <START:person> Archimedes <END> used the method of exhaustion to
> approximate the value of π. Archimedes ( 287–212 BC ) was the first
> to estimate π rigorously .
> 
> Or just:
> 
> Archimedes used the method of exhaustion to approximate the value of
> π. Archimedes ( 287–212 BC ) was the first to estimate π rigorously .
> 
> 
> @param knownEntities    a file consisting of a simple list of
>   *                     unambiguous entities, one entry per line.
>   *                     For instance, if one was trying to build a
>   *                     person NER model then this file would be a
>   *                     list of person names that are unambiguous
>   *                     and are known to exist in the sentences
> 
> This would be a text file list?
> 
> Something like one name per line?
> 
> Archimedes
> Socrates
> ....
> 
> 
> * @param knownEntitiesBlacklist   This file contains a list of known bad hits
>   *                                 that the NER phase of this processing might
>   *                                 catch early on before the model iterates
>   *                                 to maturity
> 
> Same as the knownEntities but a list of what NOT to mark as an entity?
> 
> 
> The rest seemed quite straightforward.
> 
> Thanks,
> 
> Carlos.
> 
> 
> 
> 
>> On Mon, May 19, 2014 at 5:34 PM, Mark G <giaconiamark@gmail.com> wrote:
>> 
>> No problem. Carlos, are you using the model builder add-on?
>> 
>> 
>> Mg
>> 
>>>> On May 19, 2014, at 6:29 PM, Carlos Scheidecker <nando.nlp@gmail.com>
>>> wrote:
>>> 
>>> Thanks mate! Saw you updated the code. Cheers.
>>> 
>>> 
>>>> On Mon, May 19, 2014 at 3:24 PM, Mark G <giaconiamark@gmail.com> wrote:
>>>> 
>>>> OK, thanks Carlos, I think I will commit the change, seems like it wouldn't
>>>> hurt. Anybody else?
>>>> 
>>>> 
>>>> On Mon, May 19, 2014 at 5:07 PM, Carlos Scheidecker <nando.nlp@gmail.com> wrote:
>>>> 
>>>>> I am having the same issue Mark.
>>>>> 
>>>>> The class is not public, so it is not visible
>>>>> inside opennlp.addons.modelbuilder.impls.GenericModelableImpl; therefore it
>>>>> cannot be built with Maven or resolved inside Eclipse.
>>>>> 
>>>>> I have also been looking for new commits to fix that and there were none.
>>>>> 
>>>>> 
>>>>>> On Mon, May 12, 2014 at 1:03 PM, Mark G <markg@apache.org> wrote:
>>>>>> 
>>>>>> Does MarkableFileInputStreamFactory need to be package private? I am using
>>>>>> it in an addon (modelbuilder-addon), I would like to either move it or make
>>>>>> it a public class. Perhaps I should be using a different class altogether?
>>>>>> 
>>>>>> I am using it like this
>>>>>> 
>>>>>>    ObjectStream<String> lineStream = new PlainTextByLineStream(
>>>>>>        new MarkableFileInputStreamFactory(params.getAnnotatedTrainingDataFile()),
>>>>>>        charset);
>>>>>>    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
>>>>>> 
>>>>>> where getAnnotatedTrainingDataFile returns a java File object.
>>>>>> 
>>>>>> thanks
>> 
