opennlp-dev mailing list archives

From Svetoslav Marinov <svetoslav.mari...@findwise.com>
Subject Re: Size of training data
Date Mon, 29 Apr 2013 12:32:18 GMT
OK, I hope I am doing this correctly. I take the counter for the sample
objects from sampleStream:

ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

I use sampleStream.read() and then get 468 fewer samples than the number of
sentences (which is 2 611 247). Shouldn't the number of samples read from
sampleStream match the number of sentences? I have samples without entities,
but I suspect there are more than 468 of them. I will check, though.

Otherwise, I am not sure where to measure how many samples are processed per
second. Do you mean during the creation of the NE model, or somewhere else?
How does one do that?
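
For the per-second measurement, one option is to wrap the stream that is
handed to the trainer in a small counting decorator. The following is a
minimal sketch under stated assumptions, not OpenNLP code: Source is a
hypothetical stand-in for ObjectStream (it only keeps the convention that
read() returns null at the end of the stream), and CountingSource, fromList
and reportEvery are made-up names used for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SampleCounter {

    /** Hypothetical stand-in for ObjectStream: read() returns null at end of stream. */
    public interface Source<T> {
        T read() throws IOException;
    }

    /** Wraps a source, counts every sample it hands out, and reports throughput. */
    public static final class CountingSource<T> implements Source<T> {
        private final Source<T> delegate;
        private final long reportEvery;
        private final long startNanos = System.nanoTime();
        private long count;

        public CountingSource(Source<T> delegate, long reportEvery) {
            if (reportEvery <= 0) {
                throw new IllegalArgumentException("reportEvery must be positive");
            }
            this.delegate = delegate;
            this.reportEvery = reportEvery;
        }

        @Override
        public T read() throws IOException {
            T sample = delegate.read();
            if (sample != null) {
                count++;
                // Periodic progress line; a count that grows past the number
                // of sentences would point to an endless stream.
                if (count % reportEvery == 0) {
                    System.err.printf("%d samples, %.0f samples/s%n",
                            count, samplesPerSecond());
                }
            }
            return sample;
        }

        public long count() {
            return count;
        }

        /** Average throughput since this wrapper was created. */
        public double samplesPerSecond() {
            double seconds = (System.nanoTime() - startNanos) / 1e9;
            return seconds > 0 ? count / seconds : 0.0;
        }
    }

    /** Adapts an in-memory list to the Source contract, for demonstration only. */
    public static <T> Source<T> fromList(List<T> items) {
        Iterator<T> it = items.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) throws IOException {
        CountingSource<String> counted =
                new CountingSource<>(fromList(Arrays.asList("a", "b", "c")), 2);
        while (counted.read() != null) {
            // drain the stream, as the trainer would
        }
        System.out.println(counted.count()); // three samples were read
    }
}
```

In a real run the wrapper would presumably implement ObjectStream<NameSample>
itself, forward reset() and close() to the wrapped stream, and be passed to
NameFinderME.train in place of the original sampleStream.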

Thank you!

Svetoslav

On 2013-04-29 11:14, "Jörn Kottmann" <kottmann@gmail.com> wrote:

>It's a bit hard to diagnose the problem, but my best guess here is that
>for some reason the sample object stream is endless or the feature
>generation is very slow.
>
>Can you add a counter to your code which provides the sample objects? The
>count should not exceed your number of sentences; if the stream is endless,
>it might be bigger after an hour or two.
>
>Can you measure how many of them are processed per second (it should be
>more than 1k samples per second)? If the throughput is too low, it might
>just need a lot of time.
>
>Jörn
>
>On 04/29/2013 10:55 AM, Svetoslav Marinov wrote:
>> Yes, the process is at 100% CPU utilization and this is the only thing I
>> get from the jstack, no matter how many times I repeat it:
>>
>> 2013-04-29 10:47:17
>> Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):
>>
>> "Attach Listener" daemon prio=10 tid=0x00007f31a8001000 nid=0xf42b
>> waiting on condition [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
>> runnable [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
>> waiting on condition [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
>> waiting on condition [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
>> runnable [0x0000000000000000]
>>     java.lang.Thread.State: RUNNABLE
>>
>> "Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
>> Object.wait() [0x00007f31ca3db000]
>>     java.lang.Thread.State: WAITING (on object monitor)
>> 	at java.lang.Object.wait(Native Method)
>> 	- waiting on <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
>> 	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
>> 	- locked <0x0000000400b8f660> (a java.lang.ref.ReferenceQueue$Lock)
>> 	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
>> 	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
>>
>> "Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
>> Object.wait() [0x00007f31ca4dc000]
>>     java.lang.Thread.State: WAITING (on object monitor)
>> 	at java.lang.Object.wait(Native Method)
>> 	- waiting on <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
>> 	at java.lang.Object.wait(Object.java:502)
>> 	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
>> 	- locked <0x0000000400b8f5f8> (a java.lang.ref.Reference$Lock)
>>
>> "main" prio=10 tid=0x00007f31d0007800 nid=0xe267 waiting on condition
>> [0x00007f31d8923000]
>>     java.lang.Thread.State: RUNNABLE
>> 	at java.util.Arrays.copyOfRange(Arrays.java:3221)
>> 	at java.lang.String.<init>(String.java:233)
>> 	at java.lang.StringBuilder.toString(StringBuilder.java:447)
>> 	at opennlp.tools.util.featuregen.TokenClassFeatureGenerator.createFeatures(TokenClassFeatureGenerator.java:46)
>> 	at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:109)
>> 	at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
>> 	at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
>> 	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
>> 	at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
>> 	at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:103)
>> 	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:126)
>> 	at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:37)
>> 	at opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
>> 	at opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
>> 	at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:126)
>> 	at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
>> 	at opennlp.model.TrainUtil.train(TrainUtil.java:173)
>> 	at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
>> 	at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)
>>
>> "VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable
>>
>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268 runnable
>>
>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269 runnable
>>
>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a runnable
>>
>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b runnable
>>
>> "VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
>> waiting on condition
>>
>> JNI global references: 1139
>>
>>
>>
>> On 2013-04-29 10:26, "Jörn Kottmann" <kottmann@gmail.com> wrote:
>>
>>> On 04/29/2013 09:59 AM, Svetoslav Marinov wrote:
>>>> Below is a jstack output. It is now the third day it is running, and
>>>> it seems like the process has hung somewhere. I still haven't changed
>>>> the indexer to be one pass, so it is still two pass.
>>>>
>>>> I just wonder how long I should wait?
>>> Looks like it's still fetching the events from the source. The methods
>>> we can see in the stack dump are calculating the hash sum of the events,
>>> but I doubt that this is broken.
>>>
>>> Is the process at 100% CPU utilization? Is it still in the hash sum code
>>> if you repeat the jstack command a few times?
>>>
>>> Jörn
>>>
>>
>
>


