opennlp-dev mailing list archives

From Svetoslav Marinov <svetoslav.mari...@findwise.com>
Subject Re: Size of training data
Date Mon, 29 Apr 2013 07:59:51 GMT
Hi again, 

Below is a jstack output. It is now the third day it has been running, and
the process seems to have hung somewhere. I still haven't changed the
indexer to one pass, so it is still two pass.

I just wonder how long I should wait?

Thanks!

Svetoslav

------------------------------

Indexing events using cutoff of 6

        Computing event counts...  2013-04-26 14:37:22
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
        - locked <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 runnable
[0x00007f31d8923000]
   java.lang.Thread.State: RUNNABLE
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:367)
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:390)
        at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:254)
        at java.lang.StringCoding.encode(StringCoding.java:289)
        at java.lang.String.getBytes(String.java:954)
        at opennlp.model.HashSumEventStream.next(HashSumEventStream.java:55)
        at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:127)
        at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
        at opennlp.model.TrainUtil.train(TrainUtil.java:173)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
        at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable 

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable 

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition

JNI global references: 1139

Heap
 PSYoungGen      total 2581440K, used 2388216K [0x00000006aaab0000, 0x000000074b590000, 0x0000000800000000)
  eden space 2530304K, 94% used [0x00000006aaab0000,0x000000073c6ee120,0x00000007451b0000)
  from space 51136K, 0% used [0x00000007451b0000,0x00000007451b0000,0x00000007483a0000)
  to   space 48512K, 0% used [0x0000000748630000,0x0000000748630000,0x000000074b590000)
 PSOldGen        total 167168K, used 167167K [0x0000000400000000, 0x000000040a340000, 0x00000006aaab0000)
  object space 167168K, 99% used [0x0000000400000000,0x000000040a33fff0,0x000000040a340000)
 PSPermGen       total 21248K, used 4039K [0x00000003f5a00000, 0x00000003f6ec0000, 0x0000000400000000)
  object space 21248K, 19% used [0x00000003f5a00000,0x00000003f5df1fe8,0x00000003f6ec0000)

2013-04-26 14:39:09
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):


"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
        - locked <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 runnable
[0x00007f31d8923000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOfRange(Arrays.java:3221)
        at java.lang.String.<init>(String.java:233)
        at java.lang.StringBuilder.toString(StringBuilder.java:447)
        at opennlp.tools.util.featuregen.TokenFeatureGenerator.createFeatures(TokenFeatureGenerator.java:41)
        at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:95)
        at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
        at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
        at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
        at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
        at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:103)
        at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:126)
        at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:37)
        at opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
        at opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
        at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:126)
        at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
        at opennlp.model.TrainUtil.train(TrainUtil.java:173)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
        at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable 

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable 

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition

JNI global references: 1139

Heap
 PSYoungGen      total 2581440K, used 2267572K [0x00000006aaab0000, 0x000000074b590000, 0x0000000800000000)
  eden space 2530304K, 89% used [0x00000006aaab0000,0x000000073511d138,0x00000007451b0000)
  from space 51136K, 0% used [0x00000007451b0000,0x00000007451b0000,0x00000007483a0000)
  to   space 48512K, 0% used [0x0000000748630000,0x0000000748630000,0x000000074b590000)
 PSOldGen        total 167168K, used 167167K [0x0000000400000000, 0x000000040a340000, 0x00000006aaab0000)
  object space 167168K, 99% used [0x0000000400000000,0x000000040a33fff0,0x000000040a340000)
 PSPermGen       total 21248K, used 4039K [0x00000003f5a00000, 0x00000003f6ec0000, 0x0000000400000000)
  object space 21248K, 19% used [0x00000003f5a00000,0x00000003f5df1fe8,0x00000003f6ec0000)
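
On the question of how long to wait: in the two dumps above the main thread is RUNNABLE both times and sits in different frames (HashSumEventStream.next at 14:37:22, feature generation under HashSumEventStream.hasNext at 14:39:09). That suggests it is still iterating inside TwoPassDataIndexer.computeEventCounts rather than being blocked. The same check could be done from inside the JVM by sampling a thread's top stack frame a few times. The sketch below is only an illustration under assumptions: StackSampler and sampleTopFrames are hypothetical names, not part of OpenNLP, and the busy worker thread merely stands in for the training thread.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not part of OpenNLP): samples a thread's top stack frame. */
final class StackSampler {

    /**
     * Returns the distinct top frames observed for the target thread.
     * Several different frames hint that the thread is making progress;
     * a single frame repeated over a long period hints that it is stuck.
     */
    static Set<String> sampleTopFrames(Thread target, int samples, long intervalMillis) {
        Set<String> seen = new LinkedHashSet<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                StackTraceElement top = stack[0];
                seen.add(top.getClassName() + "." + top.getMethodName()
                        + ":" + top.getLineNumber());
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop sampling early.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Stand-in for the training thread: a busy loop doing string work.
        Thread worker = new Thread(() -> {
            long sink = 0;
            while (!Thread.currentThread().isInterrupted()) {
                sink += Long.toString(sink).getBytes().length;
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();

        Set<String> frames = sampleTopFrames(worker, 20, 50L);
        System.out.println(frames.size() + " distinct top frames: " + frames);
        worker.interrupt();
    }
}
```

This only shows whether the thread is moving, not how much work remains; an event counter around the event stream would be needed for that.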



On 2013-04-26 13:41, "Jörn Kottmann" <kottmann@gmail.com> wrote:

>The Two Pass Data Indexer is the default, if you have a machine with enough
>memory you might wanna try the One Pass Data Indexer.
>Anyway, it would be nice to get a jstack to see where is spending its time,
>maybe there is an I/O issue?
>
>The training can take very long, but the data indexing should work.
>
>To change the indexer you can set this parameter:
>DataIndexer=OnePass
>
>HTH,
>Jörn
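
For reference, the DataIndexer=OnePass setting above, together with the Cutoff 6 and Iterations 200 values from the set-up quoted below, could be collected as plain key/value pairs. This is a minimal sketch using only java.util.Properties. The class name TrainingParamsSketch is hypothetical, and the "Cutoff" and "Iterations" key spellings are an assumption (only DataIndexer=OnePass is given verbatim in this thread). How the pairs are handed to the trainer depends on the OpenNLP version in use.

```java
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical sketch: builds the training settings discussed in this thread. */
final class TrainingParamsSketch {

    /** Settings from the thread; the first two key names are assumed spellings. */
    static Properties onePassParams() {
        Properties params = new Properties();
        params.setProperty("Iterations", "200");       // assumed key name
        params.setProperty("Cutoff", "6");             // assumed key name
        params.setProperty("DataIndexer", "OnePass");  // as given in the thread
        return params;
    }

    /** Renders the settings as sorted key=value lines, one per line. */
    static String render(Properties params) {
        StringBuilder out = new StringBuilder();
        for (String key : new TreeSet<>(params.stringPropertyNames())) {
            out.append(key).append('=').append(params.getProperty(key)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(onePassParams()));
    }
}
```

Keeping the settings in one place makes it easy to switch between OnePass and TwoPass when comparing memory use against indexing time.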
>
>On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:
>> I prefer the API as it gives me more flexibility and fits the overall
>> architecture of our components. But here is part of my set-up:
>>
>> Cutoff 6
>> Iterations 200
>> CustomFeatureGenerator with looking at the 4 previous and 2 subsequent
>> tokens.
>>
>> So, I gave it a whole night and I saw the process was dead in the morning.
>> But I'll give it another try and will let you know.
>>
>> Thank you!
>>
>> Svetoslav
>>
>>
>> On 2013-04-26 12:42, "Jörn Kottmann" <kottmann@gmail.com> wrote:
>>
>>> I always edit the opennlp script and change it to what I need.
>>>
>>> Anyway, we have a Two Pass Data Indexer which writes the features to disk
>>> to save memory during indexing, depending on how you train you might
>>> have a cutoff=5 which eliminates probably a lot of your features and
>>> therefore saves a lot of memory.
>>>
>>> The indexing might just need a bit of time, how long did you wait?
>>>
>>> Jörn
>>>
>>> On 04/26/2013 12:33 PM, William Colen wrote:
>>>>   From command line you can specify memory using
>>>>
>>>> MAVEN_OPTS="-Xmx4048m"
>>>>
>>>> You can also set it as JVM arguments if you are using from the API:
>>>>
>>>> java -Xmx4048m ...
>>>>
>>>>
>>>>
>>>> On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
>>>> svetoslav.marinov@findwise.com> wrote:
>>>>
>>>>> I use the API. Can one specify the memory size via the command line? I
>>>>> think the default there is 1024M? At 8G memory during "computing event
>>>>> counts...", at 16G during indexing: "Computing event counts...  done.
>>>>> 50153300 events
>>>>>           Indexing..."
>>>>>
>>>>> Svetoslav
>>>>>
>>>>> On 2013-04-26 09:12, "Jörn Kottmann" <kottmann@gmail.com> wrote:
>>>>>
>>>>>> On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>>>>>>> I'm wondering what is the max size (if such exists) for training a
>>>>>>> NER model? I have a corpus of 2 600 000 sentences annotated with just
>>>>>>> one category, 310M in size. However, the training never finishes - 8G
>>>>>>> memory resulted in java out of memory exception, and when I increased
>>>>>>> it to 16G it just died with no error message.
>>>>>> Do you use the command line interface or the API for the training?
>>>>>> At which stage of the training did you get the out of memory
>>>>>> exception?
>>>>>> Where did it just die when you used 16G of memory (maybe do a jstack)?
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>
>>>
>
>
