lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
Date Fri, 22 Aug 2014 12:12:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106759#comment-14106759 ]

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1619773 from [~sarowe@syr.edu] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1619773 ]

LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer
tokenize extremely slowly over long sequences of text partially matching certain grammar rules.
The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting
in much, much faster tokenization for these text sequences. (merged trunk r1619730)
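
The effect described in the commit message can be illustrated with a toy scanner. This is
only a sketch under stated assumptions, not the JFlex-generated code and not the actual
change in r1619773: the class LookaheadSketch, the method tokenize, the parameter
maxLookahead, the two simplified rules (EMAIL and WORD), and the cap of 255 used in the
usage note are all hypothetical. It shows how text that keeps matching an email local-part
without ever reaching an '@' forces a longest-match scanner to read far ahead from every
token start, and how bounding that lookahead bounds the work.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadSketch {
    /** Characters examined by the most recent tokenize() call. */
    static long examined;

    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /**
     * Toy longest-match scanner with two hypothetical rules:
     *   EMAIL = [a-z0-9.]+ '@' [a-z0-9]+
     *   WORD  = [a-z0-9]+
     * maxLookahead caps how far one match attempt may read past the token start.
     */
    static List<String> tokenize(String text, int maxLookahead) {
        examined = 0;
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int limit = (int) Math.min((long) text.length(), (long) pos + maxLookahead);
            int i = pos;
            int wordEnd = pos;      // end of the WORD match starting at pos
            boolean inWord = true;  // false once a '.' has ended the WORD prefix
            // Try the EMAIL local-part; this is the potentially long scan.
            while (i < limit && (isWordChar(text.charAt(i)) || text.charAt(i) == '.')) {
                examined++;
                if (inWord && text.charAt(i) != '.') {
                    wordEnd = i + 1;
                } else {
                    inWord = false;
                }
                i++;
            }
            int emailEnd = -1;
            if (i > pos && i < limit && text.charAt(i) == '@') {
                int domainStart = ++i;
                while (i < limit && isWordChar(text.charAt(i))) {
                    examined++;
                    i++;
                }
                if (i > domainStart) {
                    emailEnd = i;
                }
            }
            if (emailEnd > 0) {            // EMAIL matched: longest match wins
                tokens.add(text.substring(pos, emailEnd));
                pos = emailEnd;
            } else if (wordEnd > pos) {    // fall back to the shorter WORD match
                tokens.add(text.substring(pos, wordEnd));
                pos = wordEnd;
            } else {                       // separator character: skip it
                pos++;
            }
        }
        return tokens;
    }
}
```

In this sketch, calling tokenize on a long run such as "ab.ab.ab." with maxLookahead set to
Integer.MAX_VALUE rescans to the end of the input from every token start, so the examined
count grows roughly quadratically with input length. With maxLookahead set to 255 the same
tokens come out, but each attempt reads at most 255 characters. The trade-off in the toy is
that an email address longer than the cap would be split.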

> Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5400
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.5
>            Reporter: Chris Geeringh
>            Assignee: Steve Rowe
>
> This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not
> sure how to consistently reproduce it, but I have done so numerous times. Switching to a
> whitespace tokenizer improved indexing speed, and I never got the issue again.
> I'm running a 4.6 Snapshot. I had issues with deadlocks with numerous versions of Solr,
> and have finally narrowed down the problem to this code, which affects many/all(?) versions
> of Solr.
> When the thread hits this issue it uses 100% CPU; restarting the node which has the error
> allows indexing to continue until the issue is hit again. Here is the thread dump:
> http-bio-8080-exec-45 (201)
>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
>     org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>     org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>     org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>     org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>     org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
>     org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
>     org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
>     org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
>     org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>     org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
>     org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
>     org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
>     org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
>     org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
>     org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
>     org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
>     org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>     org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>     org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>     org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>     org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>     org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>     org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>     org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>     org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>     org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
>     org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>     org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
>     org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>     org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     java.lang.Thread.run(Unknown Source)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

