Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1977CD66B for ; Fri, 16 Nov 2012 19:38:14 +0000 (UTC) Received: (qmail 85955 invoked by uid 500); 16 Nov 2012 19:38:12 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 85859 invoked by uid 500); 16 Nov 2012 19:38:12 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 85848 invoked by uid 99); 16 Nov 2012 19:38:12 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Nov 2012 19:38:12 +0000 Date: Fri, 16 Nov 2012 19:38:12 +0000 (UTC) From: "Jay Hacker (JIRA)" To: dev@lucene.apache.org Message-ID: <1790246471.125416.1353094692698.JavaMail.jiratomcat@arcas> In-Reply-To: <133967939.125406.1353094572150.JavaMail.jiratomcat@arcas> Subject: [jira] [Updated] (SOLR-4090) Indexing mangles documents under load MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/SOLR-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Hacker updated SOLR-4090: ----------------------------- Attachment: docs.xml docs.json Here are the test documents that make Solr bork. > Indexing mangles documents under load > ------------------------------------- > > Key: SOLR-4090 > URL: https://issues.apache.org/jira/browse/SOLR-4090 > Project: Solr > Issue Type: Bug > Affects Versions: 4.0 > Environment: Red Hat 5.8 x86_64, Oracle JDK 1.7.0_09 > Reporter: Jay Hacker > Attachments: docs.json, docs.xml > > > Sometimes when indexing documents with multiple concurrent processes, I get intermittent data corruption errors. The data submitted for indexing is valid, but Solr will complain that it is malformed and return an HTTP 500 or 400 error. The errors are similar whether submitting as XML, JSON, or CSV. The problem has not occurred using a single process (i.e., one process POSTing to Solr), is rare with four processes, and is pretty reproducible with 16 (or 64, or 128). I've seen this problem since at least Solr 1.4; we generally just use four threads and hope for the best. It seems a bit more common with Solr 4, so I decided to track it down. > I have dummy documents for the example ("collection1") schema that, when posted using many simultaneous processes, often generate parsing errors. I use something like this to pound Solr with repeated POSTs of the same document: > {code} > yes docs.xml | xargs -P64 -I{} curl -s --data-binary @{} --header 'Content-type: text/xml' 'http://localhost:8983/solr/collection1/update' > {code} > Which fairly reliably gives this error: > {noformat} > SEVERE: org.apache.solr.common.SolrException: Unexpected '<' in attribute value > at [row,col {unknown-source}]: [2765,74] > at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159) > at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) > at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) > at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) > at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) > at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) > at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) > at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) > at org.eclipse.jetty.server.Server.handle(Server.java:351) > at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) > at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) > at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890) > at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230) > at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66) > at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254) > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599) > at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534) > at java.lang.Thread.run(Thread.java:722) > Caused by: com.ctc.wstx.exc.WstxParsingException: Unexpected '<' in attribute value > at [row,col {unknown-source}]: [2765,74] > at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630) > at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461) > at com.ctc.wstx.sr.BasicStreamReader.parseNormalizedAttrValue(BasicStreamReader.java:1951) > at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3037) > at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2936) > at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848) > at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019) > at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:370) > at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:229) > at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157) > ... 31 more > {noformat} > When posting the same data as JSON: > {code} > yes docs.json | xargs -P64 -I{} curl -s --data-binary @{} --header 'Content-type: application/json' 'http://localhost:8983/solr/collection1/update' > {code} > The errors look like this: > {noformat} > SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=3] multiple values encountered for non multiValued field f0008_t: [388888888888888888888888888888888888888888888888888, 9888888888888888888888888888888888888888888888888888] > at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:243) > at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73) > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:208) > at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) > at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:432) > at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:557) > at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325) > at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) > at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:386) > at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:111) > at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95) > at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59) > at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) > at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) > at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) > at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) > at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) > at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) > at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) > at org.eclipse.jetty.server.Server.handle(Server.java:351) > at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) > at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) > at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890) > at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230) > at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66) > at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254) > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599) > at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534) > at java.lang.Thread.run(Thread.java:722) > {noformat} > I would like to emphasize that the stack traces are misleading: there is no problem with the data. Often the exact same document will index just fine a thousand times before throwing an error. Somewhere between being submitted and parsed, the data is being occasionally corrupted. The errors are intermittent, but when the do happen, they often complain about exactly the same byte in the input. > Here's what I know: > * Small POSTs, or single large documents, don't seem to trigger the problem; it seems to take multiple documents totaling ~150K or more. > * More threads trigger the error more often. > * Restricting the JVM heap to say -Xmx384M seems to trigger the problem more often. > * Setting {{commit=true}} with each post seems to alleviate the problem. > * I can't seem to reproduce it with 3.6.1, or 4.1-2012-11-16_01-01-43 nightly -- maybe related to [SOLR-3621|https://issues.apache.org/jira/browse/SOLR-3621]? > * To test if it was Jetty, I wrote a small servlet that just decodes JSON and pounded it under both Jetty 8.1.2 and 8.1.7 without problems -- however, [SOLR-4031|https://issues.apache.org/jira/browse/SOLR-4031] sounds awfully similar, and... > * If I unzip the 4.0 tarball, and replace jetty (example/{lib/,start.jar}) with the jetty 8.1.7 from 4.1 nightly, it seems to fix the problem. > I'm posting this in case anyone runs into similar problems, and meanwhile I will keep testing 4.1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org