lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jay Hacker (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-4090) Indexing mangles documents under load
Date Fri, 16 Nov 2012 19:36:11 GMT
Jay Hacker created SOLR-4090:
--------------------------------

             Summary: Indexing mangles documents under load
                 Key: SOLR-4090
                 URL: https://issues.apache.org/jira/browse/SOLR-4090
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.0
         Environment: Red Hat 5.8 x86_64, Oracle JDK 1.7.0_09
            Reporter: Jay Hacker


Sometimes when indexing documents with multiple concurrent processes, I get intermittent data
corruption errors.  The data submitted for indexing is valid, but Solr will complain that
it is malformed and return an HTTP 500 or 400 error.  The errors are similar whether submitting
as XML, JSON, or CSV.  The problem has not occurred using a single process (i.e., one process
POSTing to Solr), is rare with four processes, and is pretty reproducible with 16 (or 64,
or 128).  I've seen this problem since at least Solr 1.4; we generally just use four threads
and hope for the best.  It seems a bit more common with Solr 4, so I decided to track it down.
 

I have dummy documents for the example ("collection1") schema that, when posted using many
simultaneous processes, often generate parsing errors.  I use something like this to pound
Solr with repeated POSTs of the same document:

{code}
yes docs.xml | xargs -P64 -I{} curl -s --data-binary @{} --header 'Content-type: text/xml'
'http://localhost:8983/solr/collection1/update'
{code}

Which fairly reliably gives this error:

{noformat}
SEVERE: org.apache.solr.common.SolrException: Unexpected '<'  in attribute value
 at [row,col {unknown-source}]: [2765,74]
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:722)
Caused by: com.ctc.wstx.exc.WstxParsingException: Unexpected '<'  in attribute value
 at [row,col {unknown-source}]: [2765,74]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
        at com.ctc.wstx.sr.BasicStreamReader.parseNormalizedAttrValue(BasicStreamReader.java:1951)
        at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3037)
        at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2936)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
        at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:370)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:229)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
        ... 31 more
{noformat}


When posting the same data as JSON:

{code}
yes docs.json | xargs -P64 -I{} curl -s --data-binary @{} --header 'Content-type: application/json'
'http://localhost:8983/solr/collection1/update'
{code}

The errors look like this:

{noformat}
SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=3] multiple values encountered for
non multiValued field f0008_t: [388888888888888888888888888888888888888888888888888, 9888888888888888888888888888888888888888888888888888]
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:243)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:208)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:432)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:557)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:386)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:111)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95)
        at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:722)
{noformat}

I would like to emphasize that the stack traces are misleading: there is no problem with the
data. Often the exact same document will index just fine a thousand times before throwing
an error.  Somewhere between being submitted and parsed, the data is being occasionally corrupted.
 The errors are intermittent, but when the do happen, they often complain about exactly the
same byte in the input.

Here's what I know:

* Small POSTs, or single large documents, don't seem to trigger the problem; it seems to take
multiple documents totaling ~150K or more.
* More threads trigger the error more often.
* Restricting the JVM heap to say -Xmx384M seems to trigger the problem more often.
* Setting {{commit=true}} with each post seems to alleviate the problem.
* I can't seem to reproduce it with 3.6.1, or 4.1-2012-11-16_01-01-43 nightly -- maybe related
to [SOLR-3621|https://issues.apache.org/jira/browse/SOLR-3621]?
* To test if it was Jetty, I wrote a small servlet that just decodes JSON and pounded it under
both Jetty 8.1.2 and 8.1.7  without problems -- however, [SOLR-4031|https://issues.apache.org/jira/browse/SOLR-4031]
sounds awfully similar, and...
* If I unzip the 4.0 tarball, and replace jetty (example/{lib/,start.jar}) with the jetty
8.1.7 from 4.1 nightly, it seems to fix the problem.

I'm posting this in case anyone runs into similar problems, and meanwhile I will keep testing
4.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message