From: "Julien Coloos (JIRA)"
To: solr-dev@lucene.apache.org
Date: Tue, 26 Jan 2010 16:38:34 +0000 (UTC)
Subject: [jira] Updated: (SOLR-1283) Mark Invalid error on indexing
Message-ID: <2059249637.40701264523914786.JavaMail.jira@brutus.apache.org>
In-Reply-To: <57382092.1247687534851.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/SOLR-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Coloos updated SOLR-1283:
--------------------------------

    Attachment: SOLR-1283.patch

The issue also occurs in current trunk (revision 903234), with the class {{HTMLStripCharFilter}} (which seems to replace the deprecated {{HTMLStripReader}}).
Example of stacktrace:
{noformat}
26 janv. 2010 16:02:56 org.apache.solr.common.SolrException log
GRAVE: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(BufferedReader.java:485)
    at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63)
    at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:172)
    at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
    at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
    at java.io.Reader.read(Reader.java:122)
    at org.apache.lucene.analysis.CharTokenizer.incrementToken(CharTokenizer.java:77)
    at org.apache.lucene.analysis.ISOLatin1AccentFilter.incrementToken(ISOLatin1AccentFilter.java:43)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:383)
    at org.apache.lucene.analysis.ISOLatin1AccentFilter.next(ISOLatin1AccentFilter.java:64)
    at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:379)
    at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:318)
    at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
    at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
    at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:406)
    at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:97)
    at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:83)
    at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:321)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:781)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:764)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2630)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2602)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1317)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:723)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
{noformat}

After a quick code review, it seems this one is due to the {{peek}} function, which can read a character from the input stream without incrementing the {{numRead}} variable (as is done in the {{next}} function); the functions that check whether the _read ahead_ limit was reached rely on {{numRead}}.
The exception can then be triggered when reading exceeds the _read ahead_ limit, for example with a big document containing a malformed processing instruction like
{noformat}
????? ... (anything except '?>')
{noformat}

Note: the issue is triggered here because {{readProcessingInstruction}} calls {{peek}} whenever the character '{{?}}' is found (to check whether it is followed by '{{>}}').

You will find attached a patch to fix the issue, as well as an updated JUnit test (which only checks the malformed processing instruction case; you may find a more general test to perform on the {{next}}/{{peek}} functions).

Regards

> Mark Invalid error on indexing
> ------------------------------
>
>                 Key: SOLR-1283
>                 URL: https://issues.apache.org/jira/browse/SOLR-1283
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 1.3
>         Environment: Ubuntu 8.04, Sun Java 6
>            Reporter: solrize
>         Attachments: SOLR-1283.patch
>
>
> When indexing large (1 megabyte) documents I get a lot of exceptions with stack traces like the below. It happens both in the Solr 1.3 release and in the July 9 1.4 nightly. I believe this to NOT be the same issue as SOLR-42.
> I found some further discussion on solr-user: http://www.nabble.com/IOException:-Mark-invalid-while-analyzing-HTML-td17052153.html
> In that discussion, Grant asked the original poster to open a Jira issue, but I didn't see one so I'm opening one; please feel free to merge or close if it's redundant.
> My stack trace follows.
> Jul 15, 2009 8:36:42 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={} status=500 QTime=3
> Jul 15, 2009 8:36:42 AM org.apache.solr.common.SolrException log
> SEVERE: java.io.IOException: Mark invalid
>     at java.io.BufferedReader.reset(BufferedReader.java:485)
>     at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
>     at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
>     at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
>     at java.io.Reader.read(Reader.java:123)
>     at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:108)
>     at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:178)
>     at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
>     at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
>     at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:347)
>     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
>     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
>     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:748)
>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2512)
>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2484)
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1292)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>     at org.mortbay.jetty.Server.handle(Server.java:285)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
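
The {{peek}}/{{numRead}} mismatch described in the comment above can probably be reproduced in isolation. The following is only a minimal sketch, not the Solr source and not the attached patch: the class {{PeekMarkDemo}}, the scanner {{PiScanner}}, the {{countPeeks}} flag, the 8-character buffer and the limit of 16 are all assumptions made for illustration. Only the names {{peek}}, {{next}} and {{numRead}} come from the comment. It marks a {{BufferedReader}}, scans for '?>' while trusting {{numRead}} to stay below the limit, and then tries to reset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

/** Illustrative only: a hypothetical scanner, not the Solr HTMLStripCharFilter source. */
class PeekMarkDemo {

    static final class PiScanner {
        private static final int EMPTY = -2;  // pushback slot is empty (-1 is reserved for EOF)
        private final BufferedReader in;
        private final int limit;              // max chars the scan may consume before giving up
        private final boolean countPeeks;     // true = peek() also updates numRead
        private int numRead = 0;
        private int pushed = EMPTY;

        PiScanner(String text, int limit, boolean countPeeks) {
            // Tiny buffer (an assumption for the demo) so the reader has to refill;
            // a refill is when BufferedReader drops a mark that has been overrun.
            this.in = new BufferedReader(new StringReader(text), 8);
            this.limit = limit;
            this.countPeeks = countPeeks;
        }

        private int next() throws IOException {
            if (pushed != EMPTY) {            // hand back the char that peek() already read
                int ch = pushed;
                pushed = EMPTY;
                return ch;
            }
            numRead++;
            return in.read();
        }

        private int peek() throws IOException {
            if (pushed == EMPTY) {
                pushed = in.read();           // consumes a char from the underlying reader
                if (countPeeks) {
                    numRead++;                // the accounting the comment says is missing
                }
            }
            return pushed;
        }

        /** Looks for "?>"; if it is not found within the limit, rewinds to the mark. */
        boolean skipProcessingInstruction() throws IOException {
            in.mark(limit + 1);
            numRead = 0;
            while (numRead < limit) {
                int ch = next();
                if (ch < 0) {
                    break;                    // end of input
                }
                if (ch == '?' && peek() == '>') {
                    next();                   // consume the '>'
                    return true;
                }
            }
            pushed = EMPTY;
            in.reset();                       // throws IOException if the mark was invalidated
            return false;
        }
    }

    /** Scans 1000 '?' characters; returns "found", "rewound", or the exception message. */
    static String scan(boolean countPeeks) {
        char[] body = new char[1000];
        Arrays.fill(body, '?');
        try {
            boolean found = new PiScanner(new String(body), 16, countPeeks).skipProcessingInstruction();
            return found ? "found" : "rewound";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("peek not counted: " + scan(false));
        System.out.println("peek counted:     " + scan(true));
    }
}
```

With the OpenJDK {{BufferedReader}}, the first call should end in the same "Mark invalid" message as the stack traces, because every '?' is fetched through {{peek}} and {{numRead}} never grows past 1. The second call stops at the limit and rewinds cleanly. Counting peeked characters is one possible fix; whether the attached patch takes this route is not stated in the comment.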