lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: String bytes can be at most 32766 characters in length?
Date Wed, 02 Sep 2015 15:29:54 GMT
Hi Erick,

Yes, I'm trying out De-Duplication too, but I'm facing a problem with
it: indexing stops working once I put the following De-Duplication
configuration into solrconfig.xml. The problem seems to be with the <str
name="update.chain">dedupe</str> line.

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>


  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">content</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
  </updateRequestProcessorChain>
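
In case it helps anyone hitting the same symptom: a custom
updateRequestProcessorChain replaces the default chain entirely, so if it
does not end with solr.LogUpdateProcessorFactory and
solr.RunUpdateProcessorFactory, documents are never actually written to the
index, which matches "indexing stops working". A sketch of the chain with
those two processors appended (untested against this exact setup):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- Without these two, the chain never hands the document to the index. -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain can also be selected per request via the update.chain parameter
on the /update URL, which makes it easy to test without touching the
handler defaults.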


Regards,
Edwin
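
P.S. For anyone searching the archives: the 32766 limit in the trace below
is on the UTF-8 byte length of a single indexed term, not on the character
count, so multi-byte text hits it with far fewer characters. A quick
standalone check (plain Java, no Solr dependency; the class and method
names are just illustrative):

```java
import java.nio.charset.StandardCharsets;

public class TermLimitCheck {
    // Lucene rejects any single indexed term longer than this many UTF-8 bytes.
    static final int MAX_TERM_BYTES = 32766;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(40000);    // 40000 chars -> 40000 UTF-8 bytes
        String cjk = "\u6587".repeat(11000); // 11000 chars -> 33000 UTF-8 bytes

        // Both exceed the limit; the CJK string does so with far fewer characters.
        System.out.println(utf8Length(ascii) > MAX_TERM_BYTES); // true
        System.out.println(utf8Length(cjk) > MAX_TERM_BYTES);   // true
    }
}
```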

On 2 September 2015 at 23:10, Erick Erickson <erickerickson@gmail.com>
wrote:

> Yes, that is an intentional limit for the size of a single token,
> which strings are.
>
> Why not use deduplication? See:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> You don't have to replace the existing documents; Solr will
> compute a hash that can be used to identify identical documents,
> and you can use *that*.
>
> Best
> Erick
>
> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
> <edwinyeozl@gmail.com> wrote:
> > Hi,
> >
> > I would like to check: must a string field value be at most
> > 32766 bytes in length?
> >
> > I'm trying to do a copyField of my rich-text documents' content to a
> > field with fieldType=string, to try to get distinct results for content,
> > as there are several documents with the exact same content and we only
> > want to list one of them during searching.
> >
> > However, I get the following errors on some of the documents when I try
> > to index them with the copyField. Some of my documents are quite large,
> > and it is possible that they exceed 32766 characters. Is there any other
> > way to overcome this problem?
> >
> >
> > org.apache.solr.common.SolrException: Exception writing document id
> > collection1_polymer100 to the index; possible analysis error.
> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> >   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> >   at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >   at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> >   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> >   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> >   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> >   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >   at org.eclipse.jetty.server.Server.handle(Server.java:497)
> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> >   at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> >   at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.IllegalArgumentException: Document contains at least
> > one immense term in field="signature" (whose UTF8 encoding is longer than
> > the max length 32766), all of which were skipped.  Please correct the
> > analyzer to not produce such terms.  The prefix of the first immense term
> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
> > original message: bytes can be at most 32766 in length; got 49960
> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
> >   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> >   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> >   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
> >   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
> >   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
> >   ... 38 more
> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
> > bytes can be at most 32766 in length; got 49960
> >   at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
> >   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
> >   ... 45 more
> >
> >
> > Regards,
> > Edwin
>
