manifoldcf-dev mailing list archives

From "Shinichiro Abe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
Date Mon, 06 Jul 2015 19:05:05 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615489#comment-14615489 ]

Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------

Thank you for the review. I added the Maximumdocumentlength parameters and field to the branch in r1689479.

It seems to me that the isInteger() function in editconnection.jsp doesn't strictly check for
an integer value, IIUC. Is that expected? The Solr connector's max length check in the JSP
also accepts a long value.
BTW, if Integer.MAX_VALUE were used for the field, the StringBuilder init would raise an OOM when
adding a big binary in the connection, because the char array would exceed its max capacity.
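
To illustrate both points, here is a minimal Java sketch, not the code in the branch. The class and method names (MaxLengthSketch, parseMaxLength, readBounded) and the 8192 initial capacity are assumptions made up for this example. It rejects anything that is not a positive int, and it lets the StringBuilder grow on demand instead of sizing it to the configured max length up front.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper for illustration only; not part of ManifoldCF or the connector.
final class MaxLengthSketch {

  // Assumed starting size for the buffer; chosen arbitrarily for this sketch.
  private static final int INITIAL_CAPACITY = 8192;

  /** Strictly parses a positive int; rejects blanks, signs, decimals and long values. */
  static int parseMaxLength(String raw) {
    // At most 10 digits, so the value always fits in a long before the range check.
    if (raw == null || !raw.matches("[0-9]{1,10}")) {
      throw new IllegalArgumentException("not a positive integer: " + raw);
    }
    long value = Long.parseLong(raw);
    if (value < 1 || value > Integer.MAX_VALUE) {
      throw new IllegalArgumentException("out of int range: " + raw);
    }
    return (int) value;
  }

  /** Reads at most maxLength chars; returns null when the input is longer. */
  static String readBounded(Reader in, int maxLength) throws IOException {
    // Start small and grow as needed, rather than new StringBuilder(maxLength),
    // which would try to allocate the whole char array immediately.
    StringBuilder sb = new StringBuilder(Math.min(maxLength, INITIAL_CAPACITY));
    char[] buf = new char[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      // Compare as long so the sum cannot overflow when maxLength is near Integer.MAX_VALUE.
      if ((long) sb.length() + n > maxLength) {
        return null; // too long: the caller would reject the document
      }
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }
}
```

With this shape, a field value of Integer.MAX_VALUE only costs memory in proportion to the content actually read, and a value such as 2147483648 is rejected at validation time.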

Big binaries can now be rejected at ingestion by setting the max length, but I found other OOMs
that were caused by Lucene.

{noformat}
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$FieldData.<init>(CompressingTermVectorsWriter.java:157)
	at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$DocData.addField(CompressingTermVectorsWriter.java:106)
	at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.startField(CompressingTermVectorsWriter.java:287)
	at org.apache.lucene.index.TermVectorsConsumerPerField.finishDocument(TermVectorsConsumerPerField.java:81)
	at org.apache.lucene.index.TermVectorsConsumer.finishDocument(TermVectorsConsumer.java:110)
	at org.apache.lucene.index.TermsHash.finishDocument(TermsHash.java:93)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:316)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
	at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321)
	at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221)
{noformat}
I will add a term_vector true|false option to the fields.
{noformat}
Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:345)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.writeField(CompressingStoredFieldsWriter.java:297)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:361)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
	at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321)
	at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester
{noformat}
This OOM could be resolved by a Tika write limit.
 

> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, CONNECTORS-1219-v0.2.patch
>
>
> An output connector that writes to a local Lucene index directly, not via a remote search engine. It
> would be nice if we could use the various Lucene APIs on the index directly, even though we could
> do the same thing with a Solr or Elasticsearch index. I assume we could do things like classification,
> categorization, and tagging, using e.g. the lucene-classification package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
