manifoldcf-user mailing list archives

From silvio.r.me...@quantentunnel.de
Subject Question regarding Tika text extraction and Elasticsearch
Date Sun, 15 May 2016 14:21:02 GMT
<html><head></head><body><div style="font-family: Verdana;font-size:
12.0px;"><div>
<div>Hi Apache ManifoldCF user list</div>

<div>&nbsp;</div>

<div>I&rsquo;m experimenting with Apache ManifoldCF 2.3 to index our company&rsquo;s
Windows network shares. The setup is Elasticsearch 1.7.4 and Apache ManifoldCF 2.3, with
MS Active Directory as the authority source.<br/>
I defined a job whose connection configuration consists of the following chain
(the order in the list is the order of processing):</div>

<div><br/>
1.&nbsp;&nbsp; &nbsp;Repository connection (MS Network Share)<br/>
2.&nbsp;&nbsp; &nbsp;Allowed documents<br/>
3.&nbsp;&nbsp; &nbsp;Tika extractor<br/>
4.&nbsp;&nbsp; &nbsp;Metadata adjuster<br/>
5.&nbsp;&nbsp; &nbsp;Elasticsearch</div>

<div>&nbsp;</div>

<div>I do this because I don&rsquo;t want to store the original document in
the Elasticsearch index, only the text extracted from it. This works so far. However,
numerous documents cause an exception of the following kind when Apache ManifoldCF
analyzes them and sends them to the indexer. Note that the exception is raised inside
Elasticsearch (in its JSON parser, according to the stack trace):</div>

<div>&nbsp;</div>

<div>[2016-03-16 22:22:43,884][DEBUG][action.index&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
] [Tefral the Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed
to execute [index {[sharein<br/>
dex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
source[{&quot;access_permission:extract_for_access<br/>
ibility&quot; : &quot;true&quot;,&quot;dcterms:created&quot; : &quot;2016-03-02T13:03:47Z&quot;,&quot;access_permission:can_modify&quot;
: &quot;true&quot;,&quot;access_permission:modify_annotations&quot; : &quot;true&quot;,&quot;Creation-Date&quot;
: &quot;2016-03-02T1<br/>
3:03:47Z&quot;,&quot;fileLastModified&quot; : &quot;2016-03-02T13:03:37.433Z&quot;,&quot;access_permission:fill_in_form&quot;
: &quot;true&quot;,&quot;created&quot; : &quot;Wed Mar 02 14:03:47 CET
2016&quot;,&quot;stream_size&quot; : &quot;52067&quot;,&quot;dc:format&quot;
:<br/>
&nbsp;&quot;application&#92;/pdf; version=1.4&quot;,&quot;access_permission:can_print&quot;
: &quot;true&quot;,&quot;stream_name&quot; : &quot;M&#9500;&ntilde;useTastaturen
2.3.16 - Kopie.pdf&quot;,&quot;xmp:CreatorTool&quot; : &quot;Canon iR-ADV
C5250&nbsp; PDF&quot;,&quot;resourc<br/>
eName&quot; : &quot;M&#9500;&ntilde;useTastaturen 2.3.16 - Kopie.pdf&quot;,&quot;fileCreatedOn&quot;
: &quot;2016-03-16T21:22:24.085Z&quot;,&quot;access_permission:assemble_document&quot;
: &quot;true&quot;,&quot;meta:creation-date&quot; : &quot;2016-03-02T13:03:<br/>
47Z&quot;,&quot;lastModified&quot; : &quot;Wed Mar 02 14:03:37 CET 2016&quot;,&quot;pdf:PDFVersion&quot;
: &quot;1.4&quot;,&quot;X-Parsed-By&quot; : &quot;org.apache.tika.parser.DefaultParser&quot;,&quot;shareName&quot;
: &quot;AppDevData&#36;&quot;,&quot;access_permission:<br/>
can_print_degraded&quot; : &quot;true&quot;,&quot;xmpTPg:NPages&quot;
: &quot;1&quot;,&quot;createdOn&quot; : &quot;Wed Mar 16 22:22:24 CET
2016&quot;,&quot;pdf:encrypted&quot; : &quot;false&quot;,&quot;access_permission:extract_content&quot;
: &quot;true&quot;,&quot;producer&quot; :<br/>
&quot;Adobe PSL 1.2e for Canon &quot;,&quot;attributes&quot; : &quot;32&quot;,&quot;Content-Type&quot;
: &quot;application&#92;/pdf&quot;,&quot;allow_token_document&quot; :
[&quot;LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152&quot;,&quot;LDAPConn:S<br/>
-1-5-21-1751174259-1996115066-1435642685-16153&quot;,&quot;LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894&quot;],&quot;deny_token_document&quot;
: &quot;LDAPConn:DEAD_AUTHORITY&quot;,&quot;allow_token_share&quot; : &quot;<br/>
__nosecurity__&quot;,&quot;deny_token_share&quot; : &quot;__nosecurity__&quot;,&quot;allow_token_parent&quot;
: &quot;__nosecurity__&quot;,&quot;deny_token_parent&quot; : &quot;__nosecurity__&quot;,&quot;content&quot;
: &quot;&quot;}]}]<br/>
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction&#36;PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction&#36;PrimaryPhase&#36;1.doRun(TransportShardReplicationOperationAction.java:440)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at java.util.concurrent.ThreadPoolExecutor&#36;Worker.run(ThreadPoolExecutor.java:617)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at java.lang.Thread.run(Thread.java:745)<br/>
Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ... 11 more<br/>
Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character
((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string va<br/>
lue<br/>
&nbsp;at [Source: [B@5b774e8b; line: 1, column: 1145]<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ... 14 more</div>

<div>&nbsp;</div>

<div>This happens for documents of different types and extensions, such as PDF and
XLSX. It seems that Tika sometimes does not remove special characters such as the null
character (U+0000). Because control characters must be escaped inside a JSON request,
Elasticsearch rejects the request and the document is not indexed at all. Is there a way
to work around this problem with the existing functionality of Apache ManifoldCF?</div>
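
<div>To make the failure mode concrete, here is a minimal Python sketch. It is not
ManifoldCF, Tika, or Elasticsearch code; the sample text and the helper names
(strip_control_chars, is_valid_json) are made up for illustration. It only assumes that
the extracted text contains a raw null character and that the JSON request is built
without escaping it, which is what the parser error above suggests.</div>

```python
import json
import re

# Control characters U+0000-U+001F must not appear unescaped inside a JSON string.
# This pattern covers all of them except tab, line feed and carriage return,
# which are usually worth keeping in extracted text.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text: str) -> str:
    """Remove control characters such as NUL from extracted text."""
    return _CONTROL_CHARS.sub("", text)


def is_valid_json(payload: str) -> bool:
    """Return True if a strict JSON parser accepts the payload."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical extractor output containing a raw NUL character.
extracted = "Maus\x00Tastatur"

# Building the request by plain string concatenation leaves the NUL unescaped,
# so a strict parser rejects the whole document.
raw_payload = '{"content": "' + extracted + '"}'

# Option 1: drop the control characters before building the request.
clean_payload = '{"content": "' + strip_control_chars(extracted) + '"}'

# Option 2: let a JSON serializer escape them (NUL becomes \u0000).
escaped_payload = json.dumps({"content": extracted})

if __name__ == "__main__":
    print("raw payload valid:    ", is_valid_json(raw_payload))
    print("cleaned payload valid:", is_valid_json(clean_payload))
    print("escaped payload valid:", is_valid_json(escaped_payload))
```

<div>Either option yields a request that a strict JSON parser accepts. Whether one of
them can be applied with the existing connectors in the chain above is exactly the open
question.</div>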

<div>&nbsp;</div>

<div>Regards<br/>
Silvio</div>

<div>&nbsp;</div>
</div></div></body></html>
