manifoldcf-user mailing list archives

From Silvio Meier <silvio.r.me...@quantentunnel.de>
Subject Re: Question regarding Tika text extraction and elasticsearch
Date Mon, 16 May 2016 17:14:20 GMT
Hi Karl

Thanks for the fast response and the patch. I'll patch the version that I have. Will the patch
be included in the next official release of Apache ManifoldCF?

Regards
Silvio


On 15.05.2016 18:37, Karl Wright wrote:
> Here's the patch.  Relatively short.
>
> Karl
>
>
> On Sun, May 15, 2016 at 12:27 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>     Apparently there is a permitted way to encode this, and I
>     have a patch, but JIRA is down.  If it doesn't come back up soon,
>     I'll email you the patch.
>
>     Karl
>
>
>     On Sun, May 15, 2016 at 12:11 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>         Hi Silvio,
>
>         This sounds like a problem with the way the Elasticsearch
>         connector is forming JSON.  The spec is silent on control
>         characters:
>
>         http://rfc7159.net/rfc7159#rfc.section.8.1
>
>         ... so we just embed those in strings.  But it sounds like
>         Elasticsearch's JSON parser is not so happy with them.
>
>         If we can find an encoding that satisfies everyone, we can
>         change the code to do what is needed.  Maybe "\0" for null, etc?
>
>         Karl
>
>
>         On Sun, May 15, 2016 at 10:21 AM,
>         <silvio.r.meier@quantentunnel.de> wrote:
>
>             Hi Apache ManifoldCF user list,
>
>             I’m experimenting with Apache ManifoldCF 2.3 to index our
>             company’s Windows network shares. The setup is
>             Elasticsearch 1.7.4 and Apache ManifoldCF 2.3, with MS
>             Active Directory as the authority source.
>             I defined a job whose connection configuration comprises
>             the following chain of transformations (the order in the
>             list is the order in which they are applied):
>
>             1.    Repository connection (MS Network Share)
>             2.    Allowed documents
>             3.    Tika extractor
>             4.    Metadata adjuster
>             5.    Elasticsearch
>             I do this because I don’t want to store the original
>             document in the Elasticsearch index, only the
>             extracted text of the document. This works so far.
>             However, numerous documents cause an exception of the
>             following kind when they are analyzed and sent to the
>             indexer by Apache ManifoldCF. Note that the exception
>             occurs in the Elasticsearch analyzer:
>             [2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral the Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed to execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>             source[{"access_permission:extract_for_accessibility" : "true",
>             "dcterms:created" : "2016-03-02T13:03:47Z",
>             "access_permission:can_modify" : "true",
>             "access_permission:modify_annotations" : "true",
>             "Creation-Date" : "2016-03-02T13:03:47Z",
>             "fileLastModified" : "2016-03-02T13:03:37.433Z",
>             "access_permission:fill_in_form" : "true",
>             "created" : "Wed Mar 02 14:03:47 CET 2016",
>             "stream_size" : "52067",
>             "dc:format" : "application\/pdf; version=1.4",
>             "access_permission:can_print" : "true",
>             "stream_name" : "MäuseTastaturen 2.3.16 - Kopie.pdf",
>             "xmp:CreatorTool" : "Canon iR-ADV C5250 PDF",
>             "resourceName" : "MäuseTastaturen 2.3.16 - Kopie.pdf",
>             "fileCreatedOn" : "2016-03-16T21:22:24.085Z",
>             "access_permission:assemble_document" : "true",
>             "meta:creation-date" : "2016-03-02T13:03:47Z",
>             "lastModified" : "Wed Mar 02 14:03:37 CET 2016",
>             "pdf:PDFVersion" : "1.4",
>             "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
>             "shareName" : "AppDevData$",
>             "access_permission:can_print_degraded" : "true",
>             "xmpTPg:NPages" : "1",
>             "createdOn" : "Wed Mar 16 22:22:24 CET 2016",
>             "pdf:encrypted" : "false",
>             "access_permission:extract_content" : "true",
>             "producer" : "Adobe PSL 1.2e for Canon ",
>             "attributes" : "32",
>             "Content-Type" : "application\/pdf",
>             "allow_token_document" : ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],
>             "deny_token_document" : "LDAPConn:DEAD_AUTHORITY",
>             "allow_token_share" : "__nosecurity__",
>             "deny_token_share" : "__nosecurity__",
>             "allow_token_parent" : "__nosecurity__",
>             "deny_token_parent" : "__nosecurity__",
>             "content" : ""}]}]
>             org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>                     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>                     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>                     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>                     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>                     at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>                     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>                     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>                     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>                     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>                     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>                     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>                     at java.lang.Thread.run(Thread.java:745)
>             Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>                     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>                     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>                     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>                     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>                     ... 11 more
>             Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>              at [Source: [B@5b774e8b; line: 1, column: 1145]
>                     at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>                     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>                     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>                     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>                     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>                     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>                     at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>                     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>                     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>                     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>                     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>                     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>                     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>                     ... 14 more
>             This happens for documents of various types and
>             extensions, such as PDF and XLSX. It seems that Tika
>             sometimes does not remove special characters such as the
>             null character (0x0000). Because special characters must
>             be escaped when handed over in a JSON request, their
>             presence causes Elasticsearch to reject the request, so
>             the document is not indexed at all. Is there a way to
>             work around the problem with the existing functionality
>             of Apache ManifoldCF?
>             Regards
>             Silvio
>
>
>
>
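
The encoding question raised in this thread can be illustrated with a small sketch. RFC 7159, section 7, requires the quotation mark, the backslash, and the control characters U+0000 through U+001F to be escaped inside JSON strings. The string grammar has no `\0` form, so a NUL has to be written as the six-character escape `\u0000`. The Java class below is a hypothetical helper: the names `JsonStringEscaper` and `quote` are assumptions made for illustration. It is neither the patch attached to this thread nor ManifoldCF code, only one way a connector might quote a field value before sending it.

```java
/**
 * Hypothetical helper (not ManifoldCF code): turns a Java string into a
 * quoted JSON string literal, escaping the characters RFC 7159 section 7
 * says must be escaped.
 */
public final class JsonStringEscaper {
    private JsonStringEscaper() {}

    public static String quote(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                // Characters with a short two-character escape.
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters, including NUL,
                        // use the four-hex-digit Unicode escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        // A value containing a NUL, like the extracted text in the log above.
        String extracted = "text" + (char) 0 + "with NUL";
        System.out.println(quote(extracted));
    }
}
```

Whether Elasticsearch then indexes an escaped NUL in a useful way is a separate question. Under the same assumptions, dropping control characters from the extracted text before it reaches the output connector would be an alternative to escaping them.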
