manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question regarding Tika text extraction and elasticsearch
Date Sun, 15 May 2016 16:37:09 GMT
Here's the patch.  Relatively short.
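
The patch itself is not reproduced in this archive. As a rough, hypothetical sketch of the approach under discussion (escaping JSON control characters U+0000 through U+001F as \uXXXX, which RFC 7159 requires for string values; the class and method names below are illustrative, not ManifoldCF's actual code):

```java
// Hypothetical sketch, not the real ManifoldCF patch: escape the control
// characters U+0000..U+001F as \uXXXX so that strict JSON parsers (such as
// the Jackson parser embedded in Elasticsearch) accept the string value.
public class JsonControlEscaper {

    public static String escapeControlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20) {
                // Emit a six-character \uXXXX escape for every control char,
                // including NUL; JSON has no "\0" escape.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "text with a NUL" followed by the escaped control character
        System.out.println(escapeControlChars("text with a NUL \u0000 inside"));
    }
}
```

Note that JSON defines short escapes (\b, \f, \n, \r, \t) for a few common control characters, but the generic \uXXXX form covers all of them, including the NUL byte that triggered the Jackson error below.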

Karl


On Sun, May 15, 2016 at 12:27 PM, Karl Wright <daddywri@gmail.com> wrote:

> Apparently there is a way you are allowed to encode this, and I have a
> patch, but JIRA is down.  If it doesn't come back up soon I'll email you
> the patch.
>
> Karl
>
>
> On Sun, May 15, 2016 at 12:11 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Silvio,
>>
>> This sounds like a problem with the way the Elastic Search connector is
>> forming JSON.  The spec is silent on control characters:
>>
>> http://rfc7159.net/rfc7159#rfc.section.8.1
>>
>> ... so we just embed those in strings.  But it sounds like
>> ElasticSearch's JSON parser is not so happy with them.
>>
>> If we can find an encoding that satisfies everyone, we can change the
>> code to do what is needed.  Maybe "\0" for null, etc?
>>
>> Karl
>>
>>
>> On Sun, May 15, 2016 at 10:21 AM, <silvio.r.meier@quantentunnel.de>
>> wrote:
>>
>>> Hi Apache ManifoldCF user list
>>>
>>> I’m experimenting with Apache ManifoldCF 2.3, which I use to index the
>>> Windows network shares of our company. I’m using Elasticsearch 1.7.4 and
>>> Apache ManifoldCF 2.3 with MS Active Directory as the authority source.
>>> I defined a job whose connection configuration comprises the following
>>> chain of transformations (the list order is the order in which the
>>> transformations are applied):
>>>
>>> 1.    Repository connection (MS Network Share)
>>> 2.    Allowed documents
>>> 3.    Tika extractor
>>> 4.    Metadata adjuster
>>> 5.    Elasticsearch
>>>
>>> I do this because I don’t want to store the original document inside the
>>> elasticsearch index, only the extracted text of the document. This works
>>> so far. However, there are numerous documents which cause an exception of
>>> the following kind when being analyzed and sent to the indexer by Apache
>>> ManifoldCF. Note that the exception happens in the Elasticsearch analyzer:
>>>
>>> [2016-03-16 22:22:43,884][DEBUG][action.index             ] [Tefral the
>>> Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]:
>>> Failed to execute [index {[sharein
>>> dex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>>> source[{"access_permission:extract_for_access
>>> ibility" : "true","dcterms:created" :
>>> "2016-03-02T13:03:47Z","access_permission:can_modify" :
>>> "true","access_permission:modify_annotations" : "true","Creation-Date" :
>>> "2016-03-02T1
>>> 3:03:47Z","fileLastModified" :
>>> "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" :
>>> "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" :
>>> "52067","dc:format" :
>>>  "application\/pdf; version=1.4","access_permission:can_print" :
>>> "true","stream_name" : "MäuseTastaturen 2.3.16 -
>>> Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250  PDF","resourc
>>> eName" : "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" :
>>> "2016-03-16T21:22:24.085Z","access_permission:assemble_document" :
>>> "true","meta:creation-date" : "2016-03-02T13:03:
>>> 47Z","lastModified" : "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" :
>>> "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" :
>>> "AppDevData$","access_permission:
>>> can_print_degraded" : "true","xmpTPg:NPages" : "1","createdOn" : "Wed
>>> Mar 16 22:22:24 CET 2016","pdf:encrypted" :
>>> "false","access_permission:extract_content" : "true","producer" :
>>> "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" :
>>> "applica-tion\/pdf","allow_token_document" :
>>> ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S
>>> -1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document"
>>> : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "
>>> __nosecurity__","deny_token_share" :
>>> "__nosecurity__","allow_token_parent" :
>>> "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse
>>> [_source]
>>>         at
>>> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>>         at
>>> org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>>         at
>>> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>>         at
>>> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>>         at
>>> org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>>         at
>>> org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>>         at
>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>>         at
>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>>         at
>>> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>         at java.lang.Thread.run(Thread.java:745)
>>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to
>>> parse content to map
>>>         at
>>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>>         at
>>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>>         at
>>> org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>>         at
>>> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>>         ... 11 more
>>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException:
>>> Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using
>>> backslash to be included in string va
>>> lue
>>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>>         at
>>> org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>>         at
>>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>>         at
>>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>>         at
>>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>>         at
>>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>>         at
>>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>>         at
>>> org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>>         at
>>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>>         ... 14 more
>>>
>>> This happens for documents of different types/extensions, such as pdfs as
>>> well as xlsx, etc. It seems that Tika sometimes does not remove special
>>> characters such as the null character 0x0000. The presence of these
>>> special characters causes Elasticsearch to reject the document, so the
>>> document is not indexed at all, because such characters need to be
>>> escaped when handed over in a JSON request. Is there a way to work around
>>> the problem with the existing functionality of Apache ManifoldCF?
>>>
>>> Regards
>>> Silvio
>>>
>>>
>>
>>
>
