Subject: Re: Question regarding Tika text extraction and elasticsearch
From: Karl Wright <daddywri@gmail.com>
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Date: Sun, 15 May 2016 12:27:22 -0400

Apparently there is a way you are allowed to encode this, and I have a patch,
but JIRA is down.  If it doesn't come back up soon I'll email you the patch.

Karl

On Sun, May 15, 2016 at 12:11 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Silvio,
>
> This sounds like a problem with the way the Elasticsearch connector is
> forming JSON.  The spec is silent on control characters:
>
> http://rfc7159.net/rfc7159#rfc.section.8.1
>
> ... so we just embed those in strings.  But it sounds like Elasticsearch's
> JSON parser is not so happy with them.
>
> If we can find an encoding that satisfies everyone, we can change the code
> to do what is needed.  Maybe "\0" for null, etc?
>
> Karl
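A minimal sketch of the kind of escaping discussed above, assuming the
connector were changed to emit \u00XX escapes for control characters when it
writes JSON string values.  The JsonEscape class below is hypothetical and is
not the actual patch:

    // Hypothetical helper: escape a value so that strict JSON parsers
    // (such as the one Elasticsearch uses) accept it inside a quoted string.
    public final class JsonEscape {
      public static String escapeJsonString(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
          char c = value.charAt(i);
          switch (c) {
            case '"':  sb.append("\\\""); break;
            case '\\': sb.append("\\\\"); break;
            case '\b': sb.append("\\b");  break;
            case '\f': sb.append("\\f");  break;
            case '\n': sb.append("\\n");  break;
            case '\r': sb.append("\\r");  break;
            case '\t': sb.append("\\t");  break;
            default:
              if (c < 0x20) {
                // Covers NUL and the other control characters the parser rejects:
                // emit a six-character backslash-u escape instead of the raw byte.
                sb.append(String.format("\\u%04x", (int) c));
              } else {
                sb.append(c);
              }
          }
        }
        return sb.toString();
      }
    }

Escaping keeps the original characters recoverable on the Elasticsearch side;
simply stripping them (see the sketch at the end of the thread) discards them.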
> On Sun, May 15, 2016 at 10:21 AM, <silvio.r.meier@quantentunnel.de> wrote:
>
>> Hi Apache ManifoldCF user list,
>>
>> I'm experimenting with Apache ManifoldCF 2.3, which I use to index our
>> company's Windows network shares. I'm using Elasticsearch 1.7.4 and
>> Apache ManifoldCF 2.3 with MS Active Directory as the authority source.
>> I defined a job whose connection configuration comprises the following
>> chain of transformations (the order in the list is the order in which
>> they are applied):
>>
>> 1. Repository connection (MS Network Share)
>> 2. Allowed documents
>> 3. Tika extractor
>> 4. Metadata adjuster
>> 5. Elasticsearch
>>
>> I do this because I don't want to store the original document inside the
>> Elasticsearch index, only the extracted text of the document. This works
>> so far. However, there are numerous documents which cause an exception of
>> the following kind when they are analyzed and sent to the indexer by
>> Apache ManifoldCF. Note that the exception happens in the Elasticsearch
>> analyzer:
>>
>> [2016-03-16 22:22:43,884][DEBUG][action.index             ] [Tefral the Surveyor]
>> [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed to execute
>> [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>> source[{"access_permission:extract_for_accessibility" : "true","dcterms:created" : "2016-03-02T13:03:47Z",
>> "access_permission:can_modify" : "true","access_permission:modify_annotations" : "true",
>> "Creation-Date" : "2016-03-02T13:03:47Z","fileLastModified" : "2016-03-02T13:03:37.433Z",
>> "access_permission:fill_in_form" : "true","created" : "Wed Mar 02 14:03:47 CET 2016",
>> "stream_size" : "52067","dc:format" : "application\/pdf; version=1.4",
>> "access_permission:can_print" : "true","stream_name" : "M├ñuseTastaturen 2.3.16 - Kopie.pdf",
>> "xmp:CreatorTool" : "Canon iR-ADV C5250 PDF","resourceName" : "M├ñuseTastaturen 2.3.16 - Kopie.pdf",
>> "fileCreatedOn" : "2016-03-16T21:22:24.085Z","access_permission:assemble_document" : "true",
>> "meta:creation-date" : "2016-03-02T13:03:47Z","lastModified" : "Wed Mar 02 14:03:37 CET 2016",
>> "pdf:PDFVersion" : "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
>> "shareName" : "AppDevData$","access_permission:can_print_degraded" : "true",
>> "xmpTPg:NPages" : "1","createdOn" : "Wed Mar 16 22:22:24 CET 2016","pdf:encrypted" : "false",
>> "access_permission:extract_content" : "true","producer" : "Adobe PSL 1.2e for Canon ",
>> "attributes" : "32","Content-Type" : "application\/pdf",
>> "allow_token_document" : ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],
>> "deny_token_document" : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "__nosecurity__",
>> "deny_token_share" : "__nosecurity__","allow_token_parent" : "__nosecurity__",
>> "deny_token_parent" : "__nosecurity__","content" : ""}]}]
>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>     at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>     ... 11 more
>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>     at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>     at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>     ... 14 more
>>
>> This happens for documents of different types/extensions, such as PDF as
>> well as XLSX, etc. It seems that Tika sometimes does not remove special
>> characters such as the null character 0x0000. The presence of these
>> special characters causes Elasticsearch to refuse to index the document,
>> so the document is not indexed at all, because special characters need to
>> be escaped when they are handed over in a JSON request. Is there a way to
>> work around the problem with the existing functionality of Apache
>> ManifoldCF?
>>
>> Regards
>> Silvio
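For readers hitting the same issue before a connector fix is available, a
minimal illustrative workaround outside ManifoldCF is to sanitize the
extracted text before it is sent to Elasticsearch, for example by dropping
every control character other than tab, carriage return, and line feed.  The
ControlCharStripper class below is a hypothetical sketch, not a ManifoldCF or
Tika API:

    // Hypothetical cleanup step: drop the control characters that strict JSON
    // parsers (such as the one Elasticsearch embeds) reject in string values.
    public final class ControlCharStripper {
      // Keep tab, LF and CR; remove every other character below U+0020.
      public static String strip(String extractedText) {
        StringBuilder sb = new StringBuilder(extractedText.length());
        for (int i = 0; i < extractedText.length(); i++) {
          char c = extractedText.charAt(i);
          if (c >= 0x20 || c == '\t' || c == '\n' || c == '\r') {
            sb.append(c);
          }
        }
        return sb.toString();
      }
    }

Stripping discards the offending characters entirely, whereas the escaping
sketch shown earlier in the thread preserves them; for plain search content a
NUL byte usually carries no meaning, so either approach unblocks indexing.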