Mailing-List: contact user-help@manifoldcf.apache.org; run by ezmlm
Reply-To: user@manifoldcf.apache.org
MIME-Version: 1.0
From: Dileepa Jayakody
Date: Fri, 6 Oct 2017 18:09:49 +0530
Subject: Re: How to extract text content and index in elastic-search
To: user@manifoldcf.apache.org
Content-Type: multipart/alternative; boundary="001a11455ee0fee9b5055ae02245"
archived-at: Fri, 06 Oct 2017 12:39:55 -0000

--001a11455ee0fee9b5055ae02245
Content-Type: text/plain; charset="UTF-8"

Guys, I'm using the latest 2.8.1 release.

Thanks

On Fri, Oct 6, 2017 at 6:05 PM, Dileepa Jayakody wrote:

> Hi All,
>
> I'm trying out a small demo, with a file system repository connector and
> an Elasticsearch output connector, to extract spreadsheet documents and
> index them. I've also added the Tika transform connector to the job.
>
> When I run the job, the documents get indexed in Elasticsearch, but the
> content is indexed as binary.
>
> See below for the indexed content in ES. How can I extract the spreadsheet
> content in text format here?
> Even for a text file, I see the content indexed as binary.
> Is there a configuration I need to set to get the text content extracted
> and indexed in ES?
>
> {
>         "_index": "test",
>         "_type": "generictype",
>         "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
>         "_score": 1,
>         "_source": {
>           "stream_size": "101613",
>           "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
>           "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
>           "protected": "false",
>           "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
>           "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
>           "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>           "content_type": "application/vnd.openxmlformats-officedocument.
> spreadsheetml.sheet",
>           "allow_token_document": "__nosecurity__",
>           "deny_token_document": "__nosecurity__",
>           "allow_token_share": "__nosecurity__",
>           "deny_token_share": "__nosecurity__",
>           "allow_token_parent": "__nosecurity__",
>           "deny_token_parent": "__nosecurity__",
>           "file": {
>             "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>             "_name": "MI - Project2 - Estimation v1.0.xlsx",
>             "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
>         }
>       }
>     ]
>   }
> }
>
> Thanks,
> Dileepa
>

--001a11455ee0fee9b5055ae02245--
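
One observation on the `_content` value in the quoted document, offered as a rough check rather than a diagnosis: the string looks like Base64. The Python sketch below decodes only the visible start of it (the first 96 characters, since the post elides the rest with "....."); the variable names are illustrative and not part of ManifoldCF or Elasticsearch.

```python
import base64

# Visible prefix of the "_content" value quoted in the message. The post
# elides the remainder, so only the first 96 characters (a whole number of
# 4-character Base64 groups) are used here.
content_prefix = (
    "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMg"
    "YW5kIHNjb3BlCUFkZGl0aW9uYWwg"
)

# Decode the Base64 text to bytes, then interpret the bytes as UTF-8.
decoded = base64.b64decode(content_prefix).decode("utf-8")

# repr() makes the tab and newline characters visible.
print(repr(decoded))
# 'Development Estimates\n\tSection\tFeature\tAssumptions and scope\tAdditional '
```

The decoded prefix reads as tab-separated spreadsheet text, which suggests the text was extracted and then stored Base64-encoded in the `file` field. Whether that comes from the output connector's behaviour or from the index mapping is not something this check can show.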