manifoldcf-user mailing list archives

From Dileepa Jayakody <dileepajayak...@gmail.com>
Subject Re: How to extract text content and index in elastic-search
Date Fri, 06 Oct 2017 12:39:49 GMT
Guys, I'm using the latest 2.8.1 release.

Thanks

On Fri, Oct 6, 2017 at 6:05 PM, Dileepa Jayakody <dileepajayakody@gmail.com>
wrote:

> Hi All,
>
> I'm trying out a small demo with a file system repository connector and
> an Elasticsearch output connector to extract spreadsheet documents and
> index them. I've also added a Tika transformation connector to the job.
>
> When I run the job, the documents get indexed in Elasticsearch, but the
> content is indexed as binary.
>
> See the indexed content in ES below. How can I get the spreadsheet
> content extracted as text here?
> Even for a text file, the content is indexed as binary.
> Is there some configuration I need to get the text content extracted
> and indexed in ES?
>
> {
>         "_index": "test",
>         "_type": "generictype",
>         "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
>         "_score": 1,
>         "_source": {
>           "stream_size": "101613",
>           "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
>           "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
>           "protected": "false",
>           "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
>           "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
>           "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>           "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>           "allow_token_document": "__nosecurity__",
>           "deny_token_document": "__nosecurity__",
>           "allow_token_share": "__nosecurity__",
>           "deny_token_share": "__nosecurity__",
>           "allow_token_parent": "__nosecurity__",
>           "deny_token_parent": "__nosecurity__",
>           "file": {
>             "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>             "_name": "MI - Project2 - Estimation v1.0.xlsx",
>             "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
>         }
>       }
>     ]
>   }
> }
>
> Thanks,
> Dileepa
>
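
A note on the "_content" value quoted above: it looks like base64 rather than raw binary. Decoding the visible prefix gives readable, tab-separated text, which suggests the text was extracted and then encoded before indexing. Why the value is encoded is not established in this thread. A minimal sketch to check this, where decode_content is a hypothetical helper and only the part of the string before the damaged, elided tail is used:

```python
import base64

# Leading portion of the "_content" value from the hit above, with the
# e-mail line wraps removed. The rest of the value is elided in the
# original message, so only this prefix is decoded.
CONTENT_PREFIX = (
    "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJ"
    "QXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwg"
)


def decode_content(value: str) -> str:
    """Decode a base64 string to UTF-8 text (hypothetical helper)."""
    return base64.b64decode(value).decode("utf-8")


if __name__ == "__main__":
    # repr() makes the tab and newline characters visible.
    print(repr(decode_content(CONTENT_PREFIX)))
```

The decoded prefix starts with "Development Estimates", followed by what look like column headings (Section, Feature, Assumptions and scope), so the spreadsheet text itself appears to be present in the indexed document.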
