manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: How to extract text content and index in elastic-search
Date Fri, 06 Oct 2017 12:47:22 GMT
Hi Dileepa,

MCF passes content through its processing chain as binary.  It's up to the
output connection configuration to decide if the output should be rendered
as text or binary, and it is there that a different decision would need to
be made.

IIRC there's a flag you can set that chooses between binary indexing (using
the mapper attachments plugin) and text indexing (which doesn't).  But I don't know
enough about ES to know whether this works properly with later versions of
ES, since ES is infamous for not maintaining backwards compatibility
between releases.  Can anyone else answer this question?
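
As a quick check on what is actually being sent, the "_content" value in the quoted document below can be decoded by hand. The following is only a minimal sketch: it uses the leading 96 characters of that value with the e-mail line wraps removed (the rest is elided in the original message), and the names CONTENT_PREFIX and decode_content are made up for illustration.

```python
import base64

# Leading portion of the "_content" value from the quoted document, with the
# e-mail line wraps removed. Only the first 96 characters are used because the
# remainder of the value is elided ("....") in the original message.
CONTENT_PREFIX = (
    "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJ"
    "QXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwg"
)


def decode_content(b64_text: str) -> str:
    """Decode a base64 string and interpret the resulting bytes as UTF-8."""
    return base64.b64decode(b64_text, validate=True).decode("utf-8")


decoded = decode_content(CONTENT_PREFIX)

# repr() makes the tab and newline separators visible.
print(repr(decoded))
```

The decoded prefix is tab-separated text starting with "Development Estimates", which suggests the Tika step did extract the spreadsheet text and that only the base64 wrapping applied on output makes it look binary in ES.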

Karl


On Fri, Oct 6, 2017 at 8:39 AM, Dileepa Jayakody <dileepajayakody@gmail.com>
wrote:

> Guys, I'm using the latest 2.8.1 release.
>
> Thanks
>
> On Fri, Oct 6, 2017 at 6:05 PM, Dileepa Jayakody <
> dileepajayakody@gmail.com> wrote:
>
>> Hi All,
>>
>> I'm trying out a small demo with a file system repository connector and an
>> Elasticsearch output connector, to extract and index spreadsheet documents.
>> I've also added a Tika transformation connector to the job.
>>
>> When I run the job, the documents get indexed in Elasticsearch, but the
>> content is being indexed as binary.
>>
>> See the indexed content in ES below. How can I extract the spreadsheet
>> content as text here?
>> Even for a plain text file, the content is being indexed as binary.
>> Is there a configuration setting I need to change to get the text content
>> extracted and indexed in ES?
>>
>> {
>>         "_index": "test",
>>         "_type": "generictype",
>>         "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
>>         "_score": 1,
>>         "_source": {
>>           "stream_size": "101613",
>>           "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
>>           "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
>>           "protected": "false",
>>           "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
>>           "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
>>           "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>>           "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>>           "allow_token_document": "__nosecurity__",
>>           "deny_token_document": "__nosecurity__",
>>           "allow_token_share": "__nosecurity__",
>>           "deny_token_share": "__nosecurity__",
>>           "allow_token_parent": "__nosecurity__",
>>           "deny_token_parent": "__nosecurity__",
>>           "file": {
>>             "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
>>             "_name": "MI - Project2 - Estimation v1.0.xlsx",
>>             "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
>>         }
>>       }
>>     ]
>>   }
>> }
>>
>> Thanks,
>> Dileepa
>>
>
>
