manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Fetching output Elastic Search data in pipelines
Date Thu, 15 Mar 2018 10:34:29 GMT
I'm afraid you're going to have to debug this.

Can you figure out what is being sent to Elastic Search that it does not
like?  If you check the "use mapper attachment" box, the transmission of
the file content should include only base64 content.  Can you verify this
is happening?

The problem with Elastic Search support is that it changes quite often, and
in ways that are not backwards compatible.  The MCF team cannot keep up
with it.  This may be a case where it has changed and we need to do
something different.

The code in MCF that does the Base64 encoding is in ElasticSearchIndex:

>>>>>>
        if (useMapperAttachments && inputStream != null) {
          if(needComma){
            pw.print(",");
          }
          // I'm told this is not necessary: see CONNECTORS-690
          //pw.print("\"type\" : \"attachment\",");
          pw.print("\"file\" : {");
          String contentType = document.getMimeType();
          if (contentType != null)
            pw.print("\"_content_type\" : "+jsonStringEscape(contentType)+",");
          String fileName = document.getFileName();
          if (fileName != null)
            pw.print("\"_name\" : "+jsonStringEscape(fileName)+",");
          // Since ES 1.0
          pw.print(" \"_content\" : \"");
          Base64 base64 = new Base64();
          base64.encodeStream(inputStream, pw);
          pw.print("\"}");
        }
<<<<<<

Note that the Base64 is included in the JSON, and is quoted on both sides.
When you select "use mapper attachment", this is what should be sent to ES.
Is it?  If it is, why isn't ES accepting it?
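
One way to check is to capture the request body and scan the "_content"
value for characters outside the Base64 alphabet.  The "3f" in the error
quoted below is hex for '?', which may mean the content went through a
character conversion somewhere instead of being sent as pure Base64.  Here
is a minimal sketch of such a check.  It is not MCF code; the class and
method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of ManifoldCF: scans a captured "_content"
// value for characters that are not in the standard Base64 alphabet.
final class Base64PayloadCheck {

  /** Returns the index of the first non-Base64 character, or -1 if none. */
  static int firstIllegalIndex(String content) {
    for (int i = 0; i < content.length(); i++) {
      char c = content.charAt(i);
      // Alphabet check only; it does not validate padding placement.
      boolean legal = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
      if (!legal) {
        return i;
      }
    }
    return -1;
  }

  /** Human-readable summary of the scan, with the offending char in hex. */
  static String describe(String content) {
    int i = firstIllegalIndex(content);
    if (i < 0) {
      return "ok";
    }
    return "illegal character 0x" + Integer.toHexString(content.charAt(i))
        + " at index " + i;
  }

  public static void main(String[] args) {
    String good = Base64.getEncoder()
        .encodeToString("hello".getBytes(StandardCharsets.UTF_8));
    System.out.println(describe(good));        // clean Base64
    System.out.println(describe("aGVs?G8="));  // contains '?' (0x3f)
  }
}
```

If the scan reports a '?' in what MCF actually sent, the problem is on our
side; if the content is clean, the question becomes what ES is doing with it.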

Karl


On Thu, Mar 15, 2018 at 5:04 AM, Nikita Ahuja <nikita@smartshore.nl> wrote:

> Hi Karl,
>
> There is still a problem with the mapper attachment in the Elastic
> connector, even with the box checked. The same error still occurs.
>
>
>
>
>
> Please suggest a way out.
>
> Thanks and Regards,
> Nikita
>
>
>
> On Wed, Mar 7, 2018 at 6:32 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Nikita,
>>
>> You have not selected the "use mapper attachment" checkbox in the
>> configuration for the ES output connector.  But you are using it in Elastic
>> Search.  The ES output connector will not convert binary to base64 unless
>> you check that box.
>>
>> Karl
>>
>>
>> On Wed, Mar 7, 2018 at 6:18 AM, Nikita Ahuja <nikita@smartshore.nl>
>> wrote:
>>
>>> Hi Karl,
>>>
>>>
>>> This is not only for SharePoint; it is the same for FileShare,
>>> SharePoint, and the Web crawler.
>>>
>>> For the Elastic Search output, the following parameters are defined.
>>>
>>>
>>>
>>>
>>> In the Simple History tab, the following errors appear.
>>>
>>>
>>>
>>> A server exception like this occurs every time indexing is attempted:
>>>
>>>
>>> *Server exception:
>>> {"error":{"root_cause":[{"type":"exception","reason":"java.lang.IllegalArgumentException:
>>> java.lang.IllegalArgumentException: Illegal base64 character
>>> 3f","header":{"processor_type":"attachment"}}],"type":"exception","reason":"java.lang.IllegalArgumentException:
>>> java.lang.IllegalArgumentException: Illegal base64 character
>>> 3f","caused_by":{"type":"illegal_argument_exception","reason":"java.lang.IllegalArgumentException:
>>> Illegal base64 character
>>> 3f","caused_by":{"type":"illegal_argument_exception","reason":"Illegal
>>> base64 character
>>> 3f"}},"header":{"processor_type":"attachment"}},"status":500} *
>>>
>>>
>>>
>>> But if we don't define any value in the pipeline tab, documents go
>>> directly into the index. There is some problem with the code. I need to
>>> use different pipelines in the same index, for example "web" for Website
>>> and "file" for FileShare.
>>>
>>>
>>> Thanks and Regards,
>>> Nikita
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Mar 7, 2018 at 2:45 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Nikita,
>>>>
>>>> The downstream pipeline for a connector determines which mime types are
>>>> indexed and which are rejected.  If you look in the Simple History report
>>>> for one of the rejected SharePoint documents, there should be information
>>>> recorded about why it was rejected.  If there are no non-image documents at
>>>> all described from SharePoint, then the issue would have to be how the
>>>> SharePoint repository connection in the job is specified.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Wed, Mar 7, 2018 at 2:29 AM, Nikita Ahuja <nikita@smartshore.nl>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>>
>>>>> I am trying to ingest data from a website and SharePoint into Elastic
>>>>> Search output, using different pipelines in the same index.
>>>>>
>>>>> But ManifoldCF is not able to ingest all the data. It only puts the
>>>>> image files present in the source into the ElasticSearch output.
>>>>>
>>>>> Is there anything which is being missed?
>>>>>
>>>>>
>>>>> Please guide for a solution.
>>>>>
>>>>> Thanks and Regards,
>>>>> Nikita
>>>>>
>>>>
>>>>
>>>
>>
>
