manifoldcf-user mailing list archives

From Juan Pablo Diaz-Vaz <jpdiaz...@mcplusa.com>
Subject Re: Amazon CloudSearch Connector question
Date Mon, 08 Feb 2016 22:17:20 GMT
Thanks! I'll apply it and let you know how it goes.

On Mon, Feb 8, 2016 at 6:51 PM, Karl Wright <daddywri@gmail.com> wrote:

> Ok, I have a patch.  It's actually pretty tiny; the bug is in our code,
> not Commons-IO, but a change in Commons-IO exposed it.
>
> I've created a ticket (CONNECTORS-1271) and attached the patch to it.
>
> Thanks!
> Karl
>
>
> On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I have chased this down to a completely broken Apache Commons-IO
>> library.  It no longer works with the JSONReader objects in ManifoldCF at
>> all, and refuses to read anything from them.  Unfortunately I can't change
>> versions of that library because other things depend upon it. So I'll need
>> to write my own code to replace its functionality.  That will take some
>> amount of time to do.
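Copy utilities in the style of Commons-IO's IOUtils.copy depend on the java.io.Reader contract: read(char[], int, int) returns the number of characters copied, and -1 exactly once the stream is exhausted. The sketch below is not the actual ManifoldCF JSONReader code, just a minimal, self-contained illustration of a Reader that honors that contract and of the copy loop that silently misbehaves (reads nothing, or spins) when the contract is violated:

```java
import java.io.*;

public class ReaderContractSketch {
    // A minimal Reader over a String that honors the java.io.Reader
    // contract: read(char[],off,len) returns the number of chars copied,
    // and -1 exactly once the data is exhausted.
    static class StringChunkReader extends Reader {
        private final String data;
        private int pos = 0;
        StringChunkReader(String data) { this.data = data; }
        @Override public int read(char[] buf, int off, int len) {
            if (pos >= data.length()) return -1;   // EOF must be -1, never 0
            int n = Math.min(len, data.length() - pos);
            data.getChars(pos, pos + n, buf, off);
            pos += n;
            return n;
        }
        @Override public void close() {}
    }

    // A copy loop in the style of IOUtils.copy: it terminates only on -1,
    // so a source that signals EOF any other way breaks it.
    static String drain(Reader r) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[8];
        int n;
        try {
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints {"id":"doc1"}
        System.out.println(drain(new StringChunkReader("{\"id\":\"doc1\"}")));
    }
}
```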
>>
>> This probably happened the last time our dependencies were updated.  My
>> apologies.
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> Thanks,
>>>
>>> Don't know if it'll help, but removing the usage of JSONObjectReader on
>>> addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk
>>> instead of using the JSONArrayReader on flushDocuments, changed the error I
>>> was getting from Amazon.
>>>
>>> Maybe those objects are failing on parsing the content to JSON.
>>>
>>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>>> through.  I'll have to open a ticket and create a patch when I find the
>>>> problem.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> Thank you very much.
>>>>>
>>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>>> unhappy about the JSON format we are sending it.  The deprecation message
>>>>>> is probably a strong clue.  I'll experiment here with logging document
>>>>>> contents so that I can give you further advice.  Stay tuned.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>
>>>>>>> I'm actually not seeing anything on Amazon. The CloudSearch
>>>>>>> connector fails when sending the request to amazon cloudsearch:
>>>>>>>
>>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>>>>> file\"] }", "deletes": 0}'
>>>>>>>
>>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>>
>>>>>>>
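For context on that error: the CloudSearch document endpoint expects an upload body containing a JSON array of add/delete operations, so an empty or truncated body is a natural way to get "Encountered unexpected end of file". The sketch below hand-builds a minimal one-document "add" batch in that documented array shape; the field names (title, content) are illustrative and not taken from the connector, and real code would use a JSON library rather than string concatenation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SdfBatchSketch {
    // Minimal JSON string escaping for this sketch (quotes and backslashes only).
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build a one-document "add" batch: a JSON array holding one
    // {"type":"add","id":...,"fields":{...}} operation.
    static String addBatch(String id, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("[{\"type\":\"add\",\"id\":\"")
            .append(esc(id)).append("\",\"fields\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(esc(e.getKey())).append("\":\"")
              .append(esc(e.getValue())).append('"');
        }
        return sb.append("}}]").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Hello");
        fields.put("content", "Body text");
        // prints [{"type":"add","id":"doc1","fields":{"title":"Hello","content":"Body text"}}]
        System.out.println(addBatch("doc1", fields));
    }
}
```

If the chunk table really does hold empty records, as suggested below, the connector would be posting an empty body, which matches the error above.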
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>>>>> the Amazon end, that would be great.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> More likely this is a bug.
>>>>>>>>>
>>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>>>>> seeing the expected content being posted to SOLR, so it may not be an issue
>>>>>>>>>> with TIKA. After adding some more logging to the CloudSearch connector, I
>>>>>>>>>> think the data is getting lost just before passing it to the
>>>>>>>>>> DocumentChunkManager, which inserts the empty records to the DB. Could it
>>>>>>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Juan,
>>>>>>>>>>>
>>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>>>>> solr output connection.  If you include the tika extractor in the pipeline,
>>>>>>>>>>> you will want to configure the solr connection to not use the extracting
>>>>>>>>>>> update handler.  There's a checkbox on the Schema tab you need to uncheck
>>>>>>>>>>> for that.  But if you do that you can see what is being sent to Solr pretty
>>>>>>>>>>> exactly; it all gets logged in the INFO messages dumped to the solr log.  This
>>>>>>>>>>> should help you figure out if the problem is your tika configuration or not.
>>>>>>>>>>>
>>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being sent to my
>>>>>>>>>>>> domain. I think this may be an issue with how I've set up the TIKA Extractor
>>>>>>>>>>>> Transformation or the field mapping. I think the Database where the records
>>>>>>>>>>>> are supposed to be stored before flushing to Amazon is storing empty
>>>>>>>>>>>> content.
>>>>>>>>>>>>
>>>>>>>>>>>> I've tried to find documentation on how to set up the TIKA
>>>>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>>>>
>>>>>>>>>>>> If someone could provide an example of a job setup to send from
>>>>>>>>>>>> a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>> +56 9 84265890
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>> +56 9 84265890
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>> Full Stack Developer - MC+A Chile
>>>>> +56 9 84265890
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890
