manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Amazon CloudSearch Connector question
Date Mon, 08 Feb 2016 21:51:41 GMT
Ok, I have a patch.  It's actually pretty tiny; the bug is in our code, not
Commons-IO, but a change in Commons-IO exposed it.

I've created a ticket (CONNECTORS-1271) and attached the patch to it.

Thanks!
Karl


On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <daddywri@gmail.com> wrote:

> I have chased this down to a completely broken Apache Commons-IO library.
> It no longer works with the JSONReader objects in ManifoldCF at all, and
> refuses to read anything from them.  Unfortunately I can't change versions
> of that library because other things depend upon it. So I'll need to write
> my own code to replace its functionality.  That will take some amount of
> time to do.
>
> This probably happened the last time our dependencies were updated.  My
> apologies.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <
> jpdiazvaz@mcplusa.com> wrote:
>
>> Thanks,
>>
>> Don't know if it'll help, but removing the usage of JSONObjectReader in
>> addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk
>> instead of using the JSONArrayReader in flushDocuments changed the error I
>> was getting from Amazon.
>>
>> Maybe those objects are failing on parsing the content to JSON.
>>
>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>> through.  I'll have to open a ticket and create a patch when I find the
>>> problem.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> Thank you very much.
>>>>
>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>> unhappy about the JSON format we are sending it.  The deprecation message
>>>>> is probably a strong clue.  I'll experiment here with logging document
>>>>> contents so that I can give you further advice.  Stay tuned.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>
>>>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>>>>> fails when sending the request to Amazon CloudSearch:
>>>>>>
>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>>>> file\"] }", "deletes": 0}'
>>>>>>
>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>>>> the Amazon end, that would be great.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> More likely this is a bug.
>>>>>>>>
>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>>>> seeing the expected content being posted to SOLR, so it may not be an
>>>>>>>>> issue with TIKA. After adding some more logging to the CloudSearch
>>>>>>>>> connector, I think the data is getting lost just before passing it to
>>>>>>>>> the DocumentChunkManager, which inserts the empty records to the DB.
>>>>>>>>> Could it be that the JSONObjectReader doesn't like my data?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Juan,
>>>>>>>>>>
>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>>>> solr output connection.  If you include the tika extractor in the
>>>>>>>>>> pipeline, you will want to configure the solr connection to not use
>>>>>>>>>> the extracting update handler.  There's a checkbox on the Schema tab
>>>>>>>>>> you need to uncheck for that.  But if you do that you can see what is
>>>>>>>>>> being sent to Solr pretty exactly; it all gets logged in the INFO
>>>>>>>>>> messages dumped to solr log.  This should help you figure out if the
>>>>>>>>>> problem is your tika configuration or not.
>>>>>>>>>>
>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being
>>>>>>>>>>> sent to my domain. I think this may be an issue with how I've set
>>>>>>>>>>> up the TIKA Extractor Transformation or the field mapping. I think
>>>>>>>>>>> the database where the records are supposed to be stored before
>>>>>>>>>>> flushing to Amazon is storing empty content.
>>>>>>>>>>>
>>>>>>>>>>> I've tried to find documentation on how to set up the TIKA
>>>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>>>
>>>>>>>>>>> If someone could provide an example of a job setup to send from
>>>>>>>>>>> a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>> +56 9 84265890
