manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Amazon CloudSearch Connector question
Date Tue, 09 Feb 2016 14:13:40 GMT
This is a puzzle; the only way this could occur is if some of the records
being produced generated absolutely no JSON.  Since every record gets an ID
and a type field, I can't see how that could happen.  Are we somehow adding
records for documents that don't exist?  I'll have to look into it.
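
The ", ," pattern in Juan's payload is consistent with that diagnosis. A minimal sketch (in Python, purely illustrative; the actual connector is Java, and these function names are hypothetical) of how joining per-record JSON with commas leaves empty slots when a record serializes to nothing:

```python
import json

def build_batch(serialized_records):
    # Buggy: every slot is joined, including empty ones, so a record that
    # produced no JSON leaves a bare ", ," in the output array.
    return "[" + ", ".join(serialized_records) + "]"

def build_batch_fixed(serialized_records):
    # Fixed: drop records that produced no JSON before joining.
    return "[" + ", ".join(r for r in serialized_records if r.strip()) + "]"

records = ['{"id": "a", "type": "add"}', "", '{"id": "b", "type": "add"}']
print(build_batch(records))        # contains the ", ," pattern; not valid JSON
print(build_batch_fixed(records))  # parses cleanly
```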

Karl
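
The logged payload quoted below can be checked outside the connector with any strict JSON parser. This trimmed stand-in (record bodies shortened; the ids are placeholders) reproduces the stray empty slots and shows why Amazon reports a parsing error:

```python
import json

# A shortened stand-in for the payload logged in this thread, keeping the
# stray empty slots between records.
payload = '[{"id": "doc1", "type": "add"}, , {"id": "doc2", "type": "add"}]'

try:
    json.loads(payload)
    print("payload parsed")
except json.JSONDecodeError as err:
    # The empty array elements make this invalid JSON, so a strict parser
    # rejects it before any field-level validation happens.
    print("rejected:", err.msg)
```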

On Tue, Feb 9, 2016 at 8:49 AM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com>
wrote:

> Hi,
>
> The patch worked, and now at least the POST has content. Amazon is
> responding with a Parsing Error, though.
>
> I logged the message before it gets posted to Amazon, and it's not valid
> JSON: it has extra commas and parenthesis characters where records are
> concatenated. I don't know whether this is an issue with my setup or with
> the JSONArrayReader.
>
> [{
> "id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
> "type": "add",
> "fields": {
>  <record fields>
> }
> }, , {
> "id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C",
> "type": "add",
> "fields": {
>  <record fields>
> }
> }, , , , , , , , , , , , , , , , {
> "id": "92C7EDAD8398DAC797A7DEA345C1859E6E9897FB",
> "type": "add",
> "fields": {
>  <record fields>
> }
> }, , , ]
>
> Thanks,
>
> On Mon, Feb 8, 2016 at 7:17 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> Thanks! I'll apply it and let you know how it goes.
>>
>> On Mon, Feb 8, 2016 at 6:51 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Ok, I have a patch.  It's actually pretty tiny; the bug is in our code,
>>> not Commons-IO, but a Commons-IO change exposed it.
>>>
>>> I've created a ticket (CONNECTORS-1271) and attached the patch to it.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I have chased this down to a completely broken Apache Commons-IO
>>>> library.  It no longer works with the JSONReader objects in ManifoldCF
>>>> at all, and refuses to read anything from them.  Unfortunately I can't
>>>> change versions of that library because other things depend on it, so
>>>> I'll need to write my own code to replace its functionality.  That will
>>>> take some time.
>>>>
>>>> This probably happened the last time our dependencies were updated.  My
>>>> apologies.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> Thanks,
>>>>>
>>>>> Don't know if it'll help, but removing the usage of JSONObjectReader
>>>>> in addOrReplaceDocumentWithException and posting to Amazon
>>>>> chunk-by-chunk instead of using the JSONArrayReader in flushDocuments
>>>>> changed the error I was getting from Amazon.
>>>>>
>>>>> Maybe those objects are failing when serializing the content to JSON.
>>>>>
>>>>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>>>>> through.  I'll have to open a ticket and create a patch when I find
>>>>>> the problem.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>>>>> unhappy about the JSON format we are sending it.  The deprecation
>>>>>>>> message is probably a strong clue.  I'll experiment here with
>>>>>>>> logging document contents so that I can give you further advice.
>>>>>>>> Stay tuned.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>
>>>>>>>>> I'm actually not seeing anything on Amazon. The CloudSearch
>>>>>>>>> connector fails when sending the request to Amazon CloudSearch:
>>>>>>>>>
>>>>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer
>>>>>>>>> message field] Encountered unexpected end of file"}], "adds": 0,
>>>>>>>>> "__type": "#DocumentServiceException", "message": "{ [\"Encountered
>>>>>>>>> unexpected end of file\"] }", "deletes": 0}'
>>>>>>>>>
>>>>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you can possibly include a snippet of the JSON you are seeing
>>>>>>>>>> on the Amazon end, that would be great.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> More likely this is a bug.
>>>>>>>>>>>
>>>>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does
>>>>>>>>>>> the body clause exist and is just empty, or is it not there at
>>>>>>>>>>> all?
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> When running a copy of the job but with Solr as a target, I'm
>>>>>>>>>>>> seeing the expected content being posted to Solr, so it may not
>>>>>>>>>>>> be an issue with Tika. After adding some more logging to the
>>>>>>>>>>>> CloudSearch connector, I think the data is getting lost just
>>>>>>>>>>>> before it is passed to the DocumentChunkManager, which then
>>>>>>>>>>>> inserts the empty records into the DB. Could it be that the
>>>>>>>>>>>> JSONObjectReader doesn't like my data?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Juan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible
>>>>>>>>>>>>> using a Solr output connection.  If you include the Tika
>>>>>>>>>>>>> extractor in the pipeline, you will want to configure the Solr
>>>>>>>>>>>>> connection to not use the extracting update handler.  There's
>>>>>>>>>>>>> a checkbox on the Schema tab you need to uncheck for that.
>>>>>>>>>>>>> But if you do that, you can see pretty exactly what is being
>>>>>>>>>>>>> sent to Solr; it all gets logged in the INFO messages dumped
>>>>>>>>>>>>> to the Solr log.  This should help you figure out whether the
>>>>>>>>>>>>> problem is your Tika configuration or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've successfully sent data to FileSystems and Solr, but for
>>>>>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are
>>>>>>>>>>>>>> being sent to my domain. I think this may be an issue with
>>>>>>>>>>>>>> how I've set up the Tika Extractor transformation or the
>>>>>>>>>>>>>> field mapping. I think the database where the records are
>>>>>>>>>>>>>> supposed to be stored before flushing to Amazon is storing
>>>>>>>>>>>>>> empty content.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've tried to find documentation on how to set up the Tika
>>>>>>>>>>>>>> transformation, but I haven't been able to find any.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If someone could provide an example of a job set up to send
>>>>>>>>>>>>>> from a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>
>
>
