manifoldcf-user mailing list archives

From Juan Pablo Diaz-Vaz <jpdiaz...@mcplusa.com>
Subject Re: Amazon CloudSearch Connector question
Date Tue, 09 Feb 2016 14:43:08 GMT
I'm using the quick start; I'll try a fresh start.

On Tue, Feb 9, 2016 at 11:42 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Juan,
>
> It occurs to me that you may have records in the document chunk table that
> were corrupted by the earlier version of the connector, and that is what is
> being sent.  Are you using the quick-start example, or Postgres?  If
> Postgres, I'd recommend just deleting all rows in the document chunk table.
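>
> (A throwaway JDBC sketch of that cleanup; the table name "documentchunks",
> the database name, and the credentials below are guesses -- check your
> actual ManifoldCF schema and configuration before running anything like it:)
>
>   import java.sql.Connection;
>   import java.sql.DriverManager;
>   import java.sql.Statement;
>
>   public class ClearChunkTable {
>     public static void main(String[] args) throws Exception {
>       // Assumption: Postgres database "manifoldcf" and chunk table
>       // "documentchunks"; substitute the names your instance actually uses.
>       try (Connection conn = DriverManager.getConnection(
>               "jdbc:postgresql://localhost:5432/manifoldcf", "user", "password");
>            Statement stmt = conn.createStatement()) {
>         int deleted = stmt.executeUpdate("DELETE FROM documentchunks");
>         System.out.println("Deleted " + deleted + " chunk rows");
>       }
>     }
>   }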
>
> Karl
>
>
> On Tue, Feb 9, 2016 at 9:13 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> This is a puzzle; the only way this could occur is if some of the records
>> being produced generated absolutely no JSON.  Since there is an ID and a
>> type record for all of them I can't see how this could happen.  So we must
>> be adding records for documents that don't exist somehow?  I'll have to
>> look into it.
>>
>> Karl
>>
>> On Tue, Feb 9, 2016 at 8:49 AM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> Hi,
>>>
>>> The patch worked and now at least the POST has content.  Amazon is
>>> responding with a Parsing Error, though.
>>>
>>> I logged the message before it gets posted to Amazon, and it's not
>>> valid JSON; it has extra commas and parenthesis characters where
>>> records are concatenated.  I don't know if this is an issue with my
>>> setup or with the JSONArrayReader.
>>>
>>> [{
>>> "id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
>>> "type": "add",
>>> "fields": {
>>>  <record fields>
>>> }
>>> }, , {
>>> "id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C",
>>> "type": "add",
>>> "fields": {
>>>  <record fields>
>>> }
>>> }, , , , , , , , , , , , , , , , {
>>> "id": "92C7EDAD8398DAC797A7DEA345C1859E6E9897FB",
>>> "type": "add",
>>> "fields": {
>>>  <record fields>
>>> }
>>> }, , , ]
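>>>
>>> For what it's worth, the pattern above looks like a list separator being
>>> emitted even for elements that produced no output.  A minimal sketch of
>>> the kind of guard I'd expect when the fragments are concatenated (names
>>> are hypothetical, not the actual connector code):
>>>
>>>   // Join per-document JSON fragments into one batch array, skipping
>>>   // fragments that came out empty (e.g. from corrupted chunk rows).
>>>   static String joinBatch(java.util.List<String> fragments) {
>>>     StringBuilder sb = new StringBuilder("[");
>>>     boolean first = true;
>>>     for (String fragment : fragments) {
>>>       if (fragment == null || fragment.isEmpty())
>>>         continue;              // an empty fragment must not emit a comma
>>>       if (!first)
>>>         sb.append(",");
>>>       sb.append(fragment);
>>>       first = false;
>>>     }
>>>     return sb.append("]").toString();
>>>   }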
>>>
>>> Thanks,
>>>
>>> On Mon, Feb 8, 2016 at 7:17 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> Thanks! I'll apply it and let you know how it goes.
>>>>
>>>> On Mon, Feb 8, 2016 at 6:51 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Ok, I have a patch.  It's actually pretty tiny; the bug is in our
>>>>> code, not Commons-IO, but the Commons-IO change is what exposed it.
>>>>>
>>>>> I've created a ticket (CONNECTORS-1271) and attached the patch to it.
>>>>>
>>>>> Thanks!
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have chased this down to a completely broken Apache Commons-IO
>>>>>> library.  It no longer works with the JSONReader objects in ManifoldCF at
>>>>>> all, and refuses to read anything from them.  Unfortunately I can't change
>>>>>> versions of that library because other things depend upon it.  So I'll need
>>>>>> to write my own code to replace its functionality.  That will take some
>>>>>> amount of time to do.
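>>>>>>
>>>>>> As a starting point, something like this hand-rolled copy loop (a
>>>>>> sketch only, not the eventual patch) would avoid Commons-IO entirely:
>>>>>>
>>>>>>   // Drain a Reader into a Writer without Commons-IO.
>>>>>>   static void copy(java.io.Reader in, java.io.Writer out)
>>>>>>       throws java.io.IOException {
>>>>>>     final char[] buffer = new char[4096];
>>>>>>     int n;
>>>>>>     while ((n = in.read(buffer)) != -1)
>>>>>>       out.write(buffer, 0, n);
>>>>>>   }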
>>>>>>
>>>>>> This probably happened the last time our dependencies were updated.
>>>>>> My apologies.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <
>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> I don't know if it'll help, but removing the usage of JSONObjectReader
>>>>>>> in addOrReplaceDocumentWithException, and posting to Amazon chunk-by-chunk
>>>>>>> instead of using the JSONArrayReader in flushDocuments, changed the error I
>>>>>>> was getting from Amazon.
>>>>>>>
>>>>>>> Maybe those objects are failing to parse the content into JSON.
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>>>>>>> through.  I'll have to open a ticket and create a patch when I find the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you very much.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>>>>>>> unhappy about the JSON format we are sending it.  The deprecation message
>>>>>>>>>> is probably a strong clue.  I'll experiment here with logging document
>>>>>>>>>> contents so that I can give you further advice.  Stay tuned.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm actually not seeing anything on Amazon.  The CloudSearch
>>>>>>>>>>> connector fails when sending the request to Amazon CloudSearch:
>>>>>>>>>>>
>>>>>>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>>>>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>>>>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>>>>>>>>> file\"] }", "deletes": 0}'
>>>>>>>>>>>
>>>>>>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you can possibly include a snippet of the JSON you are
>>>>>>>>>>>> seeing on the Amazon end, that would be great.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> More likely this is a bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>>>>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When running a copy of the job, but with Solr as the target,
>>>>>>>>>>>>>> I'm seeing the expected content being posted to Solr, so it may not be an
>>>>>>>>>>>>>> issue with Tika.  After adding some more logging to the CloudSearch
>>>>>>>>>>>>>> connector, I think the data is getting lost just before it is passed to the
>>>>>>>>>>>>>> DocumentChunkManager, which inserts the empty records into the DB.  Could it
>>>>>>>>>>>>>> be that the JSONObjectReader doesn't like my data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Juan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible
>>>>>>>>>>>>>>> using a Solr output connection.  If you include the Tika extractor in the
>>>>>>>>>>>>>>> pipeline, you will want to configure the Solr connection not to use the
>>>>>>>>>>>>>>> extracting update handler.  There's a checkbox on the Schema tab you need
>>>>>>>>>>>>>>> to uncheck for that.  But if you do that, you can see pretty much exactly
>>>>>>>>>>>>>>> what is being sent to Solr; it all gets logged in the INFO messages written
>>>>>>>>>>>>>>> to the Solr log.  This should help you figure out whether the problem is
>>>>>>>>>>>>>>> your Tika configuration or not.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've successfully sent data to FileSystems and Solr, but
>>>>>>>>>>>>>>>> for Amazon CloudSearch I'm seeing that only empty messages are being sent
>>>>>>>>>>>>>>>> to my domain.  I think this may be an issue with how I've set up the Tika
>>>>>>>>>>>>>>>> Extractor transformation or the field mapping.  I think the database where
>>>>>>>>>>>>>>>> the records are supposed to be stored before being flushed to Amazon is
>>>>>>>>>>>>>>>> storing empty content.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've tried to find documentation on how to set up the Tika
>>>>>>>>>>>>>>>> transformation, but I haven't been able to find any.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If someone could provide an example of a job set up to send
>>>>>>>>>>>>>>>> from a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>> +56 9 84265890
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>> +56 9 84265890
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890
