manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Amazon CloudSearch Connector question
Date Tue, 09 Feb 2016 14:43:48 GMT
Sure; please blow away the database instance first, and then you should be
all set.

Karl
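
For the quick-start, blowing away the database instance amounts to
stopping the process and deleting the embedded database files. A minimal
sketch, assuming the default example layout; the dbname directory is the
assumed default location, so check your install before deleting:

    cd example            # the quick-start directory
    rm -rf dbname         # embedded database files (assumed default location)
    java -jar start.jar   # restart with a fresh database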


On Tue, Feb 9, 2016 at 9:43 AM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com>
wrote:

> I'm using the quick start; I'll try a fresh start.
>
> On Tue, Feb 9, 2016 at 11:42 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Juan,
>>
>> It occurs to me that you may have records in the document chunk table
>> that were corrupted by the earlier version of the connector, and that is
>> what is being sent.  Are you using the quick-start example, or Postgres?
>> If Postgres, I'd recommend just deleting all rows in the document chunk
>> table.
>>
>> Karl
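
On Postgres, the cleanup Karl suggests is a single statement. A minimal
sketch, assuming a chunk table named documentchunks; the actual table
name depends on the connector version, so verify it (e.g. with \dt in
psql) before running:

    -- "documentchunks" is a hypothetical name; substitute the real
    -- chunk table from your schema
    DELETE FROM documentchunks;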
>>
>>
>> On Tue, Feb 9, 2016 at 9:13 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> This is a puzzle; the only way this could occur is if some of the
>>> records being produced generated absolutely no JSON.  Since there is an ID
>>> and a type record for all of them I can't see how this could happen.  So we
>>> must be adding records for documents that don't exist somehow?  I'll have
>>> to look into it.
>>>
>>> Karl
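
The failure mode described above matches naive comma-joining of
per-record fragments: an empty fragment leaves a bare separator behind,
which is exactly the ", ," pattern in the captured output quoted below.
A hedged sketch of the skip-empties approach (illustrative only, not the
connector's actual code):

    import java.util.List;

    public class JsonBatchJoiner {
      /** Join per-document JSON fragments into one array, skipping empties. */
      public static String joinBatch(List<String> fragments) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (String fragment : fragments) {
          // An empty fragment must not emit a separator; otherwise the
          // output contains ", ," sequences, which is invalid JSON.
          if (fragment == null || fragment.isEmpty()) {
            continue;
          }
          if (!first) {
            sb.append(",");
          }
          sb.append(fragment);
          first = false;
        }
        return sb.append("]").toString();
      }
    }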
>>>
>>> On Tue, Feb 9, 2016 at 8:49 AM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The patch worked and now at least the POST has content. Amazon is
>>>> responding with a Parsing Error though.
>>>>
>>>> I logged the message before it gets posted to Amazon, and it's not
>>>> valid JSON: it has extra commas and parenthesis characters where
>>>> records are concatenated. I don't know if this is an issue with my
>>>> setup or with the JSONArrayReader.
>>>>
>>>> [{
>>>> "id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
>>>> "type": "add",
>>>> "fields": {
>>>>  <record fields>
>>>> }
>>>> }, , {
>>>> "id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C",
>>>> "type": "add",
>>>> "fields": {
>>>>  <record fields>
>>>> }
>>>> }, , , , , , , , , , , , , , , , {
>>>> "id": "92C7EDAD8398DAC797A7DEA345C1859E6E9897FB",
>>>> "type": "add",
>>>> "fields": {
>>>>  <record fields>
>>>> }
>>>> }, , , ]
>>>>
>>>> Thanks,
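
For comparison, a well-formed CloudSearch document batch is a JSON array
of add/delete operations with no empty entries, along these lines (the
field names here are placeholders):

    [{
      "type": "add",
      "id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
      "fields": {
        "title": "example"
      }
    }, {
      "type": "delete",
      "id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C"
    }]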
>>>>
>>>> On Mon, Feb 8, 2016 at 7:17 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> Thanks! I'll apply it and let you know how it goes.
>>>>>
>>>>> On Mon, Feb 8, 2016 at 6:51 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Ok, I have a patch.  It's actually pretty tiny; the bug is in our
>>>>>> code, not Commons-IO, but Commons-IO changed things in a way that
>>>>>> triggered it.
>>>>>>
>>>>>> I've created a ticket (CONNECTORS-1271) and attached the patch to it.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I have chased this down to a completely broken Apache Commons-IO
>>>>>>> library.  It no longer works with the JSONReader objects in
>>>>>>> ManifoldCF at all, and refuses to read anything from them.
>>>>>>> Unfortunately I can't change versions of that library because other
>>>>>>> things depend upon it.  So I'll need to write my own code to replace
>>>>>>> its functionality.  That will take some amount of time to do.
>>>>>>>
>>>>>>> This probably happened the last time our dependencies were updated.
>>>>>>> My apologies.
>>>>>>>
>>>>>>> Karl
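
For context on what replacing the Commons-IO functionality could
involve: the read side amounts to draining a java.io.Reader by hand. A
minimal sketch under that assumption (not the actual CONNECTORS-1271
patch):

    import java.io.IOException;
    import java.io.Reader;

    public final class ReaderUtil {
      /** Drain a Reader into a String using a fixed-size buffer. */
      public static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) { // -1 signals end of stream
          sb.append(buffer, 0, count);
        }
        return sb.toString();
      }
    }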
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <
>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> I don't know if it'll help, but removing the use of
>>>>>>>> JSONObjectReader in addOrReplaceDocumentWithException, and posting
>>>>>>>> to Amazon chunk-by-chunk instead of using the JSONArrayReader in
>>>>>>>> flushDocuments, changed the error I was getting from Amazon.
>>>>>>>>
>>>>>>>> Maybe those objects are failing when converting the content to JSON.
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>>>>>>>> through.  I'll have to open a ticket and create a patch when I
>>>>>>>>> find the problem.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you very much.
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>>>>>>>> unhappy about the JSON format we are sending it.  The
>>>>>>>>>>> deprecation message is probably a strong clue.  I'll experiment
>>>>>>>>>>> here with logging document contents so that I can give you
>>>>>>>>>>> further advice.  Stay tuned.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm actually not seeing anything on Amazon.  The CloudSearch
>>>>>>>>>>>> connector fails when sending the request to Amazon CloudSearch:
>>>>>>>>>>>>
>>>>>>>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>>>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer
>>>>>>>>>>>> message field] Encountered unexpected end of file"}], "adds": 0,
>>>>>>>>>>>> "__type": "#DocumentServiceException", "message": "{
>>>>>>>>>>>> [\"Encountered unexpected end of file\"] }", "deletes": 0}'
>>>>>>>>>>>>
>>>>>>>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you can possibly include a snippet of the JSON you are
>>>>>>>>>>>>> seeing on the Amazon end, that would be great.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <
>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> More likely this is a bug.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does
>>>>>>>>>>>>>> the body clause exist and is just empty, or is it not there
>>>>>>>>>>>>>> at all?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When running a copy of the job, but with SOLR as a target,
>>>>>>>>>>>>>>> I'm seeing the expected content being posted to SOLR, so it
>>>>>>>>>>>>>>> may not be an issue with TIKA. After adding some more
>>>>>>>>>>>>>>> logging to the CloudSearch connector, I think the data is
>>>>>>>>>>>>>>> getting lost just before it is passed to the
>>>>>>>>>>>>>>> DocumentChunkManager, which inserts empty records into the
>>>>>>>>>>>>>>> DB. Could it be that the JSONObjectReader doesn't like my
>>>>>>>>>>>>>>> data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Juan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible
>>>>>>>>>>>>>>>> using a Solr output connection.  If you include the Tika
>>>>>>>>>>>>>>>> extractor in the pipeline, you will want to configure the
>>>>>>>>>>>>>>>> Solr connection to not use the extracting update handler.
>>>>>>>>>>>>>>>> There's a checkbox on the Schema tab you need to uncheck
>>>>>>>>>>>>>>>> for that.  If you do that, you can see almost exactly what
>>>>>>>>>>>>>>>> is being sent to Solr; it all gets logged in the INFO
>>>>>>>>>>>>>>>> messages dumped to the Solr log.  This should help you
>>>>>>>>>>>>>>>> figure out whether the problem is your Tika configuration
>>>>>>>>>>>>>>>> or not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but
>>>>>>>>>>>>>>>>> for Amazon CloudSearch I'm seeing that only empty messages
>>>>>>>>>>>>>>>>> are being sent to my domain. I think this may be an issue
>>>>>>>>>>>>>>>>> with how I've set up the TIKA Extractor Transformation or
>>>>>>>>>>>>>>>>> the field mapping. I think the database where the records
>>>>>>>>>>>>>>>>> are supposed to be stored before flushing to Amazon is
>>>>>>>>>>>>>>>>> storing empty content.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've tried to find documentation on how to set up the
>>>>>>>>>>>>>>>>> TIKA Transformation, but I haven't been able to find any.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If someone could provide an example of a job setup to
>>>>>>>>>>>>>>>>> send from a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>> +56 9 84265890
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>> +56 9 84265890
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>> +56 9 84265890
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>> Full Stack Developer - MC+A Chile
>>>>> +56 9 84265890
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>
>
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>
