manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Google Drive processing
Date Mon, 27 Oct 2014 21:03:35 GMT
Hi Ethan,

The activity logging for most connectors in 1.7.1 is not complete.  See
CONNECTORS-1077 for details.  But in that case you should not see a record
labeled with your connection name of the type "document ingest", and you
*should* see errors in the ManifoldCF log.


On Mon, Oct 27, 2014 at 3:17 PM, Ethan Wilansky <ethanwilansky@gmail.com>
wrote:

> Hi Karl,
> In simple history there are no indexing activity records showing a size of
> 0. All of the content on this Google Drive endpoint is either small uploaded
> files (docx, pptx, pdf, txt) or Google Docs generated documents,
> spreadsheets and presentations.
>
> With regard to opening a ticket, it might not be worth your while.
> Ultimately, our use case is that we will be leveraging an ES Output
> Connection for retrieving metadata and we will store the binaries on the
> file system. We don’t want to use the ES Attachment plug-in, which is why I
> thought we might be able to combine the ES Output Connection and a File
> System Connection in a job. I suppose another option would be to involve
> Tika, but I’m not clear on whether this will allow me to store the metadata
> in ES with a pointer to the binary in the file system.
>
> Thanks,
> Ethan
>
> On Oct 27, 2014, at 2:27 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Ethan,
>
> This does not sound like it is related in any way to the Google Drive
> connection, unless for some reason the Google API is considering some of
> the documents fetched to have only metadata and no content.  In this case,
> you'd see a size of zero in the simple history for the indexing activity
> record.  Is that what you see?
>
> As for the filename issues -- the file system output connection is supposed
> to emulate WGET.  However, there are a number of known issues with this
> connector, for example CONNECTORS-814, and I believe the handling of "&" is
> one such issue.  I don't think these characters are allowed in file names
> on several operating systems.
>
> Please open a ticket, and describe how you think it should behave (e.g.
> how it should map &'s in urls to legal file name characters), and I'll try
> to come up with a quick patch.
>
> Karl
>
>
> On Mon, Oct 27, 2014 at 12:15 PM, Ethan Wilansky <ethanwilansky@gmail.com>
> wrote:
>
>> I’ve run a job that uses a Google Drive Repository Connection and File
>> System Output Connection. My output is pointing to d:\temp\mf on the
>> machine running ManifoldCF.
>>
>> Upon running the job, job status shows:
>> Error: Could not create file 'd:\temp\mf\https\
>> doc-0g-1c-docs.googleusercontent.com\docs\securesc\288dijb8 lhptipmnpc6n3dap4bdki35j\ek70aeovi25lp7aibkar61h90pi1i2c3\1414418400000\14058876669334088852\07105634325979498590\0B4rsPDZwaBMUZjI3VGpzZi10dUU?h=00194472260389282923&e=download&gd=true'
>> *(The filename, directory name, or volume label syntax is incorrect)*
>>
>> The same error (the file name, directory name, or volume label syntax is
>> incorrect) is reported for one more file. So, out of 12 files total, 10
>> are reported as processed. However, none of the files reported as
>> successfully processed appear in the file system.
>>
>> The path created below the directory I’ve specified for the job
>> (d:\temp\mf) also looks unusual. I’m seeing something like the following
>> as the path structure:
>> D:\temp\mf\https\doc-0g-1c-docs.googleusercontent.com
>> \docs\securesc\288dijb8lhptipmnpc6n3dap4bdki35j\ek3m4mhv978b7a2elgov6cm9nipbv36e\1414418400000\13058876669334088852\07105634445979498592
>>
>> Document Status and Queue Status show nothing unusual. I’m running
>> ManifoldCF release 1.7.1.
>>
>> Could this be an issue with the way I’m configuring the File System
>> Output Connection or is there something else I need to configure? I
>> properly configured the refresh token, client id and client secret in the
>> Repository Connection.
>>
>> I’ve attached the JSON for the Repository Connection (with client id,
>> client secret and refresh token values removed), my Output Connection and
>> Job Definition.
>>
>> Thanks in advance for your feedback
>> Ethan
>
>
