manifoldcf-user mailing list archives

From Ethan Wilansky <ethanwilan...@gmail.com>
Subject Re: Google Drive processing
Date Mon, 27 Oct 2014 19:17:45 GMT
Hi Karl,
In the simple history there are no indexing activity records showing a size of 0. All of the content on
this Google Drive endpoint is either small uploaded files (docx, pptx, pdf, txt) or Google
Docs generated documents, spreadsheets and presentations.

With regard to opening a ticket, it might not be worth your while. Ultimately, our use case
is that we will be leveraging an ES Output Connection for retrieving metadata and we will
store the binaries on the file system. We don’t want to use the ES Attachment plug-in, which
is why I thought we might be able to combine the ES Output Connection and a File System Connection
in a job. I suppose another option would be to involve Tika, but I’m not clear on whether
this will allow me to store the metadata in ES with a pointer to the binary in the file system.
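
To make the layout I have in mind concrete, here is a minimal sketch. It is not ManifoldCF or Elasticsearch API code; the function names and the `source_id` and `binary_path` fields are hypothetical, and it assumes the binary is written to disk first and the resulting JSON body is then indexed by whatever client is in use.

```python
import hashlib
import json
from pathlib import Path


def store_binary(root, doc_id, content):
    """Write the binary under root, named by a hash of doc_id, and return its path."""
    # Hashing the id sidesteps characters the file system may reject.
    name = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    path = Path(root) / name[:2] / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path


def build_index_document(doc_id, metadata, binary_path):
    """Return a JSON body holding the metadata plus a pointer to the binary."""
    doc = dict(metadata)
    doc["source_id"] = doc_id              # hypothetical field name
    doc["binary_path"] = str(binary_path)  # hypothetical field name
    return json.dumps(doc, sort_keys=True)
```

The index then holds only metadata and a path, so no attachment plug-in is involved.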

Thanks,
Ethan


> On Oct 27, 2014, at 2:27 PM, Karl Wright <daddywri@gmail.com> wrote:
> 
> Hi Ethan,
> 
> This does not sound like it is related in any way to the Google Drive connection, unless
> for some reason the Google API is considering some of the documents fetched to have only metadata
> and no content.  In this case, you'd see a size of zero in the simple history for the indexing activity
> record.  Is that what you see?
> 
> As for the filename issues -- file system output connection is supposed to emulate WGET.
> However, there are a number of known issues with this connector, for example CONNECTORS-814,
> and I believe the handling of "&" is one such issue.  I don't think these characters are
> allowed in file names on several operating systems.
> 
> Please open a ticket, and describe how you think it should behave (e.g. how it should
> map &'s in URLs to legal file name characters), and I'll try to come up with a quick patch.
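
One possible mapping, sketched under stated assumptions: this is not how the File System Output Connection behaves today, and `url_to_path_segments` is a hypothetical helper. It percent-encodes the characters Windows rejects in file names (the `?` in the reported path is one of them) and also encodes `&`, which is included only because the thread raises it.

```python
import re
from urllib.parse import urlsplit

# Characters Windows rejects in a file or directory name, plus '&'
# (added here as an assumption, since the thread raises it).
ILLEGAL = re.compile(r'[<>:"/\\|?*&\x00-\x1f]')


def url_to_path_segments(url):
    """Split a URL into segments that are legal as Windows file names."""
    parts = urlsplit(url)
    segments = [parts.scheme, parts.netloc]
    segments += [s for s in parts.path.split("/") if s]
    if parts.query:
        # Keep the query as part of the last segment; its '?' is encoded below.
        segments[-1] = segments[-1] + "?" + parts.query
    # Replace each illegal character with its %XX form.
    return [
        ILLEGAL.sub(lambda m: "%{:02X}".format(ord(m.group())), s)
        for s in segments
    ]
```

For a made-up URL such as `https://host.example/docs/abc?h=1&e=download`, the last segment becomes `abc%3Fh=1%26e=download`, which stays reversible because the original characters can be recovered by percent-decoding.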
> 
> Karl
> 
> 
> On Mon, Oct 27, 2014 at 12:15 PM, Ethan Wilansky <ethanwilansky@gmail.com> wrote:
> I’ve run a job that uses a Google Drive Repository Connection and File System Output
> Connection. My output is pointing to d:\temp\mf on the machine running ManifoldCF.
> 
> Upon running the job, job status shows:
> Error: Could not create file 'd:\temp\mf\https\doc-0g-1c-docs.googleusercontent.com\docs\securesc\288dijb8lhptipmnpc6n3dap4bdki35j\ek70aeovi25lp7aibkar61h90pi1i2c3\1414418400000\14058876669334088852\07105634325979498590\0B4rsPDZwaBMUZjI3VGpzZi10dUU?h=00194472260389282923&e=download&gd=true'
> (The filename, directory name, or volume label syntax is incorrect)
> 
> The file system reports the same error (the file name, directory name, or volume label
> syntax is incorrect) for one other file. So, out of 12 files total, 10 are processed. However,
> none of the files reported as successfully processed appear in the file system.
> 
> I think the file system path is unusual beyond what I’ve specified for the job (d:\temp\mf).
> I’m seeing something like the following as the path structure:
> D:\temp\mf\https\doc-0g-1c-docs.googleusercontent.com\docs\securesc\288dijb8lhptipmnpc6n3dap4bdki35j\ek3m4mhv978b7a2elgov6cm9nipbv36e\1414418400000\13058876669334088852\07105634445979498592
> 
> Document Status and Queue Status show nothing unusual. I’m running ManifoldCF release
> v1.7.1.
> 
> Could this be an issue with the way I’m configuring the File System Output Connection,
> or is there something else I need to configure? I properly configured the refresh token, client
> id and client secret in the Repository Connection.
> 
> I’ve attached the JSON for the Repository Connection (with client id, client secret
> and refresh token values removed), my Output Connection and Job Definition.
> 
> Thanks in advance for your feedback
> Ethan
> 

