manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1233) AmazonS3 Repository Connector
Date Tue, 08 Sep 2015 08:24:46 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734410#comment-14734410
] 

Karl Wright commented on CONNECTORS-1233:
-----------------------------------------

Hi [~kbird],
*If* there is no available API way to get the length of the file (I rather doubt this!!),
*then* you can read the file into a local temporary file, and get the length of that.  Then
you can use an input stream created from the temporary file to index using RepositoryDocument.
 Just make sure to clean up the temporary file when done.
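
A minimal sketch of the temporary-file fallback described above, using only the JDK. The class name TempFileLength, the method spoolAndConsume, and the SizedStreamConsumer callback are hypothetical names for illustration; in a real connector the callback is where the stream and its length would be handed to RepositoryDocument. This sketch does not use ManifoldCF's own temporary-file support, so it does not cover the case of an agent process being killed mid-operation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileLength {

    /** Hypothetical callback standing in for the indexing step. */
    public interface SizedStreamConsumer {
        void accept(InputStream stream, long length) throws IOException;
    }

    /**
     * Copies a stream of unknown length to a temporary file, measures the file,
     * then passes a fresh stream over the file plus its length to the consumer.
     * The temporary file is deleted whether or not the consumer succeeds.
     */
    public static long spoolAndConsume(InputStream source, SizedStreamConsumer consumer)
            throws IOException {
        Path temp = Files.createTempFile("s3object-", ".tmp");
        try {
            // createTempFile already created the file, so it must be overwritten.
            Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
            long length = Files.size(temp);
            try (InputStream in = Files.newInputStream(temp)) {
                consumer.accept(in, length);
            }
            return length;
        } finally {
            // Clean up the temporary file when done.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "content of unknown length".getBytes(StandardCharsets.UTF_8);
        spoolAndConsume(new ByteArrayInputStream(payload),
                (in, len) -> System.out.println("would index " + len + " bytes"));
    }
}
```

The inner stream is closed before the finally block runs, so the delete also works on platforms that refuse to remove open files.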

There is ManifoldCF API support for temporary files as well, which handles the case when agent
processes are killed or exit unexpectedly.  So I'd look carefully at connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/DataCache.java.

Our main connectors which do this kind of thing are the RSS and Web connectors.

Thanks!

> AmazonS3 Repository Connector
> -----------------------------
>
>                 Key: CONNECTORS-1233
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1233
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Gunaratnam Kuhajeyan
>            Assignee: Karl Wright
>              Labels: features
>             Fix For: ManifoldCF 2.3
>
>         Attachments: amazons3patch-fixunboundedsize.diff, amazons3patch.diff, amazons3patchnew1.diff,
dependencies.docx, patch-removed-unwanted-dependencies-connector-1233.diff, patch-tikaremoved.diff
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> Feature Patch 
> AmazonS3 Repository Connector
> A. Overview
> 1. Connects to Amazon S3 buckets and indexes the artifacts. Buckets that should be avoided
can be skipped (this is configured in the job).
> 2. Internally, documents are parsed and metadata is extracted using Tika.
> 3. Supported locale - English US (currently only common_en_US.properties is available; looking
for support from someone to translate the keys).
> B. Documentation - Work in progress; will be attached to the issue in the following days.
> C. Dependencies - (common-lib)
> 1. aws-java-sdk-{version}.jar
> 2. aws-java-sdk-core-{version}.jar
> 3. aws-java-sdk-s3-{version}.jar
> 4. joda-time-2.2.jar
> D. Connectors.xml
>  <!-- Add your authority connectors here -->
> <authorityconnector name="Amazons3" class="org.apache.manifoldcf.authorities.authorities.amazons3.AmazonS3Authority"/>
>
> <!-- Add your repository connectors here -->
> <repositoryconnector name="AmazonS3" class="org.apache.manifoldcf.crawler.connectors.amazons3.AmazonS3Connector"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
