manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-916) Amazon CloudSearch output connector
Date Mon, 19 May 2014 17:42:38 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002049#comment-14002049
] 

Karl Wright commented on CONNECTORS-916:
----------------------------------------

Hi Takumi,

I see a couple of issues with this patch.

(1) There is already a poi download; you can't download two different versions of poi to the
same place and have anything sensible happen.
(2) The patch includes a large number of new jars.  Somebody needs to do research to find
out with the license is for each of these jars, and they will need to be mentioned in the
appropriate license file.  If any of the jars is not an Apache-approved license, something
else will need to be done.  If you can summarize the jars being added and their licenses,
I can write the actual license update.
(3) There's some nonsense code in the build file I don't understand, for example:

{code}
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
{code}

(4) I see a lot of diffs of this kind, which don't appear to be changing anything.  What is
happening here?

{code}
-  {
-    // Establish a session
+  {
+    // Establish a session
     getSession();
     
     SpecPacker sp = new SpecPacker(outputDescription);
     
-    String jsondata = "";
+    String jsondata = "";
{code}

(5) I have some concerns about this:

{code}
+      // generate document id from document URI using md5 checksum.
+      // document id must be shorter than 128 characters because of AmazonCloudSearch.
+      try {
+        doc.setId(URLEncoder.encode(generateid(documentURI),"utf-8"));
+      } catch (NoSuchAlgorithmException e) {
+        throw new ManifoldCFException(e);
+      }
+      
{code}

Generally, since hashes have a possibility of colliding, it is better to send the URI as metadata,
and then look up the document identifier by searching for the document metadata URI.  You
can create a document identifier then only if you don't find the matching record.  If that
is not possible, I'd still structure the code so that there's a single place where the hashing
is done.  You might also look at and use:

http://manifoldcf.apache.org/release/trunk/api/framework/org/apache/manifoldcf/core/system/ManifoldCF.html#encrypt%28java.lang.String%29




Why does this need to be included 5 times? xercesImpl is also included twice too.




> Amazon CloudSearch output connector
> -----------------------------------
>
>                 Key: CONNECTORS-916
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-916
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Takumi Yoshida
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: 0507.diff, 0520.diff, 1.patch, 2.diff, 3.diff, AmazonCloudSearchParam.java,
AmazonCloudSearchSpecs.java, exception_handling.diff, exception_handling_2.diff
>
>
> I wrote some codes snipetts of output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl web site and feed HTML page to Amazon
CloudSearch.
> but it is not perfectly completed followoing reason.
> - does not write any codes for configuration page.
> - supporting file type is only HTML
> Thank you for your time,
>  Takumi Yoshida



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message