manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
Date Fri, 17 Jul 2015 05:39:04 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630800#comment-14630800
] 

Karl Wright commented on CONNECTORS-1219:
-----------------------------------------

Hi Abe-san,
This sounds like a workable solution to the cluster problem. Can you
also write your Lucene searcher to use the same technology?

Sent from my Windows Phone
From: Shinichiro Abe (JIRA)
Sent: 7/17/2015 1:18 AM
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

    [ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630787#comment-14630787
]

Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------

Thanks [~apillaiz]. I'd like to collect not only web content but also
content from ManifoldCF repositories.

[~DaddyWri], I discovered
[OakDirectory|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndexEditorContext.java#L89],
which extends Lucene's Directory class. I saw the comment there: they
also had a multi-process (cluster) problem with the Lucene index, and
they put the index into Blob objects, which means MongoDB or RDB
storage. That led me to the idea of switching the Directory
implementation: for instance, we use FSDirectory when mcf runs as a
single process, and
[HdfsDirectory|http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/store/hdfs/HdfsDirectory.html]
when mcf runs as multiple processes. Writes to HDFS were
[slow|https://github.com/ouava/lclient/blob/master/lclient-hdfs/src/main/java/org/apache/lucene/lclient/util/HdfsUtils.java#L47]
when I tried it before, but I expect this to improve.
I don't want to use RMI, for these reasons. First, I want to avoid
complicated operation, or two extra bootstrap steps, in single-process
mode. Second, I don't know how to write the test code. Third, among
the people around me only one user runs multi-process, and everyone
hopes to run mcf as close to OOTB as possible. Fourth, Jackrabbit 2
has an RMI API but Oak doesn't have one. I think RMI is not cool as
well as CMIS rather than JCR. Fifth, I want to make mcf easy to use.
These are not technical reasons, but HdfsDirectory will help us.
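
The Directory switch described above could be sketched roughly as
follows. This is plain Java with no Lucene dependency, only an
illustration of the selection logic. ProcessMode, IndexLocation, and
DirectoryChooser are hypothetical names, not part of ManifoldCF or the
attached patches. The strings "FSDirectory" and "HdfsDirectory" only
name the implementation a real connector would open, and the paths and
the HDFS URI passed in are assumed to come from connector
configuration.

```java
import java.net.URI;
import java.nio.file.Paths;

// Hypothetical: the two deployment modes discussed in this thread.
enum ProcessMode { SINGLE_PROCESS, MULTI_PROCESS }

// Hypothetical value object: which Directory implementation to open, and where.
final class IndexLocation {
    final String directoryImpl;
    final URI uri;

    IndexLocation(String directoryImpl, URI uri) {
        this.directoryImpl = directoryImpl;
        this.uri = uri;
    }
}

// Hypothetical helper, not part of ManifoldCF or the CONNECTORS-1219 patches.
final class DirectoryChooser {
    static IndexLocation choose(ProcessMode mode, String localPath, String hdfsUri) {
        if (mode == ProcessMode.SINGLE_PROCESS) {
            // One process owns the index, so a local file system path is enough.
            return new IndexLocation("FSDirectory", Paths.get(localPath).toUri());
        }
        // Several processes share the index, so it must live on shared storage.
        URI shared = URI.create(hdfsUri);
        if (!"hdfs".equals(shared.getScheme())) {
            throw new IllegalArgumentException("expected an hdfs:// URI, got: " + hdfsUri);
        }
        return new IndexLocation("HdfsDirectory", shared);
    }
}
```

A real connector would hand the chosen location to the matching
Directory implementation instead of carrying a name string. The point
of the sketch is only that single-process mode needs no extra
bootstrap steps.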





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> An output connector that writes directly to a local Lucene index, not via a remote search engine.
> It would be nice if we could use Lucene's various APIs on the index directly, even though we could
> do the same thing with a Solr or Elasticsearch index. I assume we can do things like classification,
> categorization, and tagging, using e.g. the lucene-classification package.



