manifoldcf-dev mailing list archives

From "Shinichiro Abe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
Date Sat, 18 Jul 2015 02:53:04 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632230#comment-14632230 ]

Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------

It will work if we just create a new IndexSearcher with a new IndexReader that takes an HdfsDirectory.


As for the searcher, it depends on whether we use near-real-time search or not.
(1) Writer and searcher coexist.
This is an approach like Solr/SolrCloud or Elasticsearch.
The IndexSearcher can search the documents the IndexWriter holds.
Even if writing to HDFS is slow, the IndexSearcher can search the in-memory, uncommitted
documents from the IndexWriter.
(2) Separate the writer side from the searcher side.
This is an approach like Solr's legacy master (writer) / slave (searcher) architecture,
so we can't use near-real-time search.
The IndexSearcher searches the documents on HDFS, which holds only the documents committed
by the IndexWriter.
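
A toy model of the visibility difference between (1) and (2), in plain Java. This is not the
Lucene API and not the code in the patch: `ToyIndex` and its methods are hypothetical names
made up for illustration. In Lucene itself, (1) roughly corresponds to opening a reader from
the IndexWriter, and (2) to opening a reader from the Directory after a commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of index visibility. Hypothetical; NOT the Lucene API. */
final class ToyIndex {
    // Stands in for the segments already committed to storage (e.g. HDFS).
    private final List<String> committed = new ArrayList<>();
    // Stands in for the writer's in-memory buffer of uncommitted documents.
    private final List<String> buffered = new ArrayList<>();

    /** Adds a document to the writer's buffer without committing it. */
    void add(String doc) {
        buffered.add(doc);
    }

    /** Moves all buffered documents into the committed set. */
    void commit() {
        committed.addAll(buffered);
        buffered.clear();
    }

    /** Approach (1): a point-in-time view of committed plus buffered documents. */
    List<String> openNearRealTimeView() {
        List<String> view = new ArrayList<>(committed);
        view.addAll(buffered);
        return view;
    }

    /** Approach (2): a point-in-time view of committed documents only. */
    List<String> openCommittedView() {
        return new ArrayList<>(committed);
    }
}
```

With one document added and not yet committed, `openNearRealTimeView()` contains it while
`openCommittedView()` stays empty until `commit()` is called.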

Which one fits the MCF standard?

In Solr, Elasticsearch, Oak and Sling, documents are searchable as soon as clients post them.
Oak and Sling are content repositories with a search index built on a push model (a client
posts a document, which is then stored in the repository and indexed at the same time),
though these are bound by the JCR standard. MCF, on the other hand, is a pull model. The
search application behind the output connector is responsible for how soon documents become
searchable. So, according to the MCF standard, the Lucene connector will have to choose (2)
with the plugin, but then near-real-time search is lost. I intended (1) in the v0.3 patch.

By the way, I heard from someone that Alfresco, Liferay and Drupal are also content
repositories with pull-model crawls. These differ from MCF's document version checking,
though: they can index documents using something like transaction info about document CRUD
operations, which is managed on the repository side, so documents are indexed and become
searchable quickly. MCF is bound by the limitations of the repository side, e.g. concurrent
access limits (shared drive, web, Alfresco, CMIS, SharePoint… almost all repositories?) or
heavy CPU load on the repository side caused by multi-threaded access. Unfortunately, I have
sometimes heard from users that MCF crawls are slow. Of course I knew, and explained to them,
that this is outside MCF's control, and then adjusted the repository side or customized
existing connectors. As my first approach to this, I had the idea of indexing documents to
local disk with Lucene, without any HTTP transport, and using near-real-time search over the
writer's buffered documents, i.e. approach (1). Currently I have no idea how to address the
repository-side limitations, though.
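
To make the contrast between MCF's document version checking and a repository-side
transaction log concrete, here is a small sketch in plain Java. It is only an illustration
under assumptions: `CrawlSketch`, both method names, and the map-based "repository" are
hypothetical and do not correspond to the ManifoldCF connector API or to any of the products
named above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical sketch of two ways a pull-model crawler can find changed documents. */
final class CrawlSketch {

    /**
     * Version checking: every document's current version string is fetched and
     * compared with the one recorded at the last crawl, so the cost grows with
     * the total number of documents in the repository.
     */
    static List<String> changedByVersionCheck(Map<String, String> repoVersions,
                                              Map<String, String> lastSeen) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : repoVersions.entrySet()) {
            // A document is changed if it is new or its version string differs.
            if (!e.getValue().equals(lastSeen.get(e.getKey()))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    /**
     * Transaction log: the repository records (transaction id -> document id),
     * so the crawler reads only entries newer than the last id it processed.
     */
    static List<String> changedByTxnLog(SortedMap<Long, String> txnLog, long lastTxn) {
        return new ArrayList<>(txnLog.tailMap(lastTxn + 1).values());
    }
}
```

The first method has to look at every document on each crawl, which is where the load on the
repository comes from; the second only reads the tail of the log.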

> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> An output connector that writes to a local Lucene index directly, not via a remote search
> engine. It would be nice if we could use Lucene's various APIs on the index directly, even
> though we could do the same thing with a Solr or Elasticsearch index. I assume we can do
> things like classification, categorization, and tagging, using e.g. the
> lucene-classification package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
