manifoldcf-dev mailing list archives

From Karl Wright <>
Subject RE: [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
Date Sat, 18 Jul 2015 07:59:05 GMT
Hi Abe-san,
The repository problem is hard to fix because it is a characteristic of the
repository. Only model_add_change_delete connectors would be expected
to work in real time. And none of our connectors have this model
because no repositories support it.

Maybe you could write a repository connector to a push technology that
other repository manufacturers might make an effort to integrate with,
but that would be for the future anyhow.

Sent from my Windows Phone
From: Shinichiro Abe (JIRA)
Sent: 7/17/2015 10:53 PM
Subject: [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector


Shinichiro Abe commented on CONNECTORS-1219:

It will work if we just create a new IndexSearcher with a new IndexReader
that takes an HdfsDirectory.

As for the searcher, it depends on whether we use near-real-time search or not.
(1) Writer and searcher coexist.
This is the approach taken by Solr/SolrCloud and Elasticsearch.
The IndexSearcher can search the documents the IndexWriter holds.
Even if writing to HDFS is slow, the IndexSearcher can search the in-memory,
uncommitted documents from the IndexWriter.
(2) Separate the writer side from the searcher side.
This is an approach like Solr's legacy
master (writer) / slave (searcher) architecture, so we can't use
near-real-time search.
The IndexSearcher searches the documents on HDFS, which holds the
documents committed by the IndexWriter.
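
The visibility difference between (1) and (2) can be sketched with a toy model. This is only an illustration under stated assumptions: the class and method names (ToyIndex, nearRealTimeView, committedView) are hypothetical and are not Lucene's API. The "buffered" list stands in for the writer's in-memory uncommitted documents, and the "committed" list stands in for what has been persisted to storage such as HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, NOT Lucene's API.
class ToyIndex {
    // Stands in for documents already persisted to storage (e.g. HDFS).
    private final List<String> committed = new ArrayList<>();
    // Stands in for the writer's in-memory, uncommitted documents.
    private final List<String> buffered = new ArrayList<>();

    // Writer side: a new document lands in the in-memory buffer first.
    void add(String doc) {
        buffered.add(doc);
    }

    // Writer side: flush the buffer to storage.
    void commit() {
        committed.addAll(buffered);
        buffered.clear();
    }

    // Approach (1): the searcher shares the writer, so it sees
    // committed documents plus the writer's buffered ones.
    List<String> nearRealTimeView() {
        List<String> view = new ArrayList<>(committed);
        view.addAll(buffered);
        return view;
    }

    // Approach (2): the searcher reads only what is in storage.
    List<String> committedView() {
        return new ArrayList<>(committed);
    }
}

class NrtVisibilityDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc-1");
        System.out.println("before commit, (1) sees: " + index.nearRealTimeView());
        System.out.println("before commit, (2) sees: " + index.committedView());
        index.commit();
        System.out.println("after commit,  (2) sees: " + index.committedView());
    }
}
```

In this model a document added but not yet committed is visible only through approach (1), which is the trade-off described above.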

Which of these fits the MCF standard?

In Solr, Elasticsearch, Oak, and Sling, documents are searchable as
soon as clients post them. Oak and Sling are content
repositories with a search index built on a push model (a client posts a
document, which is then stored in the repository and indexed at the same
time), although they are bound by the JCR standard. MCF, on the other hand,
uses a pull model. The search application behind the output connector is
responsible for how soon documents become searchable. So according
to the MCF standard, the Lucene connector would have to choose (2) with the
plugin, but near-real-time search is lost. I intended (1) in the
v0.3 patch.

By the way, Alfresco, Liferay, and Drupal are also content repositories with
pull-model crawls, or so I have heard. They differ from MCF's document
version checking, though: they can index documents using something
like transaction information about document CRUD operations, which is
managed on the repository side, so documents are indexed and become
searchable soon. MCF is bound by repository-side limitations, e.g.
concurrent access limits (shared drive, web, Alfresco, CMIS, SharePoint…
almost all repositories?) or heavy CPU load on the repository side caused by
multi-threaded access. Unfortunately, I have sometimes heard from users
that MCF crawls are slow. Of course I knew, and explained to them, that this
is outside MCF's control, and then adjusted the repository side or customized
existing connectors. As my first approach to this, I had the idea to
index documents to local disk using Lucene, without any HTTP
transport, and to use near-real-time search over the writer's buffered
documents, i.e. approach (1). Currently, I have no idea how to address the
repository-side limitation, though.
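
The contrast between version checking and a repository-managed transaction log might be sketched as follows. This is a simplified illustration, not how MCF or any of the named repositories is actually implemented: CrawlSketch, Change, changedByVersionCheck, and changedSince are hypothetical names, and version strings and sequence numbers are assumed representations.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from MCF or any repository API.
class CrawlSketch {
    // One entry in an assumed repository-side change log.
    static class Change {
        final long seq;
        final String docId;
        Change(long seq, String docId) {
            this.seq = seq;
            this.docId = docId;
        }
    }

    // Version checking: every document must be visited and its current
    // version compared with the version recorded at the last indexing.
    static List<String> changedByVersionCheck(Map<String, String> repoVersions,
                                              Map<String, String> indexedVersions) {
        List<String> toIndex = new ArrayList<>();
        for (Map.Entry<String, String> e : repoVersions.entrySet()) {
            if (!e.getValue().equals(indexedVersions.get(e.getKey()))) {
                toIndex.add(e.getKey());
            }
        }
        return toIndex;
    }

    // Change log: only entries newer than the last seen sequence number
    // are read, so unchanged documents are never touched.
    static List<String> changedSince(List<Change> log, long lastSeq) {
        Set<String> ids = new LinkedHashSet<>();
        for (Change c : log) {
            if (c.seq > lastSeq) {
                ids.add(c.docId);
            }
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        Map<String, String> repo = new TreeMap<>();
        repo.put("a", "v1");
        repo.put("b", "v2");
        Map<String, String> indexed = new TreeMap<>();
        indexed.put("a", "v1");
        indexed.put("b", "v1");
        System.out.println(changedByVersionCheck(repo, indexed));

        List<Change> log = new ArrayList<>();
        log.add(new Change(1, "a"));
        log.add(new Change(2, "b"));
        System.out.println(changedSince(log, 1));
    }
}
```

Both calls report that only "b" needs reindexing, but the first has to look at every document in the repository while the second reads only the tail of the log.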

> Lucene Output Connector
> -----------------------
>                 Key: CONNECTORS-1219
>                 URL:
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
> An output connector that writes directly to a local Lucene index, not via a remote search engine. It
> would be nice if we could use Lucene's various APIs on the index directly, even though we could
> do the same thing with a Solr or Elasticsearch index. I assume we can do things like classification,
> categorization, and tagging, using e.g. the lucene-classification package.

This message was sent by Atlassian JIRA