manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
Date Thu, 16 Jul 2015 07:52:05 GMT


Karl Wright commented on CONNECTORS-1219:

bq. After that I thought mcf could become the best lowest-indexing-latency application
when we set up an mcf single process on each node. Each node has its own index.

Hi Abe-san,

Thank you, this makes it clearer what you are trying to do.  I will need to think carefully
about the whole problem for a while to be sure there is a solution that meets your goal.
 But it is worth mentioning that a separate process you communicate with over a socket
is not *necessarily* slow.  On Unix systems, at least, this can be very fast on localhost,
and even off localhost it can be made fast with proper network architecture.

The alternative is really to create a Lucene application that wraps MCF, rather than the other
way around.  I'd have to think carefully about that, but I believe you'd want to create your
own war, something like combined.war, which would include your Lucene service as well as the
crawler UI.  It's not ideal because the Lucene connector would not work like other connectors,
but there would at least be a possibility of deployment under Tomcat, and there would not
be a Lucene dependency for most people who aren't doing real-time work.

So, if using a sidecar process is where you choose to go:

My original idea was to serialize the document, not the LuceneClient or IndexWriter.  But
with RMI that would require two things: first, the document would have to be written to a temporary
disk file, and second, somewhere we would need a persistent LuceneClient class created in
the sidecar process.  That is not typical with RMI, and writing to disk is also slower than
using a stream over a socket.

The sidecar process would have Jetty anyway, though.  So you could have a servlet that listened
for three things: HTTP POST of a multipart document, HTTP DELETE given a document ID, and
HTTP GET to get status.  Streaming a multipart document using HttpClient from the Lucene connector
would be straightforward and would not involve a temporary disk file.  On the sidecar
side, I also believe you would be able to wrap the incoming post and its metadata in Reader
objects if you were careful.  The LuceneClient would live in the sidecar Jetty process
only, and could be initialized as part of servlet initialization, so no serialization would
be needed.  The Lucene connector would only have to stream the document using HttpClient.

Some coding would be needed to figure out which of these possibilities works best for your
purpose.  But I think those are your main choices.



> Lucene Output Connector
> -----------------------
>                 Key: CONNECTORS-1219
>                 URL:
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
> An output connector that writes to a local Lucene index directly, not via a remote search engine. It
would be nice if we could use Lucene's various APIs on the index directly, even though we could
do the same thing against a Solr or Elasticsearch index. I assume we can do something with classification,
categorization, and tagging, using e.g. the lucene-classification package.

This message was sent by Atlassian JIRA
