manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: questions emerged designing a connector to index solrxml documents
Date Fri, 13 Jun 2014 15:54:48 GMT
Hi Matteo,

What I'd recommend is that you create one kind of document identifier for
each XML file and a different kind for each Solr document. The XML file
then acts like a "directory", and each Solr document like a "file". You can
then use carry-down support to pass the parsed data from the XML file down
to its Solr documents, so that the XML file is parsed only once. A similar
approach is used for the RSS connector.
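
To make the idea concrete, here is a minimal, self-contained sketch of the
pattern. It does not use the ManifoldCF API: the class CarryDownSketch, the
"file:" and "doc:" identifier prefixes, the in-memory maps, and the second
sample document are all hypothetical stand-ins for illustration. In a real
connector the carried-down data would go through the activities object
passed to processDocuments(); as far as I recall that means
addDocumentReference() with data names and values on the parent side and
retrieveParentData() on the child side, but treat that as an assumption and
check the Javadoc.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CarryDownSketch {
    // Hypothetical prefixes that distinguish the two kinds of identifier.
    static final String FILE_PREFIX = "file:";
    static final String DOC_PREFIX = "doc:";

    // Stand-in for the filesystem: path -> XML content.
    final Map<String, String> files = new LinkedHashMap<>();
    // Stand-in for the framework's carry-down storage: child id -> fields.
    final Map<String, Map<String, String>> carryDown = new LinkedHashMap<>();
    // Stand-in for ingestion into the output connector.
    final Map<String, Map<String, String>> ingested = new LinkedHashMap<>();
    final Deque<String> queue = new ArrayDeque<>();
    int parseCount = 0;

    // Plays the role of processDocuments() for a single identifier.
    void process(String id) {
        if (id.startsWith(FILE_PREFIX)) {
            // "Directory" case: parse once, queue one child per Solr document.
            parseFile(id.substring(FILE_PREFIX.length()));
        } else if (id.startsWith(DOC_PREFIX)) {
            // "File" case: use the carried-down fields, no reparse needed.
            ingested.put(id, carryDown.get(id));
        }
    }

    void parseFile(String path) {
        parseCount++;
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(files.get(path))));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse " + path, e);
        }
        NodeList docs = dom.getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);
            NodeList fieldNodes = doc.getElementsByTagName("field");
            // Simplification: a multi-valued field keeps only its last value.
            Map<String, String> fields = new LinkedHashMap<>();
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                fields.put(field.getAttribute("name"), field.getTextContent());
            }
            String childId = DOC_PREFIX + fields.get("id");
            carryDown.put(childId, fields); // data travels with the reference
            queue.add(childId);
        }
    }

    // Seeds one identifier and processes the queue until it is empty.
    void crawl(String seed) {
        queue.add(seed);
        while (!queue.isEmpty()) {
            process(queue.poll());
        }
    }

    // Builds a tiny example: one XML file holding two Solr documents.
    static CarryDownSketch demo() {
        CarryDownSketch sketch = new CarryDownSketch();
        sketch.files.put("/toIndex/hd.xml",
                "<add>"
                + "<doc><field name=\"id\">hd-samsung-500GB</field>"
                + "<field name=\"name\">Samsung 500GB</field></doc>"
                + "<doc><field name=\"id\">hd-example-1TB</field>"
                + "<field name=\"name\">Example 1TB</field></doc>"
                + "</add>");
        sketch.crawl(FILE_PREFIX + "/toIndex/hd.xml");
        return sketch;
    }

    public static void main(String[] args) {
        CarryDownSketch sketch = demo();
        System.out.println("parses: " + sketch.parseCount);
        System.out.println("ingested: " + sketch.ingested.keySet());
    }
}
```

The point of the two identifier kinds is that the status report shows short,
readable identifiers, while the parsed field data rides along with the
parent-to-child reference instead of being serialized into the identifier.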

Thanks,
Karl



On Fri, Jun 13, 2014 at 11:48 AM, Matteo Grolla <m.grolla@sourcesense.com>
wrote:

> Hi,
>         I'd like to develop a connector to index solr xml documents to a
> solr instance. By the way I'm absolutely willing to contribute the code.
> I have a few questions that I hope you can answer.
>
> I'm starting from the filesystem connector, since it seems the most
> similar. A big difference, though, is that a single file can now represent
> many documents.
>
> How can I handle this efficiently?
> Suppose I leave the seeding phase as in the filesystem connector
> (getDocumentIdentifiers() method).
> In the document processing phase (processDocuments() method) I:
> 1) obtain a file path
> 2) parse the xml file
> 3) seed the ids of the solr documents and add a child relation from those
> ids to the file path.
>         Ex. I seed the identifier "hd-samsung-500GB", which identifies one
> of the documents contained in the file "/toIndex/hd.xml"
>                 let's pretend that hd.xml contains 50 solr documents
> 4) when ManifoldCF calls processDocuments() with the identifier
> "hd-samsung-500GB"
>         I could follow the parent relation to "/toIndex/hd.xml"
>         reparse the file
>         create a RepositoryDocument using the information related to
> "hd-samsung-500GB"
>         ingest this RepositoryDocument
> …
> but this would be a very wasteful approach
>
> Ideally I'd like to parse the xml file only once
>
> I was thinking I could do what follows in the seeding phase
>         parse the file
>         create a RepositoryDocument for every solrdocument
>         serialize them in the document identifier
> …
> but I think this would make really ugly identifiers in the status reports
> what do you think? Is there a better way to do it?
>
> Another thing that confuses me is how (manifold) documents change state
> Ex.
>         In the filesystem connector I crawl 1 directory with 1 file
>         afterwards I look at the document status report and see that both
> the directory and the file have state "processed"
>         the document has been ingested so I think the ingest method caused
> the status change
>         what method caused the state change for the directory?
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
>
