manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: questions emerged designing a connector to index solrxml documents
Date Fri, 13 Jun 2014 16:29:47 GMT
Hi Matteo,

The framework takes care of the state change; do not try to do that within
the connector.  All you do is process the document(s) that are handed to
you.

So, for example, if you have the following document identifiers:

/toIndex/hd.xml (identifiable as a file)
/toIndex/hd.xml:0 (first document within hd.xml)
/toIndex/hd.xml:1 (second document within hd.xml)

etc.

Then, if you see a processDocuments() request for "/toIndex/hd.xml", you
pick up the XML and parse it, calling IProcessActivity.addDocumentReference()
for each solr document within.  During the same pass you also construct each
document identifier and extract the content to pass along as carrydown
information.  If you see a processDocuments() request for
"/toIndex/hd.xml:0", you simply pick up the content that is passed to you in
the carrydown and call activities.ingestDocument() with it.
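
As a rough illustration of the dispatch described above, here is a minimal
Java sketch.  It is not ManifoldCF code: FakeActivities and its methods
queueChild(), carriedContent() and ingest() are hypothetical stand-ins for
the framework's reference, carry-down retrieval and ingestion calls, and
parseSolrXml() is a stub that looks up pre-split documents instead of
reading real XML.  It also assumes that a trailing ":<number>" is enough to
tell a document inside a file apart from the file itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class CarrydownSketch {

    /** Hypothetical stand-in for the framework's activity object (not the real ManifoldCF interface). */
    static class FakeActivities {
        final List<String> queued = new ArrayList<>();               // child identifiers handed back to the framework
        final Map<String, String> carrydown = new LinkedHashMap<>(); // child identifier -> content carried down
        final Map<String, String> ingested = new LinkedHashMap<>();  // identifier -> ingested content

        void queueChild(String childId, String parentId, String content) {
            queued.add(childId);
            carrydown.put(childId, content);
        }

        String carriedContent(String childId) {
            return carrydown.get(childId);
        }

        void ingest(String id, String content) {
            ingested.put(id, content);
        }
    }

    // Assumption: "<file path>:<index>" marks a document inside a file; anything else is a file.
    private static final Pattern CHILD_ID = Pattern.compile(".+:\\d+");

    static boolean isChildIdentifier(String id) {
        return CHILD_ID.matcher(id).matches();
    }

    /** Hypothetical parser stub: returns the documents registered for a path, or an empty list. */
    static List<String> parseSolrXml(String path, Map<String, List<String>> files) {
        return files.getOrDefault(path, List.of());
    }

    /** Handles one identifier. Nothing in here touches document state. */
    static void processDocument(String id, Map<String, List<String>> files, FakeActivities activities) {
        if (isChildIdentifier(id)) {
            // Child: no parsing, just ingest what was carried down from the file.
            activities.ingest(id, activities.carriedContent(id));
        } else {
            // File: parse once, then register one child per solr document with its content attached.
            List<String> docs = parseSolrXml(id, files);
            for (int i = 0; i < docs.size(); i++) {
                activities.queueChild(id + ":" + i, id, docs.get(i));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> files =
            Map.of("/toIndex/hd.xml", List.of("<doc>first</doc>", "<doc>second</doc>"));
        FakeActivities activities = new FakeActivities();

        processDocument("/toIndex/hd.xml", files, activities);
        // Copy the queue first, because this demo plays the framework's role of calling back.
        for (String childId : new ArrayList<>(activities.queued)) {
            processDocument(childId, files, activities);
        }
        System.out.println(activities.ingested);
    }
}
```

In the sketch the file is parsed exactly once, when its own identifier is
processed; each child request only reads back what was attached to it.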

States do not *ever* come into connector design; the framework always takes
care of that.

Thanks,
Karl



On Fri, Jun 13, 2014 at 12:22 PM, Matteo Grolla <m.grolla@sourcesense.com>
wrote:

> Thanks very much, Karl.
>
> Can you also respond to the part regarding the state change?
> In the filesystem connector I don't see a method call that could change
> the state of the directory to "processed".
> I was thinking that
>         if processDocuments() is called with the identifier
>         "/toIndex/hd.xml"
>         and there are no exceptions,
>         this could be enough to put "/toIndex/hd.xml" in state "processed".
>         Am I right?
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 13 Jun 2014, at 17:54, Karl Wright wrote:
>
> > Hi Matteo,
> >
> > What I'd recommend is that you create a document identifier for each solr
> > document, and a different kind of document identifier for each xml file.
> > The xml file would then be like a "directory", and the solr document would
> > be like the "file".  You then can use carry-down support to allow the xml
> > file to be parsed only once.  A similar approach is used for the RSS
> > connector.
> >
> > Thanks,
> > Karl
> >
> >
> >
> > On Fri, Jun 13, 2014 at 11:48 AM, Matteo Grolla <m.grolla@sourcesense.com>
> > wrote:
> >
> >> Hi,
> >>        I'd like to develop a connector to index solr xml documents to a
> >> solr instance. By the way I'm absolutely willing to contribute the code.
> >> I have a few questions that I hope you can answer.
> >>
> >> I'm starting from the filesystem connector, since it seems the most
> >> similar.  A big difference, though, is that now a single file can
> >> represent many documents.
> >>
> >> How can I handle this efficiently?
> >> Suppose I leave the seeding phase as in the filesystem connector
> >> (getDocumentIdentifiers() method).
> >> In the document processing phase (processDocuments() method) I:
> >> 1) obtain a file path
> >> 2) parse the xml file
> >> 3) seed the ids of the solr documents and add a child relation from those
> >> ids to the file path.
> >>        Ex. I seed the identifier "hd-samsung-500GB", which identifies one
> >> of the documents contained in the file "/toIndex/hd.xml"
> >>                (let's pretend that hd.xml contains 50 solr documents)
> >> 4) when ManifoldCF calls processDocuments() with the identifier
> >> "hd-samsung-500GB"
> >>        I could follow the parent relation to "/toIndex/hd.xml"
> >>        reparse the file
> >>        create a RepositoryDocument using the information related to
> >> "hd-samsung-500GB"
> >>        ingest this RepositoryDocument
> >> …
> >> but this would be a very wasteful approach
> >>
> >> Ideally I'd like to parse the xml file only once
> >>
> >> I was thinking I could do the following in the seeding phase:
> >>        parse the file
> >>        create a RepositoryDocument for every solr document
> >>        serialize them in the document identifier
> >> …
> >> but I think this would make really ugly identifiers in the status
> >> reports.
> >> What do you think? Is there a better way to do it?
> >>
> >> Another thing that confuses me is how (ManifoldCF) documents change state.
> >> Ex.
> >>        In the filesystem connector I crawl 1 directory with 1 file.
> >>        Afterwards I look at the document status report and see that both
> >> the directory and the file have state "processed".
> >>        The document has been ingested, so I think the ingest method
> >> caused the status change.
> >>        What method caused the state change for the directory?
> >>
> >> --
> >> Matteo Grolla
> >> Sourcesense - making sense of Open Source
> >> http://www.sourcesense.com
> >>
> >>
>
>
