manifoldcf-dev mailing list archives

From Matteo Grolla <m.gro...@sourcesense.com>
Subject Re: questions emerged designing a connector to index solrxml documents
Date Fri, 13 Jun 2014 16:22:35 GMT
Thanks very much, Karl.

Could you also answer the part about the state change?
In the filesystem connector I don't see any method call that could change the state of the directory
to "processed".
I was thinking that
	if processDocuments() is called with the identifier "/toIndex/hd.xml"
	and no exception is thrown,
	that alone could be enough to put "/toIndex/hd.xml" in state "processed".
	Am I right?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 17:54, Karl Wright wrote:

> Hi Matteo,
> 
> What I'd recommend is that you create a document identifier for each solr
> document, and a different kind of document identifier for each xml file.
> The xml file would then be like a "directory", and the solr document would
> be like the "file".  You then can use carry-down support to allow the xml
> file to be parsed only once.  A similar approach is used for the RSS
> connector.
> 
> Thanks,
> Karl
> 
> 
> 
> On Fri, Jun 13, 2014 at 11:48 AM, Matteo Grolla <m.grolla@sourcesense.com>
> wrote:
> 
>> Hi,
>>        I'd like to develop a connector to index solr xml documents to a
>> solr instance. By the way I'm absolutely willing to contribute the code.
>> I have a few questions that I hope you can answer.
>> 
>> I'm starting from the filesystem connector, since it seems the most similar.
>> A big difference, though, is that now a single file can represent many
>> documents.
>> 
>> How can I handle this efficiently?
>> Suppose I leave the seeding phase as in the filesystem connector
>> (getDocumentIdentifiers() method).
>> In the document processing phase (processDocuments() method) I:
>> 1) obtain a file path
>> 2) parse the xml file
>> 3) seed the ids of the solr documents and add a child relation from those
>> ids to the file path.
>>        Ex. I seed the identifier "hd-samsung-500GB", which identifies one
>> of the documents contained in the file "/toIndex/hd.xml"
>>                let's pretend that hd.xml contains 50 solr documents
>> 4) when ManifoldCF calls processDocuments() with the identifier
>> "hd-samsung-500GB"
>>        I could follow the parent relation to "/toIndex/hd.xml"
>>        reparse the file
>>        create a RepositoryDocument using the information related to
>> "hd-samsung-500GB"
>>        ingest this RepositoryDocument
>> …
>> but this would be a very wasteful approach
>> 
>> Ideally I'd like to parse the xml file only once
>> 
>> I was thinking I could do the following in the seeding phase:
>>        parse the file
>>        create a RepositoryDocument for every solr document
>>        serialize them in the document identifier
>> …
>> but I think this would make really ugly identifiers in the status reports.
>> What do you think? Is there a better way to do it?
>> 
>> Another thing that confuses me is how (ManifoldCF) documents change state.
>> Ex.
>>        In the filesystem connector I crawl 1 directory with 1 file.
>>        Afterwards I look at the document status report and see that both
>> the directory and the file have state "processed".
>>        The document has been ingested, so I think the ingest method caused
>> the status change.
>>        What method caused the state change for the directory?
>> 
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>> 
>> 
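
Karl's suggestion of one kind of document identifier per xml file and a different kind per solr document could be sketched roughly as below. This is only an illustration: the class name SolrXmlIds, the "file:" and "doc:" prefixes, and the '#' separator are all hypothetical and not part of ManifoldCF, and the sketch assumes file paths never contain '#'.

```java
// Hypothetical identifier scheme; none of these names come from ManifoldCF.
final class SolrXmlIds {
    // Assumed prefixes that tell the two kinds of identifier apart.
    static final String FILE_PREFIX = "file:";
    static final String DOC_PREFIX = "doc:";
    // Assumed separator between file path and solr document id.
    // The sketch relies on file paths not containing this character.
    static final char SEPARATOR = '#';

    private SolrXmlIds() {}

    // Identifier for an xml file (the "directory" in Karl's analogy).
    static String forFile(String path) {
        return FILE_PREFIX + path;
    }

    // Identifier for one solr document inside a file (the "file" in the analogy).
    static String forDoc(String path, String solrId) {
        return DOC_PREFIX + path + SEPARATOR + solrId;
    }

    static boolean isFile(String identifier) {
        return identifier.startsWith(FILE_PREFIX);
    }

    static boolean isDoc(String identifier) {
        return identifier.startsWith(DOC_PREFIX);
    }

    // Path of the xml file, for either kind of identifier.
    static String filePath(String identifier) {
        if (isFile(identifier)) {
            return identifier.substring(FILE_PREFIX.length());
        }
        String rest = docPart(identifier);
        return rest.substring(0, rest.indexOf(SEPARATOR));
    }

    // Solr document id; only meaningful for "doc:" identifiers.
    static String solrId(String identifier) {
        String rest = docPart(identifier);
        return rest.substring(rest.indexOf(SEPARATOR) + 1);
    }

    // Strips the "doc:" prefix and checks that the separator is present.
    private static String docPart(String identifier) {
        if (!isDoc(identifier)) {
            throw new IllegalArgumentException("Not a doc identifier: " + identifier);
        }
        String rest = identifier.substring(DOC_PREFIX.length());
        if (rest.indexOf(SEPARATOR) < 0) {
            throw new IllegalArgumentException("Missing separator: " + identifier);
        }
        return rest;
    }
}
```

With a scheme like this, a processing method can branch on isFile() versus isDoc() and still recover the parent path "/toIndex/hd.xml" from a solr document identifier without a separate lookup.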


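
The "parse the xml file only once" goal from the thread could look something like the following. It is a minimal sketch using only the JDK's DOM parser, and it assumes the files use Solr's XML update format (an add element containing doc elements with named field children) with flat, non-nested docs. The class name SolrXmlSplitter and the idField parameter are hypothetical. How the extracted values reach the per-document processing step (for example through the carry-down support Karl mentions) is deliberately left out, since that depends on the ManifoldCF API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; not part of ManifoldCF or Solr.
final class SolrXmlSplitter {
    private SolrXmlSplitter() {}

    // Parses the xml text once and returns, in file order,
    // solr document id -> (field name -> list of values).
    // Docs that lack the id field are skipped.
    static Map<String, Map<String, List<String>>> split(String xml, String idField) {
        Document dom;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            dom = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse solr xml", e);
        }

        Map<String, Map<String, List<String>>> docs = new LinkedHashMap<>();
        NodeList docNodes = dom.getElementsByTagName("doc");
        for (int i = 0; i < docNodes.getLength(); i++) {
            Element doc = (Element) docNodes.item(i);
            Map<String, List<String>> fields = new LinkedHashMap<>();
            NodeList fieldNodes = doc.getElementsByTagName("field");
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                // A field name may repeat (multi-valued field), so collect a list.
                fields.computeIfAbsent(field.getAttribute("name"), k -> new ArrayList<>())
                      .add(field.getTextContent());
            }
            List<String> ids = fields.get(idField);
            if (ids != null && !ids.isEmpty()) {
                docs.put(ids.get(0), fields);
            }
        }
        return docs;
    }
}
```

For a file such as "/toIndex/hd.xml" with 50 solr documents, a single call to split() yields all 50 field maps, so nothing needs to be reparsed when each solr document identifier is processed later.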