manifoldcf-dev mailing list archives

From Matteo Grolla <m.gro...@sourcesense.com>
Subject questions emerged designing a connector to index solrxml documents
Date Fri, 13 Jun 2014 15:48:32 GMT
Hi,
	I'd like to develop a connector that indexes Solr XML documents into a Solr instance. By the way,
I'm absolutely willing to contribute the code.
I have a few questions that I hope you can answer.

I'm starting from the filesystem connector, since it seems the most similar.
A big difference, though, is that a single file can now represent many documents.

How can I handle this efficiently?
Suppose I leave the seeding phase as it is in the filesystem connector (getDocumentIdentifiers() method).
Then, in the document processing phase (processDocuments() method), I:
1) obtain a file path
2) parse the XML file
3) seed the ids of the Solr documents and record each id as a child of the file path.
	Ex. I seed the identifier "hd-samsung-500GB", which identifies one of the documents contained
in the file "/toIndex/hd.xml"
		(let's pretend that hd.xml contains 50 Solr documents)
4) when ManifoldCF calls processDocuments() with the identifier "hd-samsung-500GB"
	I could follow the parent relation to "/toIndex/hd.xml"
	reparse the file
	create a RepositoryDocument using the information related to "hd-samsung-500GB"
	ingest this RepositoryDocument
…
but this would be very wasteful: with 50 Solr documents in hd.xml, the file would be parsed once to seed the ids and then again for each of the 50 children.

Ideally I'd like to parse the XML file only once.
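
To make "parse only once" concrete, here is a minimal sketch in plain Java (JDK classes only, no ManifoldCF API). SolrXmlFileCache, documentsFor() and parseCount are hypothetical names I made up for illustration. It assumes the usual <add><doc><field name="..."> layout, that the unique key field is called "id", and that there are no nested child documents. Whether an in-memory cache like this would actually survive between processDocuments() calls is something I have not verified.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ManifoldCF: parses each Solr XML file
// at most once and keeps the parsed documents keyed by their Solr id.
class SolrXmlFileCache {
    // file path -> (Solr document id -> (field name -> values))
    private final Map<String, Map<String, Map<String, List<String>>>> byFile = new HashMap<>();
    int parseCount = 0; // only here to show how often a parse really happens

    // Returns the documents of a file, parsing it only on the first request.
    Map<String, Map<String, List<String>>> documentsFor(String filePath, Supplier<InputStream> opener) {
        Map<String, Map<String, List<String>>> docs = byFile.get(filePath);
        if (docs == null) {
            docs = parse(opener.get());
            parseCount++;
            byFile.put(filePath, docs);
        }
        return docs;
    }

    // Assumption: the unique key field is named "id"; documents without it are skipped.
    static Map<String, Map<String, List<String>>> parse(InputStream in) {
        try {
            NodeList docNodes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in).getElementsByTagName("doc");
            Map<String, Map<String, List<String>>> result = new LinkedHashMap<>();
            for (int i = 0; i < docNodes.getLength(); i++) {
                NodeList fieldNodes = ((Element) docNodes.item(i)).getElementsByTagName("field");
                Map<String, List<String>> fields = new LinkedHashMap<>();
                for (int j = 0; j < fieldNodes.getLength(); j++) {
                    Element field = (Element) fieldNodes.item(j);
                    // A field name may repeat (multi-valued fields), so keep a list per name.
                    fields.computeIfAbsent(field.getAttribute("name"), k -> new ArrayList<>())
                          .add(field.getTextContent());
                }
                List<String> ids = fields.get("id");
                if (ids != null && !ids.isEmpty()) {
                    result.put(ids.get(0), fields);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse Solr XML", e);
        }
    }
}
```

With something like this, processing "hd-samsung-500GB" would look up "/toIndex/hd.xml" in the cache instead of reparsing the file.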

I was thinking I could do the following in the seeding phase:
	parse the file
	create a RepositoryDocument for every Solr document
	serialize them in the document identifier
…
but I think this would make for really ugly identifiers in the status reports.
What do you think? Is there a better way to do it?
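
For comparison, a lighter identifier could carry only the file path and the Solr id instead of the whole serialized document. The sketch below is plain Java, not ManifoldCF API; CompositeId and the '#' separator are hypothetical choices of mine. It assumes the connector knows which identifiers are composite, since a plain file path may itself contain '#'. It does not remove the need to get at the document content again, it only keeps the identifiers readable.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of ManifoldCF: builds identifiers of the
// form "<file path>#<url-encoded Solr id>" and splits them again.
class CompositeId {
    static final char SEPARATOR = '#';

    static String encode(String filePath, String solrDocId) {
        try {
            // URL-encoding turns any '#' inside the id into "%23",
            // so the last '#' in the result is always the separator.
            return filePath + SEPARATOR + URLEncoder.encode(solrDocId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Returns { filePath, solrDocId }.
    static String[] decode(String identifier) {
        int split = identifier.lastIndexOf(SEPARATOR);
        if (split < 0) {
            throw new IllegalArgumentException("Not a composite identifier: " + identifier);
        }
        try {
            return new String[] {
                identifier.substring(0, split),
                URLDecoder.decode(identifier.substring(split + 1), "UTF-8")
            };
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For the example above, CompositeId.encode("/toIndex/hd.xml", "hd-samsung-500GB") gives "/toIndex/hd.xml#hd-samsung-500GB", which would still be readable in a status report.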

Another thing that confuses me is how (ManifoldCF) documents change state.
Ex.
	In the filesystem connector I crawl 1 directory containing 1 file.
	Afterwards I look at the document status report and see that both the directory and the file
have state "processed".
	The file has been ingested, so I think the ingest method caused its status change.
	What method caused the state change for the directory?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

