Hi,
Some time ago there were some mails about semantic searching and about
using Lucene ( http://jakarta.apache.org/lucene ) as an indexing engine.
For everyone interested in indexing and searching XML, here are some notes
about the implementation, which is just at the beginning:
I have now implemented some Avalon components for:
1) Crawling, using cocoon-view=content and cocoon-view=links
2) Indexing XML documents; as a sample I took the /cocoon/documents URI
space.
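To illustrate step 1, here is a minimal sketch of a breadth-first crawl
driven by the links view. It is not the actual component: the class name
ViewCrawler, the prefix filter, and the linkFetcher function (which stands
in for whatever fetches a URI and returns the links it reports) are all
assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Cocoon: walks a URI space breadth-first.
final class ViewCrawler {

    // Appends the cocoon-view request parameter to a URI.
    static String withView(String uri, String view) {
        return uri + (uri.contains("?") ? "&" : "?") + "cocoon-view=" + view;
    }

    // Returns every URI reachable from startUri that starts with prefix,
    // in the order it was discovered. linkFetcher is an assumption: it takes
    // a URI (already carrying cocoon-view=links) and returns the link list.
    static List<String> crawl(String startUri, String prefix,
                              Function<String, List<String>> linkFetcher) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(startUri);
        queue.add(startUri);
        while (!queue.isEmpty()) {
            String uri = queue.remove();
            for (String link : linkFetcher.apply(withView(uri, "links"))) {
                // Stay inside the URI space and visit each URI only once.
                if (link.startsWith(prefix) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Each URI returned by crawl would then be requested again with
withView(uri, "content") to obtain the XML that gets indexed.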
The Lucene documents have the following fields:
* url - the URL of the document
* body - the raw text of all elements of the document
* Moreover, each element and each attribute of an element generates a
field, too.
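The field layout above can be sketched roughly as follows. This is only an
illustration using the JDK's SAX parser and a plain map instead of a Lucene
document. The class name XmlFieldFlattener is made up, and I am assuming
that an element field holds the direct text of that element and that an
attribute field is named element@attribute.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: flattens one XML document into name -> text fields.
final class XmlFieldFlattener extends DefaultHandler {
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();
    private final Deque<String> openElements = new ArrayDeque<>();

    private void append(String field, String text) {
        fields.computeIfAbsent(field, k -> new StringBuilder()).append(text);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.push(qName);
        // One field per attribute, named element@attribute (assumed naming).
        for (int i = 0; i < atts.getLength(); i++) {
            append(qName + "@" + atts.getQName(i), atts.getValue(i) + " ");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so append it untrimmed.
        String text = new String(ch, start, length);
        append("body", text);
        if (!openElements.isEmpty()) {
            append(openElements.peek(), text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // A separator keeps the text of neighbouring elements apart.
        append("body", " ");
        append(openElements.pop(), " ");
    }

    // Note: an element that is itself named "url" or "body" would collide
    // with the two fixed fields; this sketch does not handle that case.
    static Map<String, String> flatten(String url, String xml) {
        XmlFieldFlattener handler = new XmlFieldFlattener();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML for " + url, e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        result.put("url", url);
        handler.fields.forEach((name, text) -> {
            String normalised = text.toString().trim().replaceAll("\\s+", " ");
            if (!normalised.isEmpty()) {
                result.put(name, normalised); // drop fields without any text
            }
        });
        return result;
    }
}
```

In the real indexer each map entry would become one field of the Lucene
document; the map only keeps the sketch free of library dependencies.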
Thus searching for "Introduction" searches the body field by default.
Searching for "s1@title:Introduction" matches only documents that have a
title attribute on an s1 element matching Introduction.
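The field selection in those two queries can be illustrated with a small
stand-alone sketch. It is not Lucene's query parser, only a simplified
model of the "field:term" convention with body as the default field. The
class name FieldQuery and the whole-word, case-insensitive matching rule
are assumptions for this example.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of a single-term query with an optional field prefix.
final class FieldQuery {
    static final String DEFAULT_FIELD = "body";

    final String field;
    final String term;

    private FieldQuery(String field, String term) {
        this.field = field;
        this.term = term;
    }

    // "s1@title:Introduction" -> field s1@title, term Introduction
    // "Introduction"          -> field body,     term Introduction
    static FieldQuery parse(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0) {
            return new FieldQuery(DEFAULT_FIELD, query.trim());
        }
        return new FieldQuery(query.substring(0, colon).trim(),
                              query.substring(colon + 1).trim());
    }

    // True if the chosen field exists and contains the term as a whole word.
    boolean matches(Map<String, String> documentFields) {
        String value = documentFields.get(field);
        if (value == null) {
            return false;
        }
        String wanted = term.toLowerCase(Locale.ROOT);
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("\\s+"))
                     .anyMatch(wanted::equals);
    }
}
```

With this model a document whose only mention of "Introduction" is the
title attribute of an s1 element matches the second query but not the
first, because attribute values are not part of the body field.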
I have a question; maybe someone can help:
* How can I avoid generating a full HTTP request? As the crawler sits
inside Cocoon and indexes a URI space of the current Cocoon engine,
there should be(?) some way of accessing the sitemap directly and
forwarding the crawling request to it, which would speed up the
indexing step.
Any comments are welcome
best regards bernhard