Wow, sounds very cool. How do you feel about sharing/donating that code?
I'd be very interested in working on that.
Just don't expect too much, it is just a first shot. I hope you manage to make it run at your site.
A lot of stuff is not configurable yet, as I have not had time to implement it.

* install a lucene.jar from the Lucene site
* the Lucene index is created in <work-dir>/index
* create the index by requesting createindex.xsp
* search the index by requesting searchindex.xsp and entering a query string; I have skipped implementing paging for queries with lots of matches
* see statistics about the created index using statisticindex.xsp; these may help you search more effectively
* load my.roles to declare the new Avalon components for indexing and searching

DocumentHandler parses the XML document, implements the XML to Lucene Document generation,
and creates the fields of the Lucene document.
The Lucene document does NOT store any XML content.
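
To make the field generation a bit more concrete, here is a minimal sketch of the idea. It is not the actual DocumentHandler code: the class name FieldCollector and its methods are made up for illustration, and it collects the fields into a plain Map instead of a Lucene Document so that it runs without lucene.jar. The field naming (attributes as "element@attribute", all text in "body") is only assumed from the search help further down in this message.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: turns SAX events into field name -> values. */
final class FieldCollector extends DefaultHandler {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();

    private void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumed convention: every attribute becomes a field "element@attribute".
        for (int i = 0; i < atts.getLength(); i++) {
            add(qName + "@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Keep the text of neighbouring elements apart.
        text.append(' ');
    }

    @Override
    public void endDocument() {
        // Only the character data ends up in "body"; no markup is kept.
        String body = text.toString().trim().replaceAll("\\s+", " ");
        if (!body.isEmpty()) {
            add("body", body);
        }
    }

    Map<String, List<String>> getFields() {
        return fields;
    }

    /** Parses an XML string and returns the collected fields. */
    static Map<String, List<String>> parse(String xml) {
        try {
            FieldCollector handler = new FieldCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getFields();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

In a real indexer each map entry would be turned into a field of the Lucene document instead; how the fields are tokenized and whether they are stored is left out here.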

Perhaps you will find a better design. Currently I have not implemented any SitemapComponents, just
pure Avalon components; the implementations are all named "Simple*Impl.java", the interfaces "*.java".
Perhaps you will find a design that fits the components into the generator, transformer, serializer pattern.
I thought about it but gave up and came up with this more general solution. Perhaps
even the ParentCM may be used?

To give you a feeling for searching, here is what the search page looks like:

Index Search

Search Help
  • free AND "text search"
    Search for documents containing "free" and the phrase "text search".
  • +text search
    Search for documents containing "text" and preferentially containing "search".
  • giants -football
    Search for "giants" but omit documents containing "football".
  • body:john
    Search for documents containing "john" in the body field. The field "body" is used by default, thus query "body:john" is equivalent to query "john".
  • s1@title:cocoon
    Search for documents containing "cocoon" in the field s1@title, i.e. searching in the title attribute of the s1 element of the XML document.
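
For readers not used to the "+" and "-" prefixes in the search help above, here is a toy model of how required, prohibited, and optional terms combine. It is only an illustration under simplifying assumptions: the class ToyQuery is made up, it is not Lucene's query parser, and it ignores AND, phrases, field prefixes, and scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical toy matcher for "+required -prohibited optional" term queries. */
final class ToyQuery {
    private final List<String> required = new ArrayList<>();
    private final List<String> prohibited = new ArrayList<>();
    private final List<String> optional = new ArrayList<>();

    ToyQuery(String query) {
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.startsWith("+") && token.length() > 1) {
                required.add(normalize(token.substring(1)));
            } else if (token.startsWith("-") && token.length() > 1) {
                prohibited.add(normalize(token.substring(1)));
            } else {
                optional.add(normalize(token));
            }
        }
    }

    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /** Returns true if the text satisfies the query. */
    boolean matches(String text) {
        Set<String> words = new HashSet<>(Arrays.asList(normalize(text).split("\\W+")));
        // Every "+term" must be present.
        if (!words.containsAll(required)) {
            return false;
        }
        // No "-term" may be present.
        for (String term : prohibited) {
            if (words.contains(term)) {
                return false;
            }
        }
        // Without any "+term", at least one plain term has to match.
        if (required.isEmpty()) {
            for (String term : optional) {
                if (words.contains(term)) {
                    return true;
                }
            }
            return false;
        }
        // With "+term" present, plain terms only influence ranking (not modelled here).
        return true;
    }
}
```

For example, new ToyQuery("giants -football").matches("Giants football team") is false, while new ToyQuery("+text search").matches("plain text") is true.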

SearchResult: Total Hits: 13

Index Statistic

Score Count URL
100% 0 http://localhost:8080/cocoon/documents/userdocs/generators/jsp-generator.html
34% 1 http://localhost:8080/cocoon/documents/userdocs/generators/generators.html
27% 2 http://localhost:8080/cocoon/documents/ctwig/ctwig-gettingstarted.html
27% 3 http://localhost:8080/cocoon/documents/ctwig/ctwig-basic02.html
27% 4 http://localhost:8080/cocoon/documents/ctwig/ctwig-basic02.html
19% 5 http://localhost:8080/cocoon/documents/userdocs/concepts/index.html
16% 6 http://localhost:8080/cocoon/documents/ctwig/ctwig-why.html
10% 7 http://localhost:8080/cocoon/documents/userdocs/xsp/logicsheet-concepts.html
8% 8 http://localhost:8080/cocoon/documents/userdocs/xsp/logicsheet.html
7% 9 http://localhost:8080/cocoon/documents/faq.html

The Cocoon CLI does crawling internally without the overhead of HTTP

Follow the flow at Cocoon.main() to know how that is done.
I will check it out...

bye bernhard