cocoon-dev mailing list archives

From "Conal Tuohy" <Conal.Tu...@vuw.ac.nz>
Subject RE: Custom extensions - to be made available if possible
Date Thu, 09 Sep 2004 07:24:44 GMT
Antonio Fiol Bonnín wrote:

> a) Refactoring SimpleLuceneXMLIndexerImpl so that its private method
> indexDocument is not private, and taking it to an external component.
> 
> b) Creating a PDFGenerator (in the cocoon sense of generator, 
> of course).
> 
> Option (a) seems to be giving us more headaches than pleasure, and
> option (b) seems cleaner to a certain point. Option (b) would allow to
> follow links in the PDF file, if developed to that point.

I like option (b) too. You could start with plain text, but it could later be developed to
extract basic formatting, hyperlinks, bookmarks (the table of contents), images, etc.

> However, option (b) implies choosing a format for its output (which?),

An interesting question. Perhaps HTML, beginning with an implementation which produces:

<html>
   <head/>
   <body>
      blah blah blah<br/>
      blah blah<br/>
      <br class="page"/>
      ... 
   </body>
</html>

Later you (or someone else) could add extra things as they need them. 
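
As a rough sketch of that first step: a Cocoon generator ultimately emits SAX events, so the
plain-text version could boil down to something like the following. The PDF text extraction
itself is assumed to have happened already. The class name PdfTextToHtmlSax and its methods
are hypothetical, not part of Cocoon, and placing the <br class="page"/> marker between pages
(not after the last one) is just one reading of the example above. Only standard JDK classes
are used.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper: turns already-extracted page text into SAX events. */
public class PdfTextToHtmlSax {

    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    /** Emits html/head/body, one br per line, and a br class="page" between pages. */
    public static void emit(List<List<String>> pages, ContentHandler out) throws SAXException {
        out.startDocument();
        out.startElement("", "html", "html", NO_ATTRS);
        out.startElement("", "head", "head", NO_ATTRS);
        out.endElement("", "head", "head");
        out.startElement("", "body", "body", NO_ATTRS);
        for (int p = 0; p < pages.size(); p++) {
            for (String line : pages.get(p)) {
                char[] chars = line.toCharArray();
                out.characters(chars, 0, chars.length);
                out.startElement("", "br", "br", NO_ATTRS);
                out.endElement("", "br", "br");
            }
            if (p < pages.size() - 1) {
                // Page boundary marker (assumption: only between pages).
                AttributesImpl pageBreak = new AttributesImpl();
                pageBreak.addAttribute("", "class", "class", "CDATA", "page");
                out.startElement("", "br", "br", pageBreak);
                out.endElement("", "br", "br");
            }
        }
        out.endElement("", "body", "body");
        out.endElement("", "html", "html");
        out.endDocument();
    }

    /** Convenience for trying it outside Cocoon: serializes the events to a string. */
    public static String serialize(List<List<String>> pages) throws Exception {
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        handler.setResult(new StreamResult(writer));
        emit(pages, handler);
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> pages = List.of(
                List.of("blah blah blah", "blah blah"),
                List.of("second page"));
        System.out.println(serialize(pages));
    }
}
```

Inside a real generator the ContentHandler would be the one Cocoon supplies for the pipeline,
and the serialize() helper would not be needed.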

Alternatively, you could use a more PDF-oriented DTD.

I have used a simple freeware tool called pdftohtml which produces XML according to the following
DTD:

<!ELEMENT pdf2xml (page+)>
<!ELEMENT page (fontspec*, text*)>
<!ATTLIST page
	number CDATA #REQUIRED
	position CDATA #REQUIRED
	top CDATA #REQUIRED
	left CDATA #REQUIRED
	height CDATA #REQUIRED
	width CDATA #REQUIRED
>
<!ELEMENT fontspec EMPTY>
<!ATTLIST fontspec
	id CDATA #REQUIRED
	size CDATA #REQUIRED
	family CDATA #REQUIRED
	color CDATA #REQUIRED
>
<!ELEMENT text (#PCDATA | b | i)*>
<!ATTLIST text
	top CDATA #REQUIRED
	left CDATA #REQUIRED
	width CDATA #REQUIRED
	height CDATA #REQUIRED
	font CDATA #REQUIRED
>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
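
To give an idea of how output in this format could be consumed (for indexing, say), here is a
minimal sketch that collects the text of each page using the JDK's DOM parser. The class name
Pdf2XmlText and the sample document in main() are made up for illustration. Only the element
names come from the DTD above. Joining the text elements with newlines is an assumption about
what an indexer would want.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: extracts plain text per page from pdf2xml-style XML. */
public class Pdf2XmlText {

    /** Returns one string per page element, with its text elements joined by newlines. */
    public static List<String> pageTexts(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        List<String> pages = new ArrayList<>();
        NodeList pageNodes = doc.getElementsByTagName("page");
        for (int i = 0; i < pageNodes.getLength(); i++) {
            Element page = (Element) pageNodes.item(i);
            NodeList texts = page.getElementsByTagName("text");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                if (j > 0) {
                    sb.append('\n');
                }
                // getTextContent() also includes text inside nested b and i elements.
                sb.append(texts.item(j).getTextContent());
            }
            pages.add(sb.toString());
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document following the DTD's element and attribute names.
        String xml =
            "<pdf2xml>"
          + "<page number=\"1\" position=\"absolute\" top=\"0\" left=\"0\" height=\"100\" width=\"100\">"
          + "<fontspec id=\"0\" size=\"12\" family=\"Times\" color=\"#000000\"/>"
          + "<text top=\"1\" left=\"1\" width=\"10\" height=\"10\" font=\"0\">Hello <b>bold</b></text>"
          + "<text top=\"2\" left=\"1\" width=\"10\" height=\"10\" font=\"0\">second line</text>"
          + "</page>"
          + "</pdf2xml>";
        for (String page : pageTexts(xml)) {
            System.out.println(page);
        }
    }
}
```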

> and also poses some problems wrt. the sitemap. Until now, we have a
> pipeline using a reader to read pdf files (static, from disk). And we
> would need a generator to be invoked instead for the content and links
> views. How can we do that? Maybe with a selector? But that does not
> seem very clean. Any hints there?

I'm not sure. A selector might work. I hope someone else can help you with that. But NB there's
also another way to build a Lucene index: using the LuceneIndexTransformer rather than crawling
the site and using views. This technique would certainly work with option (b), a PDFGenerator,
but I'm not sure that it would integrate nicely with option (a), since the LuceneIndexTransformer
is a transformer and therefore requires XML input. So if you could resolve the sitemap issue with
option (b), then it would work with both indexing techniques, whereas option (a) could only ever
work with the crawler, I think.

Cheers

Con
