nutch-dev mailing list archives

From "Jérôme Charron" <jerome.char...@gmail.com>
Subject Content-Type inconsistency?
Date Mon, 10 Apr 2006 21:08:29 GMT
It seems there is an inconsistency with content-type handling in Nutch:

1. The protocol-level content-type header is added to the content's metadata.
2. The content-type is then checked/guessed while instantiating the Content
object and stored in a private field
(at this point, the Content object can hold two different content-types).
3. The Content's private content-type field is used to select the appropriate
parser.
4. Once the Parse object is constructed, the Content is no longer used (=> the
guessed content-type is lost).
5. The index-more plugin then indexes the raw content-type, not the guessed
one.
6. As a consequence, the content-type displayed in more.jsp is the raw one,
and the one used when querying by type is the raw one too.
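
The flow above can be sketched roughly as follows. This is a simplified
illustration, not Nutch's actual code: the Content and Parse classes here are
hypothetical stand-ins, and the guess() rule (trusting a .pdf extension over
the header) is an assumption made only to produce a mismatch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the six steps; not the real Nutch classes.
class ContentTypeFlowSketch {

    static final class Content {
        final Map<String, String> metadata = new HashMap<>();
        private final String contentType; // guessed type, kept in a private field

        Content(String url, String headerType) {
            metadata.put("Content-Type", headerType);   // step 1: raw header type
            this.contentType = guess(url, headerType);  // step 2: guessed type
        }

        String getContentType() {
            return contentType;
        }

        // Assumed guessing rule, for illustration only.
        static String guess(String url, String headerType) {
            return url.endsWith(".pdf") ? "application/pdf" : headerType;
        }
    }

    static final class Parse {
        final Map<String, String> metadata;

        Parse(Content content) {
            // step 4: only the metadata is carried over; the guessed type is lost
            this.metadata = new HashMap<>(content.metadata);
        }
    }

    // step 3: parser selection relies on the guessed type
    static String parserFor(Content content) {
        return "parser for " + content.getContentType();
    }

    // step 5: indexing only sees the raw type stored in the metadata
    static String indexedType(Parse parse) {
        return parse.metadata.get("Content-Type");
    }

    public static void main(String[] args) {
        Content content = new Content("http://example.com/report.pdf", "text/plain");
        Parse parse = new Parse(content);
        System.out.println(parserFor(content));  // uses the guessed type
        System.out.println(indexedType(parse));  // uses the raw header type
    }
}
```

In this sketch a PDF served with a text/plain header is parsed as
application/pdf but indexed (and later displayed and queried) as text/plain.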

Wouldn't it be better to always use the guessed content-type throughout the
process?
(except in cache.jsp, where the raw one should be used)
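
One possible way to do this, sketched under assumptions: keep the raw header
value under a separate metadata key and store the guessed type under the key
that downstream consumers read. The key name X-Raw-Content-Type and the helper
methods are hypothetical, not existing Nutch names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposal; key names and helpers are assumptions.
class GuessedTypeEverywhereSketch {

    static final String CONTENT_TYPE = "Content-Type";
    static final String RAW_CONTENT_TYPE = "X-Raw-Content-Type"; // assumed key

    // Store both values so that neither is lost once the Content is discarded.
    static Map<String, String> buildMetadata(String headerType, String guessedType) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(RAW_CONTENT_TYPE, headerType);
        metadata.put(CONTENT_TYPE, guessedType);
        return metadata;
    }

    // Parsing, indexing, display and type queries would all read the guessed type.
    static String typeForIndexing(Map<String, String> metadata) {
        return metadata.get(CONTENT_TYPE);
    }

    // Only the cached-page view would read the raw header value.
    static String typeForCache(Map<String, String> metadata) {
        return metadata.get(RAW_CONTENT_TYPE);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = buildMetadata("text/plain", "application/pdf");
        System.out.println(typeForIndexing(metadata));
        System.out.println(typeForCache(metadata));
    }
}
```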

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
