nutch-dev mailing list archives

From "Jérôme Charron" <>
Subject Re: Content-Type inconsistency?
Date Thu, 13 Apr 2006 19:57:14 GMT
I would like to come back to this issue:
The Content object holds two content-types:
1. The raw content-type from the protocol layer (http header in case of
http) in the Content's metadata
2. The guessed content-type in a private field content-type.

When a ParseData object is created, it takes only the Content's metadata.
So the ParseData can only access the raw content-type and not the guessed one.

What I suggest is one of the following:
1. Add a content-type parameter to the ParseData constructors (so that
Parsers can pass the guessed content-type to ParseData).
2. The Content object stores the guessed content-type in its metadata, in a
special attribute named for instance GUESSED_CONTENT_TYPE, so that the
ParseData can access it.

I think 1. is really the cleanest way to implement this, but a lot of code
is impacted => all the parsers.
Solution 2. has no impact on the APIs, so the code changes are very small.
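
To make the two options concrete, here is a rough sketch. It uses simplified,
hypothetical stand-ins for Content and ParseData (plain string maps for the
metadata, made-up constructor signatures and accessor names), not the actual
Nutch classes, so it only illustrates the shape of each change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins; these are NOT the real Nutch classes.
class ContentTypeSketch {

    // Metadata key for the raw type reported by the protocol layer.
    static final String CONTENT_TYPE = "Content-Type";
    // Attribute name suggested in solution 2.
    static final String GUESSED_CONTENT_TYPE = "GUESSED_CONTENT_TYPE";

    static class Content {
        private final Map<String, String> metadata = new HashMap<>();
        private final String contentType; // the guessed content-type

        Content(String rawType, String guessedType) {
            metadata.put(CONTENT_TYPE, rawType);
            this.contentType = guessedType;
            // Solution 2: also expose the guessed type through the metadata.
            metadata.put(GUESSED_CONTENT_TYPE, guessedType);
        }

        Map<String, String> getMetadata() { return metadata; }
        String getContentType() { return contentType; }
    }

    static class ParseData {
        private final Map<String, String> metadata;
        private final String contentType;

        // Solution 1: the parser passes the guessed type explicitly.
        ParseData(Map<String, String> metadata, String contentType) {
            this.metadata = metadata;
            this.contentType = contentType;
        }

        // Solution 2: constructor unchanged, the guessed type is read
        // from the special metadata attribute.
        ParseData(Map<String, String> metadata) {
            this(metadata, metadata.get(GUESSED_CONTENT_TYPE));
        }

        String getContentType() { return contentType; }
        String getRawContentType() { return metadata.get(CONTENT_TYPE); }
    }

    public static void main(String[] args) {
        // Server claims text/plain, but the content was guessed to be HTML.
        Content content = new Content("text/plain", "text/html");

        ParseData viaConstructor =
            new ParseData(content.getMetadata(), content.getContentType());
        ParseData viaMetadata = new ParseData(content.getMetadata());

        System.out.println(viaConstructor.getContentType());
        System.out.println(viaMetadata.getContentType());
        System.out.println(viaMetadata.getRawContentType());
    }
}
```

With solution 1 every parser has to be changed to call the two-argument
constructor; with solution 2 only Content and ParseData change.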

Suggestions? Comments?


