nutch-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann
Date Thu, 15 Sep 2005 03:34:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by ChrisMattmann:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal

------------------------------------------------------------------------------
  = Parser Factory Improvement Proposal =
  
+ Jerome Charron <jerome.charron@gmail.com>, 
+ Sébastien Le Callonnec <slc_ie@yahoo.ie>, 
+ Chris A. Mattmann <chris.mattmann@jpl.nasa.gov>
+ 
+ Wednesday, September 14th, 2005
+ 
+ '''DRAFT'''
  
  == Summary of Issue ==
  Currently Nutch provides a plugin mechanism wherein plugins register certain metadata about
themselves, including their id, classname, and so forth. In particular, the set of parsing
plugins register which contentTypes and file suffixes they can support with a PluginRepository.
@@ -20, +27 @@

  We propose that the “plugin preference list” should be a separate file that lives in
$NUTCH_HOME/conf called “parse-plugins.xml”. The format of the file (full DTD to be developed
during coding) should be something like: {{{
  
  <parse-plugins>
-   <default pluginname=”parse-text”/>
-   <fileType name=”powerpoint”>
-    <mimeTypes>
-     <mimeType name=”application/pdf” />
-     <mimeType name=”application/x-pdf” />
-     …
-    </mimeTypes>
  
-    <plugins>
+   <mime-type name="*">
+       <plugin id="parse-text" order="1"/>
+       <plugin id="another-one-default-parser" order="2"/>
+      ....
+   </mime-type>
+   
+   <mime-type name="application/vnd.ms-powerpoint">
+    <!-- if no order attribute is specified, then the order of declaration is significant -->
+     <plugin id="parse-mspowerpoint"/>
+   </mime-type>
  
+   <mime-type name="application/pdf">
-       <plugin name=”parse-pdf” order=”1”/>
+     <plugin id="parse-pdf" order="1"/>
-       <plugin name=”parse-pdf-worse” order=”2”/>
+     <plugin id="parse-pdf-worse" order="2" />
+   </mime-type>
+   ....
-      …
-    </plugins>
-   </fileType>
-     …
  </parse-plugins>
  
  }}}
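
The lookup that such a preference file implies can be sketched in Java as follows. This is only a minimal illustration, not Nutch's ParserFactory: the class name ParsePluginPreferences, its methods, and the fallback to the "*" entry when a mime type has no mapping are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory form of parse-plugins.xml; not part of Nutch. */
final class ParsePluginPreferences {

    /** One plugin entry: its id and its order attribute. */
    static final class Entry {
        final String pluginId;
        final int order;

        Entry(String pluginId, int order) {
            this.pluginId = pluginId;
            this.order = order;
        }
    }

    private final Map<String, List<Entry>> byMimeType = new HashMap<>();

    /** Records that pluginId may parse mimeType with the given order. */
    void add(String mimeType, String pluginId, int order) {
        byMimeType.computeIfAbsent(mimeType, k -> new ArrayList<>())
                  .add(new Entry(pluginId, order));
    }

    /**
     * Returns plugin ids for mimeType, lowest order first.
     * Assumption: unmapped mime types fall back to the "*" entry.
     */
    List<String> pluginsFor(String mimeType) {
        List<Entry> entries = byMimeType.get(mimeType);
        if (entries == null) {
            entries = byMimeType.getOrDefault("*", new ArrayList<>());
        }
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt(e -> e.order));
        List<String> ids = new ArrayList<>();
        for (Entry e : sorted) {
            ids.add(e.pluginId);
        }
        return ids;
    }

    public static void main(String[] args) {
        ParsePluginPreferences prefs = new ParsePluginPreferences();
        prefs.add("*", "parse-text", 1);
        prefs.add("application/pdf", "parse-pdf-worse", 2);
        prefs.add("application/pdf", "parse-pdf", 1);
        System.out.println(prefs.pluginsFor("application/pdf"));
        System.out.println(prefs.pluginsFor("image/png"));
    }
}
```

In this sketch the parser factory would try the returned plugins in list order and stop at the first one that succeeds.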
  
+ === Activating Parse Plugins ===
+ If an activated parse plugin is not listed in parse-plugins.xml, it won’t get called for
parsing. The purpose of the parse-plugins.xml file is to map parsing plugins to content types.
Therefore, a plugin that is activated but not mapped to any content type remains “activated”
yet is never called. This is very similar to Apache HTTPD. See below:
  
- One of the main impacts of having a file like parse-plugins.xml is that no longer should
the pathSuffix="" be part of the plugin.xml descriptor. We propose to move that out of plugin.xml
and into the mime-types.xml file.
+ {{{
+ # httpd.conf example
+ # add handler for php
+ 
+ LoadModule php4_module        libexec/httpd/libphp4.so
+ 
+ # map handler to mimeType
+ AddType application/x-httpd-php .php
+ AddType application/x-httpd-php-source .phps
+ 
+ AddHandler php-script   php
+ AddHandler php-script   phps
+ }}}
+ There are two different levels in the above example. First, the plugin is “activated”
in the LoadModule section. Then, the plugin is “mapped” to a content type in the AddHandler
section. We believe that this is the way to go. Apache HTTPD is pervasive, and its model is
well understood by many of the same folks who would want to use Nutch. Although we realize
that this is a change from the way that Nutch currently works, and that people don’t like
change, we believe that this change is necessary and is something that Nutch should adopt.
+ 
+ === Maintaining consistency between parse-plugins.xml and nutch-default.xml activated plugins
===
+ An interesting question arises in the following two examples:
+ 
+  *No plugin defined in parse-plugins for a specified content-type, but many activated plugins
that can deal with this content-type.
+  *Many plugins defined in the parse-plugins for a specified content-type, but with the same
priority
+ 
+ Unfortunately, this is erroneous input by the user, which we as developers cannot elegantly
prevent. We propose a simple way to handle it: if the user specifies multiple parse plugins
with the same priority for a content type, then LOG.severe() and exit. This is no different
from what other systems do with bogus user input. For instance, in Apache HTTPD, if a user
specifies that .cgi files should be handled by a text-handler ''and'' by a perl-handler,
Apache HTTPD will log an error message and exit, which we believe is the correct thing to do
in that case. Users of the Nutch system will need to examine the parse-plugins.xml file and
ensure that they don’t set the same priority for two different parse plugins for a particular
mimeType. We propose to note this in a comment in the parse-plugins.xml file, and also to
note it as a major change in the Nutch installation process.
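
The same-priority check described above can be sketched as follows. This is an illustration under stated assumptions rather than Nutch code: the class PluginOrderValidator and the nested-map representation of the configuration are hypothetical, and the sketch throws an exception after logging instead of exiting directly, so that the caller decides how to terminate.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical validator for the parsed parse-plugins.xml contents. */
final class PluginOrderValidator {

    private static final Logger LOG =
            Logger.getLogger(PluginOrderValidator.class.getName());

    /**
     * Checks a map of mime type -> (plugin id -> order).
     * Logs at SEVERE level and throws if two plugins mapped to the
     * same mime type share an order value.
     */
    static void validate(Map<String, Map<String, Integer>> ordersByMimeType) {
        for (Map.Entry<String, Map<String, Integer>> mimeType
                : ordersByMimeType.entrySet()) {
            // order value -> plugin id that first claimed it
            Map<Integer, String> seen = new HashMap<>();
            for (Map.Entry<String, Integer> plugin
                    : mimeType.getValue().entrySet()) {
                String previous = seen.put(plugin.getValue(), plugin.getKey());
                if (previous != null) {
                    String message = "Plugins " + previous + " and "
                            + plugin.getKey() + " share order "
                            + plugin.getValue() + " for " + mimeType.getKey();
                    LOG.severe(message);
                    throw new IllegalStateException(message);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> pdf = new LinkedHashMap<>();
        pdf.put("parse-pdf", 1);
        pdf.put("parse-pdf-worse", 2);
        Map<String, Map<String, Integer>> config = new HashMap<>();
        config.put("application/pdf", pdf);
        validate(config); // distinct orders, so no exception is thrown
        System.out.println("configuration accepted");
    }
}
```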
+ 
+ === Path Suffix Attribute in plugin.xml files and erroneous mime types returned by web servers
for files ===
+ Another main impact of having a file like parse-plugins.xml is that pathSuffix="" should
no longer be part of the plugin.xml descriptor. We propose to move it out of plugin.xml and
into the mime-types.xml file. Additionally, we can "kill two birds with one stone" here and
handle an oft-occurring problem users are experiencing with Nutch: erroneous mime types
returned by web servers for particular files. Specifically, we propose to add a MimeType
alias mapper to the mime-types.xml file that will allow us to associate the standard IANA
mime types with the non-standard mime types that web servers return for them. These two
proposed changes to mime-types.xml would look like the following:
+ 
+ {{{
+ 
+ <!-- mime-types.xml file -->
+   <mime-type name="application/vnd.ms-powerpoint">
+     <!-- pathSuffix lives here now -->
+       <ext>ppt</ext>
+       <magic offset="....." type="..." value="..."/>
+ 
+     <!-- here are other mime types that are not the default IANA mime types, but still
returned by servers -->
+       <mapped-type name="application/powerpoint"/>
+       <mapped-type name="application/mspowerpoint"/>
+    </mime-type>
+  
+ }}}
+ 
+ To handle this mapping, we propose adding two new methods to the MimeType java class:
{{{public static MimeType map(MimeType);}}} and {{{public static MimeType map(String);}}}.
Both would handle the mapping declared in the mime-types.xml file.
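
A minimal sketch of the alias lookup behind such a map(String) method is shown below. It is not the proposed MimeType class: to stay self-contained it works on plain strings, and the class MimeTypeAliases, its register method, and the lower-casing of names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical alias table built from mapped-type entries. */
final class MimeTypeAliases {

    // non-standard name -> canonical (IANA) name
    private static final Map<String, String> ALIASES = new HashMap<>();

    /** Registers the aliases declared under one mime-type element. */
    static void register(String canonical, String... aliases) {
        for (String alias : aliases) {
            ALIASES.put(normalize(alias), normalize(canonical));
        }
    }

    /**
     * Returns the canonical name for a server-reported mime type,
     * or the normalized input if no alias is registered for it.
     */
    static String map(String reported) {
        String key = normalize(reported);
        return ALIASES.getOrDefault(key, key);
    }

    // Assumption: names are compared case-insensitively.
    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        register("application/vnd.ms-powerpoint",
                 "application/powerpoint", "application/mspowerpoint");
        System.out.println(map("application/powerpoint"));
        System.out.println(map("text/html"));
    }
}
```

With a table like this, the parser factory could normalize the content type reported by a server before consulting parse-plugins.xml, so that only the canonical names need to be listed there.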
+ 
+ 
+ === Proposal Task Summary ===
+ To summarize, our proposal to improve the parser factory consists of the following tasks:
+ 
+  1. Provide a mime-type mapper (based on IANA) in the util.mime package. Implementation
to be refined: uses an extension of the existing mime-type.dtd
+  2. Define a schema for the parse-plugins.xml file
+  3. Deprecate the pathSuffix from plugin.xml file
+  4. ParserFactory must check the content-type used in the parse-plugins.xml file against
the content-type(s) specified in the plugin.xml. If they match, all is ok; if not, the plugin
is not used.
  
  == Architectural Impact ==
  
@@ -49, +109 @@

   *Fetcher
   *PluginSystem
   *ParserFactory
+  *MimeTypeSystem
  
  === Impact on current releases of Nutch ===
  
  ''Incompatibilities''
  
- By moving the contentType and pathSuffix out of the plugin.xml file, this would create an
updated version of the plugin.xml descriptor schema for each plugin. To lessen the effect
on previous and near-term releases of Nutch this information could be left as an option in
the plugin.xml schema, but marked as “deprecated” to let people know that this functionality
isn’t part of the parse plugin identification process anymore, but it is left in the schema
so as not to create incompatibilities with the plugin.xml files that people have already wrote.
However, ultimately in future releases of Nutch, we propose that the contentType and pathSuffix
attributes should be removed from the plugin.xml schema.
+ Moving the pathSuffix out of the plugin.xml file and into the mime-types.xml file would
create an updated version of the plugin.xml descriptor schema for each plugin, along with an
updated mime-types.xml descriptor schema. Additionally, storing the mime type aliases in the
mime-types.xml file will also require an addition to the mime-types.xml schema. To lessen
the effect on previous and near-term releases of Nutch, the pathSuffix attribute could be
left as an option in the plugin.xml schema, but marked as “deprecated” to let people know
that this functionality isn’t part of the parse plugin identification process anymore. It
is left in the schema so as not to create incompatibilities with the plugin.xml files that
people have already written. However, ultimately in future releases of Nutch, we propose
that the pathSuffix attribute should be removed from the plugin.xml schema.
  
- Other than the plugin.xml file schema change, this capability addition will simply control
the order in which parsing plugins get called during fetching activities. It won’t directly
impact the segments stored, or the webapp, or any of the main components of Nutch.
+ The proposed capability addition will simply control the order in which parsing plugins
get called during fetching activities. It won’t directly impact the segments stored, or
the webapp. It will only affect the fetcher component, and the mime types component.
  
  ''Issues''
  
  The proposed new capabilities should be first tested on local systems, and if successful,
uploaded to JIRA, and verified against the latest SVNs.
- Unit tests should be written to verify appropriate plugin parsing order.
- Users will need to be notified in the Nutch tutorial and instruction lists about how to
set up the parsing plugin preferences prior to performing a fetch.
+ Unit tests should be written to verify appropriate plugin parsing order. Users will need
to be notified in the Nutch tutorial and instruction lists about how to set up the parsing
plugin preferences prior to performing a fetch.
  
  == Personnel ==
  
@@ -72, +132 @@

  
  == Timeframe ==
  
-  *Begin work the weekend of 9/9
+  *Begin work the weekend of 9/16
-  *Complete first prototype patches to JIRA by end of week, 9/18
+  *Complete first prototype patches to JIRA by end of week, 9/25
-  *Test against latest SVNs of Nutch, by 9/25
+  *Test against latest SVNs of Nutch, by 10/1
-  *Delivery of operational capability, by 10/1
+  *Delivery of operational capability, by 10/8
  
  == Affected files ==
   *PluginRepository.java
   *PluginManifestParser.java
   *ParserFactory.java
   *plugin.xml descriptor files
+  *mime-types.xml file
+  *addition of parse-plugins.xml file
   *files in package {{{org.apache.nutch.util.mime}}}
  
