lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From acoli...@apache.org
Subject cvs commit: jakarta-lucene/xdocs/stylesheets project.xml
Date Sat, 23 Feb 2002 22:02:55 GMT
acoliver    02/02/23 14:02:55

  Modified:    xdocs    whoweare.xml
               xdocs/stylesheets project.xml
  Added:       xdocs    luceneplan.xml
  Log:
  added application extensions plan and myself as a comitter
  
  Revision  Changes    Path
  1.3       +1 -0      jakarta-lucene/xdocs/whoweare.xml
  
  Index: whoweare.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/xdocs/whoweare.xml,v
  retrieving revision 1.2
  retrieving revision 1.3
  diff -u -r1.2 -r1.3
  --- whoweare.xml	28 Sep 2001 20:47:31 -0000	1.2
  +++ whoweare.xml	23 Feb 2002 22:02:55 -0000	1.3
  @@ -38,6 +38,7 @@
   <li><b>Dave Kor</b> (davekor at apache.org)</li>
   <li><b>Jon Stevens</b> (jon at latchkey.com)</li>
   <li><b>Tal Dayan</b> (zapta at apache.org)</li>
  +<li><a href="http://www.trilug.org/~acoliver">Andrew C. Oliver</a> (acoliver
at apache dot org)</li>
   </ul>
   </section>
   
  
  
  
  1.1                  jakarta-lucene/xdocs/luceneplan.xml
  
  Index: luceneplan.xml
  ===================================================================
  <?xml version="1.0" encoding="UTF-8"?>
       
  <document>
    <properties>
     <title>Plan for enhancements to Lucene</title>
     <authors>
      <person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/>
     </authors>
    </properties>
    <body>
    
          <section name="Purpose">
                  <p>
                          The purpose of this document is to outline plans for
                          making <a href="http://jakarta.apache.org/lucene">
                          Jakarta Lucene</a> work as a more general drop-in
                          component.  It makes the assumption that this is an
                          objective for the Lucene user and development community.
                  </p>
                  <p>
                          The best reference is <a href="http://www.htdig.org">
                          htDig</a>, though it is not quite as sophisticated as
                          Lucene, it has a number of features that make it
                          desireable.  It however is a traditional c-compiled app
                          which makes it somewhat unpleasent to install on some
                          platforms (like Solaris!).
                  </p>
                  <p>
                          This plan is being submitted to the Lucene developer
                          community for an initial reaction, advice, feedback and
                          consent.  Following this it will be submitted to the
                          Lucene user community for support.  Although, I'm (Andy
                          Oliver) capable of providing these enhancements by 
                          myself, I'd of course prefer to work on them in concert 
                          with others.
                  </p>
                  <p>
                          While I'm outlaying a fairly large featureset, these can
                          be implemented incrementally of course (and are probably
                          best if done that way).
                  </p>
          </section>
    
          <section name="Goal and Objectives">
                  <p>
                          The goal is to provide features to Lucene that allow it
                          to be used as a dropin search engine.  It should provide
                          many of the features of projects like <a
                          href="http://www.htdig.org">htDig</a> while surpassing
                          them with unique Lucene features and capabillities such as
                          easy installation on and java-supporting platform,
                          and support for document fields and field searches.  And 
                          of course, <a href="http://apache.org/LICENSE">
                          a pragmatic software license</a>.
                  </p>
                  <p>
                          To reach this goal we'll implement code to support the
                          following objectives that augment but do not replace
                          the current Lucene featureset.  
                  </p>
                  <ul>
                          <li>
                                  Document Location Independance - meaning mapping
                                  real contexts to runtime contexts.
                                  Essentially, if the document is at
                                  /var/www/htdocs/mydoc.html, I probably want it
                                  indexed as
                                  http://www.bigevilmegacorp.com/mydoc.html.             
                  
                          </li>
                          <li>
                                  Standard methods of creating central indicies -
                                  file system indexing is probably less useful in
                                  many environments than is *remote* indexing (for
                                  instance http).  I would suggest that most folks
                                  would prefer that general functionality be
                                  suppored by Lucene instead of having to write
                                  code for every indexing project.  Obviously, if
                                  what they are doing is *special* they'll have to
                                  code, but general document indexing accross
                                  webservers would not qualify.
                          </li>
                          <li>
                                  Document interperatation abstraction - currently
                                  one must handle document object construction via
                                  custom code.  A standard interface for plugging
                                  in format handlers should be supported.  
                          </li>
                          <li>
                                  Mime and file-extension to document
                                  interperatation mapping.                               
  
                          </li>
                  </ul>
          </section>
          <section name="Indexers">
                  <p>
                          Indexers are standard crawlers.  They go crawl a file
                          system, ftp site, web site, etc. to create the index.
                          These standard indexers may not make ALL of Lucene's
                          functionality available, though they should be able to
                          make most of it available through configuration.
                  </p>
                  <!--<section name="AbstractIndexer">-->
                  <p>
                          <b> Abstract Indexer </b>
                  </p>
                          <p>
                                  The Abstract indexer is basically the parent for all
                                  Indexer classes.  It provides implementation for the
                                  following functions/properties:
                          </p>
                          <ul>
                                  <li>
                                          index path - where to write the index.
                                  </li>
                                  <li>
                                          cui - create or update the index
                                  </li>
                                  <li>
                                          root context - the start of the pathname
                                          that should be replaced by the
                                          replace with property or dropped
                                          entirely.  Example: /opt/tomcat/webapps
                                  </li>
                                  <li>
                                          replace with - when specified replaces
                                          the root context.  Example:
                                          http://jakarta.apache.org.
                                  </li>
                                  <li>
                                          replacement type - the type of
                                          replacewith path:  relative, url or
                                          path.
                                  </li>
                                  <li>
                                          location - the location to start
                                          indexing at.
                                  </li>
                                  <li>
                                          doctypes - only index documents with
                                          these doctypes.  If not specified all
                                          registered mime-types are used.
                                          Example: "xml,doc,html"
                                  </li>
                                  <li>
                                          recursive - if not specified is turned
                                          off.
                                  </li>
                                  <li>
                                          level - optional level of directory or
                                          links to traverse.  By default is
                                          assumed to be infinite.  Recursive must
                                          be turned on or this is ignored.  Range:
                                          0 - Long.MAX_VALUE.
                                  </li>
                                  <li>
                                          properties - in addition to the settings
                                          (probably from the command line) read
                                          this properties file and get them from
                                          it.  Command line options override
                                          the properties file in the case of 
                                          duplicates.  There should also be an
                                          enivironment variable or VM parameter to
                                          set this.
                                  </li>
                          </ul>
                  <!--</section>-->
                  <!--<s2 title="FileSystemIndexer">-->
                          <p>
                                <b>FileSystemIndexer</b>
                          </p>
                          <p>
                                  This should extend the AbstractIndexer and
                                  support any addtional options required for a
                                  filesystem index.
                          </p>
                  <!--</s2>-->
                  <!--<s2 title="HTTPIndexer">-->
                          <p>
  			      <b>HTTP Indexer </b>
                          </p>
                          <p>
                                  Supports the AbstractIndexer options as well as:       
                        
                          </p>
                          <ul>
                                  <li>
                                          span hosts - Wheter to span hosts or not,
                                          by default this should be no.                  
                     
                                  </li>
                                  <li>
                                          restrict domains - (ignored if span
                                          hosts is not enabled).  Whether all
                                          spanned hosts must be in the same domain
                                          (default is off).
                                  </li>
                                  <li>
                                          try directories - Whether to attempt
                                          directory listings or not (so if you
                                          recurse and go to
                                          /nextcontext/index.html this option says
                                          to also try /nextcontext to get the dir
                                          lsiting)
                                  </li>
                                  <li>
                                          map extensions -
                                          (always/default/never/fallback).  Wether
                                          to always use extension mapping, by
                                          default (fallback to mime type), NEVER
                                          or fallback if mime is not available
                                          (default).
                                  </li>
                                  <li>
                                          ignore robots - ignore robots.txt, on or
                                          off (default - off)
                                  </li>
                          </ul>
          <!--        </s2> -->
          </section>
          
          <section name="MIMEMap">
                  <p>
                          A configurable registry of document types, their
                          description, an identifyer, mime-type and file
                          extension.  This should map both MIME -> factory 
                          and extension -> factory.
                  </p>
                  <p>
                          This might be configured at compile time or by a
                          properties file, etc.  For example:
                  </p>
                          <table>
                                  <tr>
                                          <td>Description</td>
                                          <td>Identifier</td>
                                          <td>Extensions</td>
                                          <td>MimeType</td>
                                          <td>DocumentFactory</td>
                                  </tr>
                                  <tr>
                                          <td>"Word Document"</td>
                                          <td>"doc"</td>
                                          <td>"doc"</td>
                                          <td>"vnd.application/ms-word"</td>
                                          <td>POIWordDocumentFactory</td>
                                  </tr>
                                  <tr>
                                          <td>"HTML Document"</td>
                                          <td>"html"</td>
                                          <td>"html,htm"</td>
                                          <td></td>
                                          <td>HTMLDocumentFactory</td>
                                  </tr>                                
                          </table>
          </section>
          <section name="DocumentFactory">
                  <p>
                          An interface for classes which create document objects
                          for particular file types.  Examples:
                          HTMLDocumentFactory, DOCDocumentFactory,
                          XLSDocumentFactory, XML DocumentFactory.
                  </p>
          </section>
          <section name="FieldMapping classes">
                  <p>
                          A class taht maps standard fields from the
                          DocumentFactories into *fields* in the Document objects
                          they create.  I suggest that a regular expression system
                          or xpath might be the most universal way to do this.
                          For instance if perhaps I had an XML factory that
                          represented XML elements as fields, I could map content
                          from particular fields to ther fields or supress them
                          entirely.  We could even make this configurable.
                  </p>
                  <p>
                  
                          for example:
                  </p>
                  <ul>
                          <li>
                                  htmldoc.properties
                          </li>
                          <li>
                          suppress=*
                          </li>
                          <li>
                          author=content:g/author\:\ ........................................./
                          </li>
                          <li>
                          author.suppress=false
                          </li>
                          <li>
                          title=content:g/title\:\ ........................................./
                          </li>
                          <li>
                          title.suppress=false
                          </li>
                  </ul>
                  <p>                
                          In this example we map html documents such that all 
                          fields are suppressed but author and title.  We map 
                          author and title to anything in the content matching 
                          author: (and x characters).  Okay my regular expresions 
                          suck but hopefully you get the idea.
                  </p>
          </section>
          <section name="Final Thoughts">
                  <p>
                          We might also consider eliminating the DocumentFactory 
                          entirely by making an AbstractDocument from which the 
                          current document object would inherit from.  I 
                          experimented with this locally, and it was a relatively 
                          minor code change and there was of course no difference 
                          in performance.  The Document Factory classes would 
                          instead be instances of various subclasses of 
                          AbstractDocument.
                  </p>
                  <p>
                          My inspiration for this is HTDig (http://www.htdig.org/).  
                          While this goes slightly beyond what HTDig provides by 
                          providing field mapping (where HTDIG is just interested 
                          in Strings/numbers wherever they are found), it provides 
                          at least what I would need to use this as a dropin for 
                          most places I contract at (with the obvious exception of 
                          a default set of content handlers which would of course 
                          develop naturally over time).
                  </p>
                  <p>
                          I am able to certainly contribute to this effort if the 
                          development community is open to it.  I'd suggest we do 
                          it iteratively in stages and not aim for all of this at 
                          once (for instance leave out the field mapping at first).
                  </p>
                  <p>
                  
                          Anyhow, please give me some feedback, counter 
                          suggestions, let me know if I'm way off base or out of 
                          line, etc. -Andy
                  </p>
          </section>
                  
    </body>
  </document>
  
  
  
  1.7       +4 -0      jakarta-lucene/xdocs/stylesheets/project.xml
  
  Index: project.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/xdocs/stylesheets/project.xml,v
  retrieving revision 1.6
  retrieving revision 1.7
  diff -u -r1.6 -r1.7
  --- project.xml	26 Jan 2002 16:15:22 -0000	1.6
  +++ project.xml	23 Feb 2002 22:02:55 -0000	1.7
  @@ -22,6 +22,10 @@
           <item name="Javadoc"           href="/api/index.html"/>
           <item name="Contributions"     href="/contributions.html"/>
       </menu>
  +    
  +    <menu name="Plans">
  +        <item name="Application Extensions"           href="/luceneplan.html"/>
  +    </menu>
   
       <menu name="Download">
           <item name="Binaries"           href="/site/binindex.html"/>
  
  
  

--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>


Mime
View raw message