lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Format Stripping [ was: XLS parser ]
Date Tue, 22 Jan 2002 17:39:01 GMT
> From: Andrew C. Oliver [mailto:acoliver@nc.rr.com]
>
> We've implemented an event based
> system for reading documents (so you register for what you care about
> and then kick it off and it throws events to listeners as it runs into
> them).  Not sure if there is a clean way to graft those ideas onto
> Lucene for a single pass read.

I'm not sure the metaphor is apt.  The listener pattern is used with
parsers.  Lucene is not a parser, but rather something that you'd like to
call from a parser.

For example, one might do something like the following to add text to a
Lucene index with a SAX parser:

  parser.setContentHandler(new ContentHandler() {
    private Document document = new Document();
    private String fieldName;
    // Remember the enclosing element's name; it becomes the field name.
    public void startElement(String ns, String name, ...) {
      fieldName = name;
    }
    // Index (but do not store) each text chunk under the current field.
    public void characters(char[] chars, int start, int len) {
      String text = new String(chars, start, len);
      document.add(Field.UnStored(fieldName, text));
    }
  });

(Note that in Lucene a given field name may be added to a Document many
times, with the effect of appending the contained text chunks in the index.
The only proviso is that tokens will not span chunk boundaries.)

This code seems completely natural to me.  I'm not sure how an event-based
indexer would look in this context.
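
To make the direction of control concrete, here is a minimal, self-contained
sketch of the same idea that runs with only the JDK's SAX classes.  It is an
illustration under stated assumptions, not Lucene code: the class name
SaxFieldCollector, the method collect, and the list of (field, text) pairs
are all hypothetical, and the list merely stands in for the
document.add(Field.UnStored(...)) call above.  It uses qName because the
default SAXParserFactory is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFieldCollector {

    // Hypothetical stand-in for a Lucene Document: an ordered list of
    // {fieldName, text} pairs. A field name may appear more than once.
    static List<String[]> collect(String xml) throws Exception {
        final List<String[]> chunks = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private String fieldName;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The most recently opened element names the field.
                fieldName = qName;
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                String text = new String(ch, start, len);
                // Skip whitespace-only chunks between elements.
                if (fieldName != null && !text.trim().isEmpty()) {
                    // In the Lucene version this would be the document.add call.
                    chunks.add(new String[] { fieldName, text });
                }
            }
        };

        // The parser drives; the handler pushes text to the collector.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Hello</title><body>World</body></doc>";
        for (String[] chunk : collect(xml)) {
            System.out.println(chunk[0] + ": " + chunk[1]);
        }
    }
}
```

A SAX parser is free to deliver one run of text in several characters()
calls, so the same field name can be added more than once; that is the
situation the note about chunk boundaries describes.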

Doug
