lucene-java-user mailing list archives

From amg qas <amg...@gmail.com>
Subject How to parse & index different portions of an HTML page using Tika & Lucene ?
Date Tue, 11 Jan 2011 01:54:33 GMT
I have been trying to parse and index different portions of an HTML page
using Tika and Lucene. For example, I would like to index the text within
the <Title>, <H1>, <H2>, and <A> tags of an HTML page separately and apply a
different boost to each of them. I am using Tika for HTML parsing and
creating a Document object with the appropriate fields to be indexed.
However, I could not find anything in Tika that would let me index the tags
I want out of the box.

My code looks something like this:

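// Parse the file with Tika; -1 disables the handler's write limit.
InputStream is = new FileInputStream(f);
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();  // populated by the parser
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);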

try {
 parser.parse(is, handler, metadata, context);
} finally {
 is.close();
}

Document doc = new Document();
doc.add(new Field("contents", handler.toString(),
  Field.Store.NO, Field.Index.ANALYZED));

for (String name : metadata.names()) {
 String value = metadata.get(name);

 if (textualMetadataFields.contains(name)) {
  doc.add(new Field("contents", value,
    Field.Store.NO, Field.Index.ANALYZED));
 }

 // Keep the raw metadata value as a stored-only field.
 doc.add(new Field(name, value, Field.Store.YES, Field.Index.NO));
}

Stepping through Tika's HTML parsing code, I found that the
org.apache.tika.parser.html.HtmlHandler class is what populates the Metadata
object. I have the following questions:
·  Do I need to write a custom HTML handler to extract the text within
   specific elements of an HTML page?
·  Is there a class in Tika that can extract the text within the HTML tags
   one specifies and populate the Metadata object accordingly? Could someone
   please provide code samples for any solutions you propose?

Thanks,
Amg
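
For reference, the custom-handler approach raised in the first question might
look roughly like the sketch below. It relies only on the JDK's SAX classes.
The class name TagTextHandler and its parse() and get() helpers are
hypothetical, not part of Tika or Lucene. The sketch assumes the parser
delivers the elements of interest as SAX startElement/endElement events, which
with Tika depends on the HtmlMapper in use and on which ContentHandler is
passed to parser.parse(...). The demo feeds well-formed XHTML through the
JDK's own SAX parser rather than through Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that collects the text found inside selected tags. */
public class TagTextHandler extends DefaultHandler {
    private final Set<String> wanted;                                  // tag names to capture
    private final Map<String, StringBuilder> text = new LinkedHashMap<>();
    private final Deque<String> open = new ArrayDeque<>();             // wanted tags currently open

    public TagTextHandler(String... tags) {
        wanted = new HashSet<>(Arrays.asList(tags));
    }

    // Namespace-aware parsers fill localName; others only fill qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (wanted.contains(n)) {
            open.push(n);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text nested in several wanted tags (e.g. <a> inside <h1>) counts for each of them.
        for (String n : new HashSet<>(open)) {
            text.computeIfAbsent(n, k -> new StringBuilder()).append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (wanted.contains(n) && n.equals(open.peek())) {
            open.pop();
            // Separate the text of consecutive occurrences of the same tag.
            text.computeIfAbsent(n, k -> new StringBuilder()).append(' ');
        }
    }

    /** Returns the collected text for a tag, or "" if the tag never occurred. */
    public String get(String tag) {
        StringBuilder b = text.get(tag);
        return b == null ? "" : b.toString().trim();
    }

    /** Demo helper: runs the handler over well-formed XHTML with the JDK SAX parser. */
    public static TagTextHandler parse(String xhtml, String... tags) {
        TagTextHandler handler = new TagTextHandler(tags);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TagTextHandler h = parse(
            "<html><head><title>My Page</title></head><body>"
            + "<h1>Intro <a href=\"x\">link</a></h1><h2>Details</h2></body></html>",
            "title", "h1", "h2", "a");
        for (String tag : new String[] {"title", "h1", "h2", "a"}) {
            System.out.println(tag + " = " + h.get(tag));
        }
    }
}
```

With Tika, an instance of such a handler could presumably be passed to
parser.parse(is, handler, metadata, context) in place of the
BodyContentHandler, since that method accepts any SAX ContentHandler. Each
string returned by get(...) could then go into its own Lucene Field, with a
per-field boost applied through Field.setBoost(float) before the field is
added to the Document.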
