lucene-java-user mailing list archives

From findbestopensource <findbestopensou...@gmail.com>
Subject Re: How to parse & index different portions of an HTML page using Tika & Lucene ?
Date Tue, 11 Jan 2011 05:15:37 GMT
Your problem is really a Tika question, so please post it to the Tika user list.

If you only need to deal with HTML, you are better off using a dedicated HTML parser:
http://www.findbestopensource.com/search/?query=%22html+parser%22


On Tue, Jan 11, 2011 at 7:24 AM, amg qas <amgqas@gmail.com> wrote:

> I have been trying to parse and index different portions of an HTML page
> using Tika and Lucene. For example, I would like to index the text within
> the <Title>, <H1>, <H2> and <A> tags of an HTML page separately and give a
> different boost to each of them. I am using Tika for HTML parsing and
> creating a Document object with the appropriate fields that need to be
> indexed. However, I could not find anything in Tika that would let me
> index the tags I want out of the box.
>
> My code looks something like this:
>
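> import java.io.FileInputStream;
> import java.io.InputStream;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.DefaultHtmlMapper;
> import org.apache.tika.parser.html.HtmlMapper;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
>
> // f (the input File) and textualMetadataFields (a Set<String>) are defined elsewhere.
> InputStream is = new FileInputStream(f);
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);
>
> try {
>   parser.parse(is, handler, metadata, context);
> } finally {
>   is.close();
> }
>
> Document doc = new Document();
> // Body text goes into the catch-all "contents" field.
> doc.add(new Field("contents", handler.toString(),
>     Field.Store.NO, Field.Index.ANALYZED));
>
> for (String name : metadata.names()) {
>   String value = metadata.get(name);
>
>   if (textualMetadataFields.contains(name)) {
>     doc.add(new Field("contents", value,
>         Field.Store.NO, Field.Index.ANALYZED));
>   }
>
>   // Field.Index has no YES constant; NOT_ANALYZED indexes the value as one term.
>   doc.add(new Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED));
> }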
>
> Stepping through Tika's HTML parsing code, I found that the
> org.apache.tika.parser.html.HtmlHandler class is what fills in the
> metadata object. I have the following questions:
> ·  Do I need to write a custom HTML handler to extract the text within
>    specific elements of an HTML page?
> ·  Is there a class in Tika that can extract the text within the HTML tags
>    one specifies and fill in the metadata object accordingly? Could someone
>    please provide code samples for any solutions you propose?
>
> Thanks,
> Amg
>
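
On the first question above: since parser.parse() in the quoted code takes a SAX ContentHandler, one possible approach is a small handler of your own that keeps the text of selected elements apart. The following is only a sketch under assumptions, not something Tika ships: TagTextCollector and its collect() helper are hypothetical names, and the demo feeds it well-formed XHTML through the JDK's SAX parser rather than through Tika.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika or Lucene): collects the text found
// inside selected elements, keyed by lower-cased element name.
class TagTextCollector extends DefaultHandler {
    private final Set<String> wanted;
    private final Map<String, List<String>> collected = new LinkedHashMap<>();
    // One buffer per wanted element that is currently open, innermost first.
    private final Deque<StringBuilder> open = new ArrayDeque<>();

    TagTextCollector(Set<String> wantedLowerCaseNames) {
        this.wanted = wantedLowerCaseNames;
    }

    // Prefer the namespace-aware local name; fall back to the qualified name.
    private static String nameOf(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (wanted.contains(nameOf(localName, qName))) {
            open.push(new StringBuilder());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text counts for every open wanted element, so <h1><a>x</a></h1> fills both.
        for (StringBuilder sb : open) {
            sb.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = nameOf(localName, qName);
        if (wanted.contains(name) && !open.isEmpty()) {
            String text = open.pop().toString().trim();
            if (!text.isEmpty()) {
                collected.computeIfAbsent(name, k -> new ArrayList<>()).add(text);
            }
        }
    }

    Map<String, List<String>> getCollected() {
        return collected;
    }

    // Demo entry point: parses well-formed XHTML with the JDK SAX parser.
    static Map<String, List<String>> collect(String xhtml, Set<String> tags) throws Exception {
        TagTextCollector handler = new TagTextCollector(tags);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getCollected();
    }
}
```

With Tika, an instance of such a handler would presumably be passed to parser.parse() in place of the BodyContentHandler, and each collected list added to its own Lucene field with its own boost (for example via Field.setBoost). Which elements actually reach the handler depends on the HtmlMapper in use, so that part needs checking against your Tika version.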
