lucene-java-user mailing list archives

From "Vladimir Olenin" <VOle...@cihi.ca>
Subject does anyone know of a 'smart' categorizing text pattern finder?
Date Tue, 26 Sep 2006 01:49:31 GMT

Hi,

I wonder if anyone here knows of a 'smart' text pattern finder, ideally written in
Java. The library I'm looking for should be able to 'guess' the category of a particular
piece of text on a page, most probably by finding similarities between the bulk of the pages and
a set of templates.

E.g., many forums are powered by phpBB, which structures 99% of the pages (except for some title
pages & user profile pages) in a very similar fashion (the page is broken into blocks, each
block is broken into further blocks, etc.). By comparing many pages with each other (e.g., from
the same domain root: forum.springframework.org) it should be possible to detect the common ('template
decorations') and page-specific (actual content, like 'user name' and 'posting body') parts.
After that it should further be possible, by comparing the 'template decorations' parts to a set
of templates, to 'guess' the nature of each of the 'page-specific' blocks (e.g., 'Vladimir Olenin'
in the left side column will be marked as 'name', while whatever is adjacent to this column
is the post body).
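
To make the first step concrete, here is a minimal sketch of the 'compare many pages' idea, not
an existing library. It assumes each page has already been split into text blocks (how to do that
split is left open), and it treats any block that shows up on most pages as 'template decoration'.
The class name TemplateDiff, both method names and the minShare threshold are hypothetical, made
up for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper: separates repeated 'template' blocks from page-specific ones.
class TemplateDiff {

    /** Counts, for each distinct block, the number of pages it appears on. */
    static Map<String, Integer> documentFrequency(List<List<String>> pages) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> page : pages) {
            // A set ensures a block repeated within one page is counted once.
            for (String block : new HashSet<>(page)) {
                df.merge(block, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Returns the blocks of one page that occur on fewer than minShare of all
     * pages. minShare is an assumed tuning knob, e.g. 0.6 for "60% of pages".
     */
    static List<String> pageSpecific(List<String> page, Map<String, Integer> df,
                                     int pageCount, double minShare) {
        List<String> result = new ArrayList<>();
        for (String block : page) {
            int count = df.getOrDefault(block, 0);
            if ((double) count / pageCount < minShare) {
                result.add(block);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for three crawled forum pages.
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Forum Index", "Author", "alice", "First post text", "Powered by phpBB"),
            Arrays.asList("Forum Index", "Author", "bob", "Second post text", "Powered by phpBB"),
            Arrays.asList("Forum Index", "Author", "carol", "Third post text", "Powered by phpBB"));

        Map<String, Integer> df = documentFrequency(pages);
        System.out.println(pageSpecific(pages.get(0), df, pages.size(), 0.6));
    }
}
```

With the sample data, only "alice" and "First post text" are left for the first page, since every
other block appears on all three pages. This covers only the common vs. page-specific split;
labelling the remaining blocks as 'name' or 'post body' would still need the template comparison
described above.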

So, I wonder if anyone knows of a package capable of such things. The primary goal, though, is simpler:
to be able to parse out just the posters' names from message boards. Though sometimes the 'block
category' can be derived from the CSS class name of the tags around the text, very often that is
not the case.
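
For the easy case where the CSS class does give the category away, a rough sketch could look like
the following. It uses only java.util.regex and handles just elements that wrap plain text with a
double-quoted class attribute; real forum markup would need a proper HTML parser. The class
CssClassExtractor, the method textByClass and the class value "name" are assumptions for
illustration, not something any particular forum software is known to emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls out the text of elements carrying a given CSS class.
class CssClassExtractor {

    // Matches <tag ... class="..." ...>text</tag> where the element contains
    // no nested tags. Deliberately simplistic; not a general HTML parser.
    private static final Pattern ELEMENT = Pattern.compile(
        "<(\\w+)\\b[^>]*\\bclass\\s*=\\s*\"([^\"]*)\"[^>]*>([^<]*)</\\1>",
        Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed text of every matched element whose class list contains cssClass. */
    static List<String> textByClass(String html, String cssClass) {
        List<String> result = new ArrayList<>();
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            // A class attribute may hold several space-separated class names.
            for (String token : m.group(2).trim().split("\\s+")) {
                if (token.equals(cssClass)) {
                    result.add(m.group(3).trim());
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup; the "name" and "postbody" classes are assumptions.
        String html = "<td><span class=\"name\">Vladimir Olenin</span></td>"
                    + "<td><span class=\"postbody\">Hello</span></td>";
        System.out.println(textByClass(html, "name"));
    }
}
```

When no such class exists, this gives nothing back, which is exactly the case where the
page-comparison approach would be needed.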

Might Nutch have similar functionality built into its crawler?

Thanks.

Vlad
