lucene-dev mailing list archives

From "bruce"
Subject lucene in combination with pattern recognition...
Date Tue, 20 Jun 2006 14:41:22 GMT

i'm looking at a problem and i can't figure out how to "easily" solve it...

basically, i'm trying to figure out if there's a way to use lucene/nutch
with some form of pattern matching to extract course information from a
College/Registrar's course section...

Assume I can point to a Registrar's section of a College site.
Assume I can then crawl through the section, and capture
 all the underlying information, including the Course
Is there a way to use pattern matching/recognition
 to interpret the DOM and pull out the class schedule
 information? I'm pretty sure there's no vanilla approach,
 so I'd even consider some kind of solution where I might
 have to initially evaluate/analyze the site, to tell it
 what DOM elements are "important"...

anyone done any work/projects like this...
any research/papers/sample apps i could look at...
any thoughts/comments/etc....

i could brute force this by writing a bunch of perl
scripts, with each script tied to a given registrar site,
but i'd like a more generalizable solution if one exists..
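
to make the "tell it what DOM elements are important" idea concrete, here's
a rough sketch of what i have in mind: one generic extractor, plus a small
per-site profile written after analyzing the site by hand. this is only an
illustration in python using the standard library's html.parser, not
anything lucene/nutch provides. the profile layout, the table id "schedule",
the column positions, and the sample page are all made up, and it assumes
the schedule sits in a single non-nested html table...

```python
from html.parser import HTMLParser

# Hypothetical per-site profile: which table holds the schedule and which
# column carries which field. Every value here is an assumption.
SITE_PROFILE = {
    "table_id": "schedule",
    "columns": {"course": 0, "title": 1, "days": 2, "time": 3},
}


class TableRowParser(HTMLParser):
    """Collect the <td> text of every row in the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []
        self._row = None
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self._row = []
        elif self.in_table and tag == "td" and self._row is not None:
            self.in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag == "td" and self.in_cell:
            self._row.append("".join(self._cell).strip())
            self.in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty
                self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)


def extract_courses(html, profile):
    """Apply a site profile to a page and return one dict per course row."""
    parser = TableRowParser(profile["table_id"])
    parser.feed(html)
    parser.close()
    cols = profile["columns"]
    needed = max(cols.values()) + 1
    return [
        {field: row[idx] for field, idx in cols.items()}
        for row in parser.rows
        if len(row) >= needed  # skip rows too short to match the profile
    ]


# Made-up sample page standing in for a crawled registrar page.
SAMPLE_PAGE = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="schedule">
  <tr><th>Course</th><th>Title</th><th>Days</th><th>Time</th></tr>
  <tr><td>CS 101</td><td>Intro to Programming</td><td>MWF</td><td>09:00-09:50</td></tr>
  <tr><td>MATH 220</td><td>Linear Algebra</td><td>TTh</td><td>13:00-14:15</td></tr>
</table>
"""

courses = extract_courses(SAMPLE_PAGE, SITE_PROFILE)
for course in courses:
    print(course)
```

with something like this, supporting a new registrar site would mean writing
a new profile rather than a new script, but someone still has to work out
each profile by hand, which is the part i'd like to automate...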


