lucene-java-user mailing list archives

From Bob Carpenter
Subject Re: lucene in combination with pattern recognition...
Date Thu, 22 Jun 2006 20:12:03 GMT
Check out Andrew McCallum's paper:

It mentions this very problem.  There are
also some more technical presentations around.

He was part of the Whiz-Bang team that took
on the problem.  The fact that the company's
out of business is a testament to how hard
this problem is in general.

- Bob Carpenter

> i'm looking at a problem and i can't figure out how to "easily" solve it...
> basically, i'm trying to figure out if there's a way to use lucene/nutch
> with some form of pattern matching to extract course information from a
> College/Registrar's course section...
> Assume I can point to a Registrar's section of a College site.
> Assume I can then crawl through the section, and capture
>  all the underlying information, including the Course
>  information...
> Is there a way to somehow use pattern matching/recognition
>  to somehow interpret the DOM to pull out the class schedule
>  information. I'm pretty sure there's no vanilla approach,
>  so I'd even consider some kind of solution where I might
>  have to initially evaluate/analyze the site, to tell it
>  what DOM elements are "important"...
> anyone done any work/projects like this...
> any research/papers/sample apps i could look at...
> any thoughts/comments/etc....
> i could brute force this by writing a bunch of perl
> scripts, with each script tied to a given registrar site,
> but i'd like a more generalizable solution if one exists..
> thanks
> -bruce
