lucene-java-user mailing list archives

From Bob Carpenter <c...@alias-i.com>
Subject Re: lucene in combination with pattern recognition...
Date Thu, 22 Jun 2006 20:12:03 GMT
Check out Andrew McCallum's paper:

http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf

It mentions this very problem.  There are
also some more technical presentations around.

He was part of the WhizBang! Labs team that took
on the problem.  The fact that the company's
out of business is a testament to how hard
this problem is in general.

- Bob Carpenter
   Alias-i

> 
> i'm looking at a problem and i can't figure out how to "easily" solve it...
> 
> basically, i'm trying to figure out if there's a way to use lucene/nutch
> with some form of pattern matching to extract course information from a
> College/Registrar's course section...
> 
> Assume I can point to a Registrar's section of a College site.
> Assume I can then crawl through the section, and capture
>  all the underlying information, including the Course
>  information...
> Is there a way to use pattern matching/recognition
>  to interpret the DOM and pull out the class schedule
>  information? I'm pretty sure there's no vanilla approach,
>  so I'd even consider some kind of solution where I might
>  have to initially evaluate/analyze the site, to tell it
>  which DOM elements are "important"...
> 
> anyone done any work/projects like this...
> any research/papers/sample apps i could look at...
> any thoughts/comments/etc....
> 
> i could brute force this by writing a bunch of perl
> scripts, with each script tied to a given registrar site,
> but i'd like a more generalizable solution if one exists..
> 
> thanks
> 
> -bruce
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



