lucene-java-user mailing list archives

From Simon Courtenage
Subject Re: lucene in combination with pattern recognition...
Date Thu, 22 Jun 2006 21:08:25 GMT
You might also check out an old paper by Kruger, Giles, Lawrence et al. on
a search engine called Deadliner (see here at
Deadliner crawled for Calls for Papers for conferences, using Support
Vector Machines trained to recognise relevant pages, and then applying
sets of regular expressions to extract information from the CFP pages.
Lawrence is now with Google, I believe.
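
As a rough illustration of that two-stage idea (first decide whether a page
is relevant, then run regular expressions over it), here is a minimal Python
sketch. It is not Deadliner's code: a simple keyword count stands in for the
trained SVM, and the names looks_like_cfp and extract_deadline, the keyword
list, and the patterns are all assumptions made up for this example.

```python
import re

# Hypothetical keyword list; a crude stand-in for a trained SVM classifier.
CFP_KEYWORDS = ("call for papers", "submission", "deadline", "program committee")

MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
# Accepts "15 March 2007" or "March 15, 2007" (assumed formats only).
DATE = rf"(?:\d{{1,2}}\s+{MONTH}\s+\d{{4}}|{MONTH}\s+\d{{1,2}},?\s+\d{{4}})"

# Hypothetical extraction pattern: a deadline phrase followed by a date.
DEADLINE_RE = re.compile(
    rf"(?:submission\s+deadline|papers?\s+due)\s*[:\-]?\s*(?P<date>{DATE})",
    re.IGNORECASE,
)


def looks_like_cfp(text):
    """Stage 1: treat a page as relevant if it mentions at least two keywords."""
    lowered = text.lower()
    hits = sum(1 for keyword in CFP_KEYWORDS if keyword in lowered)
    return hits >= 2


def extract_deadline(text):
    """Stage 2: apply the regular expression only to pages that passed stage 1."""
    if not looks_like_cfp(text):
        return None
    match = DEADLINE_RE.search(text)
    return match.group("date") if match else None


if __name__ == "__main__":
    page = ("Call for Papers: Workshop on Search. "
            "Submission deadline: 15 March 2007. "
            "Program committee to be announced.")
    print(extract_deadline(page))
```

The same split might carry over to registrar pages: one step to pick out the
course-listing pages, and a separate set of patterns per field (course code,
meeting time, and so on), though the patterns would need tuning per site.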

Hope this helps,


Bob Carpenter wrote:
> Check out Andrew McCallum's paper:
> It mentions this very problem.  There are
> also some more technical presentations around.
> He was part of the Whiz-Bang team that took
> on the problem.  The fact that the company's
> out of business is a testament to how hard
> this problem is in general.
> - Bob Carpenter
>   Alias-i
>> i'm looking at a problem and i can't figure out how to "easily" solve 
>> it...
>> basically, i'm trying to figure out if there's a way to use lucene/nutch
>> with some form of pattern matching to extract course information from a
>> College/Registrar's course section...
>> Assume I can point to a Registrar's section of a College site.
>> Assume I can then crawl through the section, and capture
>>  all the underlying information, including the Course
>>  information...
>> Is there a way to somehow use pattern matching/recognition
>>  to somehow interpret the DOM to pull out the class schedule
>>  information. I'm pretty sure there's no vanilla approach,
>>  so I'd even consider some kind of solution where I might
>>  have to initially evaluate/analyze the site, to tell it
>>  what DOM elements are "important"...
>> anyone done any work/projects like this...
>> any research/papers/sample apps i could look at...
>> any thoughts/comments/etc....
>> i could brute force this by writing a bunch of perl
>> scripts, with each script tied to a given registrar site,
>> but i'd like a more generalizable solution if one exists..
>> thanks
>> -bruce
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
