lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Extracting contact data
Date Thu, 14 Jan 2010 14:45:04 GMT
> > Do you think I can get any advantage from building a solution on Lucene?

Lucene is generally about information retrieval, not information extraction (as suggested,
GATE or UIMA are more commonly used for extraction).
However, Lucene can play a role in extraction if you use it for determining probabilities
rather than using purely rule-based extraction techniques such as regex.
A Lucene index provides fast look-ups of term frequencies and can therefore help inform the
likelihood that a word is being used in a particular context, given large volumes of training
data.
You'll need to get creative about which sources of existing pre-tagged data might be useful
for training, and you'll have to write a bunch of custom code, but in my experience Lucene
can be useful for extraction when used this way.
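
As a rough illustration of the frequency-based idea, here is a minimal sketch in plain Python.
It is not Lucene code: the in-memory counters stand in for the term-frequency look-ups an index
would provide (for example via IndexReader.docFreq), and the training pairs, the tag names
STREET/OTHER, and the helper names build_counts and tag_probability are all hypothetical.

```python
from collections import Counter

# Hypothetical pre-tagged training data: (token, tag) pairs.
# In practice these counts would come from a large index, not a literal list.
TRAINING = [
    ("main", "STREET"), ("street", "STREET"), ("high", "STREET"),
    ("street", "STREET"), ("main", "OTHER"), ("the", "OTHER"),
    ("the", "OTHER"), ("contact", "OTHER"), ("high", "OTHER"),
    ("high", "OTHER"),
]


def build_counts(pairs):
    """Count how often each token occurs overall and under each tag."""
    total = Counter()
    tagged = Counter()
    for token, tag in pairs:
        total[token] += 1
        tagged[(token, tag)] += 1
    return total, tagged


def tag_probability(token, tag, total, tagged, smoothing=1.0, num_tags=2):
    """Estimate P(tag | token) from frequencies, with additive smoothing.

    Unseen tokens fall back to a uniform guess of 1 / num_tags.
    """
    numerator = tagged[(token, tag)] + smoothing
    denominator = total[token] + smoothing * num_tags
    return numerator / denominator


total, tagged = build_counts(TRAINING)

for word in ("street", "high", "the", "boulevard"):
    p = tag_probability(word, "STREET", total, tagged)
    print(f"P(STREET | {word}) = {p:.2f}")
```

A score like this could then be combined with rule-based hints (a regex match, a nearby context
word) rather than replacing them; how to weight the two is something to find out empirically.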

Cheers,
Mark




----- Original Message ----
From: Julien Nioche <lists.digitalpebble@gmail.com>
To: java-user@lucene.apache.org
Sent: Thu, 14 January, 2010 12:41:01
Subject: Re: Extracting contact data

Hi,

Tools like GATE (http://www.gate.ac.uk) or Apache UIMA would be good
candidates for what you are trying to achieve.

HTH
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2010/1/14 Ortelli, Gian Luca <gianluca.ortelli@truvo.com>

>
> Well, the exact definition we're going to find out empirically,
> as we run an implementation through our data and look at the quality
> of results... For now, I would use the number of tokens between the
> finding ("abc@def.com") and the word that gives context ("Contact").
>
> Anyway, replying to karl: I'm not searching for a given
> email/street/time interval/etc., I need to extract EVERY
> email/street/time interval/etc. from the text. The kind of need for
> which you suggest a natural language processing tool.
>
>  Gianluca
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wednesday, January 13, 2010 6:06 PM
> To: java-user@lucene.apache.org
> Subject: Re: Extracting contact data
>
> Before answering, how do you measure "proximity"? You can make
> Lucene work with locations (there's an example in Lucene In Action)
> readily enough though....
>
> HTH
> Erick
>
> On Wed, Jan 13, 2010 at 11:39 AM, Ortelli, Gian Luca <
> gianluca.ortelli@truvo.com> wrote:
>
> > Hi community,
> >
> >
> >
> > I have a general understanding of Lucene concepts, and I'm wondering if
> > it's the right tool for my job:
> >
> >
> >
> > - I need to extract data such as time intervals ("8am - 12pm") and street
> > addresses from a set of files. The common issue with these data units is
> > that they contain spaces and are not always definable through regexes.
> >
> >
> >
> > - the extraction must take into consideration the "proximity": for
> > example, a mail address which is close to the word "Contacts" will
> > receive a higher rank, since I'm looking for contact data.
> >
> >
> >
> > Do you think I can get any advantage from building a solution on Lucene?
> >
> >
> >
> >  Gianluca
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



      



