lucene-general mailing list archives

From Harsha1 <99harsha.h....@gmail.com>
Subject Re: Catogarization is possible in Lucene?
Date Mon, 29 Jun 2009 05:20:43 GMT

Hi Ted,
Thanks for your reply.

As you mentioned in your reply, I could search for documents that have the
string "Authors:" within 10 words of a particular name. But that was just an
example. In practice I won't know in advance what the title will be or how
many names follow it.
All we know beforehand is that the paragraph will contain a colon: the word
just before the colon is the title, and the text after the colon up to the
next period is the list of names.
So when I pass in this paragraph, I want it categorized as Title = , name1 =
, name2 = ....

The paragraph may look like this:

"Lucene can use the output of such a system, but does not support doing the
extraction itself.  Many times, full scale named entity extraction is not
really necessary and in those cases the phrase query in Lucene can help you
out. Friends: Ted Dunning, Sree HArsha."

So here,
Title will be Friends
Name1 will be Ted Dunning
Name2 will be Sree HArsha
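
A minimal sketch of the rule described above, using plain java.util.regex
rather than any Lucene feature (the thread does not identify a built-in one).
The class name TitleNameExtractor, the key names Title/Name1/Name2, and the
choice to split names on commas or the word "and" are assumptions made for
illustration. It only handles the first "Title: names." occurrence, and a
name containing a period (such as an initial) would be cut short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Lucene or any library.
class TitleNameExtractor {
    // Group 1: the word immediately before the colon (the title).
    // Group 2: everything after the colon up to the next period (the names).
    private static final Pattern FIELD =
            Pattern.compile("(\\w+):\\s*([^.:]+)\\.");

    // Assumption: names are separated by commas or by the word "and".
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s*,\\s*|\\s+and\\s+");

    static Map<String, String> extract(String paragraph) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(paragraph);
        if (!m.find()) {
            return fields; // no "Title: names." pattern in this paragraph
        }
        fields.put("Title", m.group(1));
        int n = 1;
        for (String name : SEPARATOR.split(m.group(2))) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                fields.put("Name" + n++, trimmed);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String paragraph = "Many times, full scale named entity extraction is "
                + "not really necessary. Friends: Ted Dunning, Sree HArsha.";
        // Prints {Title=Friends, Name1=Ted Dunning, Name2=Sree HArsha}
        System.out.println(extract(paragraph));
    }
}
```

The result is an insertion-ordered map, so the number of names does not need
to be known beforehand.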

To do this, I am looking for a built-in feature that can help, rather than
coding it from scratch in Java.


Ted Dunning wrote:
> 
> It sounds to me that what you are trying to do is information extraction.
> 
> Lucene can use the output of such a system, but does not support doing the
> extraction itself.  Many times, full scale named entity extraction is not
> really necessary and in those cases the phrase query in Lucene can help you
> out.  For instance, you might search for documents that have the string
> "Authors:" within 10 words of a particular name.  That will only retrieve
> documents, however, and would not, say, fill in an author table in a
> database.  You can help such a system by doing simple pre-processing during
> initial document processing and such a system can help in doing information
> extraction by finding documents that are likely to contain the information
> you need to extract.
> 
> I would recommend you look at the GATE system (if you want open source) or
> Lingpipe (if you can pay commercial prices or are doing research).
> 
> http://gate.ac.uk/
> http://alias-i.com/lingpipe/
> 
> On Fri, Jun 26, 2009 at 5:14 AM, Harsha1 <99harsha.h.n99@gmail.com> wrote:
> 
>>
>> Hi,
>> I went through the overview of Lucene and found its somewhat related to
>> text searching and other stuffs.
>>
>> Please let me know if following can be done.
>>
>> Suppose i have a paragraph,
>> This is test program. I have done this using regex and some other function
>> in groovy. But what I am looking is some kind of feature or template or
>> anything wherein I just mention the pattern in which i am interested in.
>> Based on the pattern mention groovy should automatically categorize the
>> fields.  Authors: Micheal Jackson, Daniel O Reily and Harsha.
>>
>> Format we are looking at is,
>> TITLE: NAME1 NAME2 NAME3
>>
>> In this case,
>> TITLE = Authors,
>> NAME1 = Micheal Jackson
>> NAME2 = Daniel O Reily
>> NAME3 = Harsha
>>
>> Like this, When i pass some paragraph, these fields(TITLE: NAME1 NAME2
>> NAME3) categorized automatically. Is it possible? (I have done in java
>> using Regular expression, but we dont want to code from scratch, we want
>> some features from language will automatically do this. or with less code)
>> --
>> View this message in context:
>> http://www.nabble.com/Catogarization-is-possible-in-Lucene--tp24219314p24219314.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
> 
> 

-- 
View this message in context: http://www.nabble.com/Catogarization-is-possible-in-Lucene--tp24219314p24248514.html
Sent from the Lucene - General mailing list archive at Nabble.com.

