lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Indexing/Querying Annotations and Fields for a document
Date Tue, 18 Mar 2008 00:24:02 GMT
You would parse the XML (or whatever) into separate strings, and put
each piece into its own Field in a Lucene Document.

For instance:

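Document doc = new Document();
String bodyText = getBody(input);
String peopleText = getPeople(input);
// Store/Index options below are one reasonable choice; adjust to your needs.
doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("people", peopleText, Field.Store.NO, Field.Index.TOKENIZED));

writer.addDocument(doc);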


Essentially, you just need to implement the getPeople and getBody  
methods to extract the appropriate content from your text.
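
As a rough sketch of what those two methods might look like, assuming the
annotations are still inline in the text as in the original question
(<Person> John </Person> loves fishing). The class name AnnotationExtractor
and the regex-based approach are illustrative assumptions, not anything
Lucene or GATE provides. If the annotations are already sitting in an
array, getPeople could simply join the array entries instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Lucene or GATE.
public class AnnotationExtractor {
    // Matches inline annotations such as <Person> John </Person>.
    private static final Pattern PERSON =
        Pattern.compile("<Person>\\s*(.*?)\\s*</Person>", Pattern.DOTALL);
    // Matches any opening or closing tag so it can be stripped from the body.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Collects the text of every Person annotation, separated by spaces.
    public static String getPeople(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = PERSON.matcher(input);
        while (m.find()) {
            names.add(m.group(1));
        }
        return String.join(" ", names);
    }

    // Returns the original text with the annotation tags removed
    // and runs of whitespace collapsed.
    public static String getBody(String input) {
        return TAG.matcher(input).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "<Person> John </Person> loves fishing";
        System.out.println(getPeople(input));
        System.out.println(getBody(input));
    }
}
```

The two strings returned here are what would go into the "people" and
"body" fields respectively. Joining names with a space means multi-word
names from different annotations run together, which is usually fine for
a tokenized field but would need a different separator for exact matching.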


On Mar 17, 2008, at 5:05 PM, lucene-seme1 s wrote:

> I already have the document preprocessed and the annotations (i.e.
> <Person>John</Person>) are already stored in an array with features  
> attached
> to some annotations (such as the root and lemma of the word). Can  
> you please
> elaborate some more on how to "index them as normally would" ?
>
> Regards,
> JK
>
>
> On Mon, Mar 17, 2008 at 4:33 PM, Grant Ingersoll <gsingers@apache.org>
> wrote:
>
>> I think there are a couple of ways you can approach this, although I
>> have never used GATE.
>>
>> If these annotations are marked in line in your content, then you can
>> either preprocess the files to have them separately and index as you
>> normally would, or you can use the relatively new TeeTokenFilter and
>> SinkTokenizer to extract them as you go for use in other fields.  I
>> have done this successfully for some apps that I have worked on and I
>> think it works quite nicely and beats preprocessing IMO.  Essentially,
>> you set up a TeeTokenFilter that recognizes your Person and then set
>> that token aside in the Sink.  Then, when you construct the Person
>> field, you use the SinkTokenizer.
>>
>> HTH,
>> Grant
>>
>> On Mar 17, 2008, at 8:54 AM, lucene-seme1 s wrote:
>>
>>> Hello,
>>>
>>> I am a newbie here and still experimenting with Lucene. I have
>>> annotations
>>> and features generated by GATE for many documents and would like to
>>> index
>>> the original content of the documents in addition to the generated
>>> annotations. The annotations are in the form of [<Person> John </
>>> Person>
>>> loves fishing]. I would like to be able to search using the Person
>>> attribute.
>>>
>>> Any hint or suggestion is highly appreciated
>>>
>>> regards,
>>> JK
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucenebootcamp.com
>> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
