lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Lucene 4 - POS and Syntactic Tagging
Date Tue, 10 Apr 2012 07:22:00 GMT
Hi,

A simple approach is to make the token "type" part of the term text, for
example "term#word". This does not hurt your search, because the type is
appended on both the indexing and the query side (the Analyzer simply
appends the type to the term text in both cases). Of course you can have a
second field without that additional information if you also want to search
without it.
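
As a rough illustration of why this works, here is a toy sketch in plain
Java (no Lucene involved). The class and method names are made up for this
example. Because the same "term#type" string is built when adding a
document and when searching, a lookup for "dog" as a verb only finds
documents where "dog" was tagged as a verb.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a hypothetical in-memory index keyed by "term#type".
class TypedTermDemo {
    // Toy inverted index: typed term -> ids of documents containing it.
    static final Map<String, Set<Integer>> INDEX = new HashMap<>();

    // Builds the combined term text, e.g. "dog" + "verb" -> "dog#verb".
    static String typedTerm(String term, String type) {
        return term + "#" + type;
    }

    // "Index side": store the document id under the typed term.
    static void add(int docId, String term, String type) {
        INDEX.computeIfAbsent(typedTerm(term, type), k -> new TreeSet<>()).add(docId);
    }

    // "Query side": build the typed term the same way and look it up.
    static Set<Integer> search(String term, String type) {
        return INDEX.getOrDefault(typedTerm(term, type), new TreeSet<>());
    }

    public static void main(String[] args) {
        add(1, "dog", "noun"); // e.g. "the dog barked"
        add(2, "dog", "verb"); // e.g. "troubles dog him"
        System.out.println(search("dog", "verb")); // prints [2]
    }
}
```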

Appending the term type to the term can be done with a TokenFilter that
calls termAttribute.append("#").append(typeAttribute.type()). Be sure to
use this analyzer on both the query and the indexing side, possibly with
PerFieldAnalyzerWrapper to limit it to specific fields.
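
A minimal sketch of such a filter could look like the following. It assumes
the Lucene core jar is on the classpath. The class name
TypeAppendingFilter is made up for this example, and the "#" separator is
the one suggested above. Note that TypeAttribute exposes the type through
its type() method.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * Hypothetical filter: rewrites each term to "term#type", so a token "dog"
 * with type "verb" is emitted as "dog#verb".
 */
final class TypeAppendingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  TypeAppendingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Pull the next token from the wrapped stream; stop when it is exhausted.
    if (!input.incrementToken()) {
      return false;
    }
    // Append the separator and the token type to the term text in place.
    termAtt.append('#').append(typeAtt.type());
    return true;
  }
}
```

The filter only helps if an earlier stage in the chain (your tagger) has
already set a meaningful type on each token; otherwise every term ends up
with the default type, "word".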

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Mark McGuire [mailto:mmcguir3@hawk.iit.edu]
> Sent: Wednesday, March 14, 2012 5:38 PM
> To: java-user@lucene.apache.org
> Subject: Lucene 4 - POS and Syntactic Tagging
> 
> I'm working on a project where I need to tag both the part of speech and
> other syntactic information on tokens so that this information is
> searchable.  I have read the threads on the mailing list regarding part of
> speech tagging here
> <http://mail-archives.apache.org/mod_mbox/lucene-java-user/201105.mbox/%3CBANLkTimwqcQ_GF2pxE8Hyc_R75NcWDRWbQ@mail.gmail.com%3E>
> and the many responses to similar questions.  To me, inserting 0 increment
> tokens seems rather clunky, especially when TypeAttributes appear to be
> what one would want to use.  Does Lucene do anything extra when the Type
> is set to or not set to its default, "word"?  Is it possible to write a
> search that uses multiple attributes from TokenAttributes (ie a search
> that searches for CharTermAttribute "dog" followed by a TypeAttribute of
> verb)?
>
> Also if I were to use 0 increment tokens for tagging, would data like
> document length or sumTotalTermFreq be different from a document indexed
> without these tags?  How would I counteract these differences if any occur?
> 
> Thanks,
> Mark McGuire

