lucene-java-user mailing list archives

From Mark McGuire <mmcgu...@hawk.iit.edu>
Subject Lucene 4 - POS and Syntactic Tagging
Date Wed, 14 Mar 2012 16:37:56 GMT
I'm working on a project where I need to tag tokens with their part of 
speech and other syntactic information so that this information is 
searchable.  I have read the threads on the mailing list regarding 
part-of-speech tagging here 
<http://mail-archives.apache.org/mod_mbox/lucene-java-user/201105.mbox/%3CBANLkTimwqcQ_GF2pxE8Hyc_R75NcWDRWbQ@mail.gmail.com%3E>
and the many responses to similar questions.  To me, inserting tokens 
with a position increment of 0 seems rather clunky, especially when 
TypeAttribute appears to be what one would want to use.  Does Lucene 
treat a token any differently depending on whether its type is left at 
the default, "word"?  Is it possible to write a query that combines 
multiple token attributes (i.e., one that matches a CharTermAttribute 
of "dog" followed by a token whose TypeAttribute is "verb")?

Also, if I were to use zero-increment tokens for tagging, would 
statistics such as document length or sumTotalTermFreq differ from those 
of a document indexed without these tags?  If so, how would I counteract 
the differences?
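
To make the concern concrete, here is a minimal sketch of what stacking 
tags as zero-increment tokens does to positions and token counts.  It is 
a toy model in plain Java, not Lucene's API: the Tok type, the "_pos_" 
tag prefix, and the helper names are all hypothetical, and the position 
convention (start at -1, add each increment) is an assumption of the 
sketch.  Whether Lucene's own length statistics count the stacked tokens 
is exactly the open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ZeroIncrementSketch {
    /** Toy token: term text plus position increment (hypothetical, not a Lucene class). */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Emits each word (increment 1) followed by its tag stacked on the same position (increment 0). */
    static List<Tok> tagStream(String[] words, String[] tags) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Tok(words[i], 1));
            out.add(new Tok("_pos_" + tags[i], 0)); // "_pos_" prefix is an illustrative convention
        }
        return out;
    }

    /** Absolute positions as a running sum of increments, starting at -1 (sketch assumption). */
    static int[] positions(List<Tok> toks) {
        int[] pos = new int[toks.size()];
        int current = -1;
        for (int i = 0; i < toks.size(); i++) {
            current += toks.get(i).posInc;
            pos[i] = current;
        }
        return pos;
    }

    /** Raw token count: every emitted token, including the stacked tags. */
    static int tokenCount(List<Tok> toks) {
        return toks.size();
    }

    /** Count of tokens that advance the position, i.e. excluding zero-increment tokens. */
    static int nonOverlappingCount(List<Tok> toks) {
        int n = 0;
        for (Tok t : toks) {
            if (t.posInc > 0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<Tok> toks = tagStream(new String[] {"the", "dog", "runs"},
                                   new String[] {"DT", "NN", "VBZ"});
        System.out.println("positions:       " + Arrays.toString(positions(toks)));
        System.out.println("all tokens:      " + tokenCount(toks));
        System.out.println("non-overlapping: " + nonOverlappingCount(toks));
    }
}
```

In this toy model the tagged stream carries twice as many tokens as the 
untagged one while occupying the same positions, which is why any 
statistic based on a raw token count would be inflated by the tags.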

Thanks,
Mark McGuire
