lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sergiu gordea <>
Subject Re: index phrases
Date Sun, 02 Jan 2005 11:06:13 GMT

 Ce mai faci?
Cred ca te-am mai intrebat ce lucrezi,  dar am uitat.
In viitorul apropiat s-ar putea sa lucrez si eu putin in directia text 
si bineinteles ca am sa refolosesc ce e implementat in lucene.

 S-ar putea ca munca noastra sa aiba ceva puncte comune .. te 
intereseaza sa
mai schimbam cate o opinie din cand in cand?

 Numai bine,


Roxana Angheluta wrote:

>>> Dear all,
>>> I am using Lucene for indexing documents.
>>> I would like to include phrases (of a certain maximum length given  
>>> as a parameter) in the index. I know this is non-standard for e.g.  
>>> searching, where a PhraseQuery can be built which makes use of the  
>>> terms positions. However, I am not interested in searching, but  
>>> rather in using the indexing terms for some statistics.
>>> What would be an efficient way to do this? Is it possible to build  
>>> phrases in a filter after tokenization?
>> Roxana- could you give us a concrete example of what you're wanting  
>> to do?
>> A TokenFilter could certainly be used to aggregate multiple terms  
>> into a single term that represents a phrase.  This would happen  
>> during the analysis process, which occurs along with tokenization.
> Hi Erik, thanks for the answer.
> I would like to index the following document:
> This is a sample document.
> something like:
> "this"
> "is"
> "a"
> "sample"
> "document"
> "this is"
> "is a"
> "a sample"
> "this a"
> "is sample"
> "a document"
> "sample document"
> "this is a"
> "is a sample"
> "a sample document"
> In this example the maximum length of an n-gram is 3 and the length of 
> the moving window accross text is also 3.
> In fact I would like a full analyzer to do the job, i.e. define a 
> strategy to filter out/clean spurious n-grams: e.g. remove n-grams 
> made out only/partially of stopwords, eliminate just stopwords from 
> the n-gram.
> Sebastian has kindly provided his code, which does the job.
> roxana
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message