lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
Date Tue, 27 Nov 2007 20:32:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545995 ]

Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

{quote}
Maybe I'm missing something?
{quote}

No, I don't think you are missing anything in that use case; it's just one example of how this would be used. And I am not totally sold on this approach, but mostly am :-)

I had originally considered your option, but didn't feel it was satisfactory for the case where you are extracting things like proper nouns, or where a filter is generating a category value. The more general case is one where not all of the tokens are needed (in fact, very few are). In those cases, you have to go back through the whole list of cached tokens to extract the ones you want. In fact, thinking some more on it, I am not sure my patch goes far enough, in the sense of: what if you want it to buffer in mid-stream?

For example, if you had:
StandardTokenizer
Proper Noun TF
LowerCaseTF
StopTF

where the Proper Noun TF is solely responsible for setting aside proper nouns as it comes across
them in the stream.
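
Roughly, such a filter might look like the sketch below (against the pre-2.9 Token-based API). The class name, the crude capitalization test, and the getBuffered() accessor are my own illustration here, not part of the attached patch:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Sketch only: passes every token through untouched, but copies the ones
 * that look like proper nouns into a side buffer for a later field to use.
 */
public class ProperNounBufferingFilter extends TokenFilter {

  private final List buffered = new ArrayList();

  public ProperNounBufferingFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && looksLikeProperNoun(t)) {
      // Copy the token so downstream filters (e.g. LowerCaseTF) can't alter the buffered one.
      buffered.add(new Token(t.termText(), t.startOffset(), t.endOffset(), t.type()));
    }
    return t;
  }

  /** Crude stand-in for real proper-noun detection. */
  private boolean looksLikeProperNoun(Token t) {
    String text = t.termText();
    return text.length() > 0 && Character.isUpperCase(text.charAt(0));
  }

  /** The tokens set aside during the main pass. */
  public List getBuffered() {
    return buffered;
  }
}
{code}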

As for the convoluted cross-field logic, I don't think it is all that convoluted.  There are
only two fields, and the implementing Analyzer takes care of all of it.  The only real requirement
on the application is that the fields be added in the correct order.
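
As a rough illustration of that two-field arrangement (the field names, the ProperNounBufferingFilter sketch above, and the inline replay stream are all illustrative assumptions, not the CollaboratingAnalyzer from the patch), the implementing Analyzer could look something like this:

{code:java}
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class TwoFieldAnalyzerSketch extends Analyzer {

  private ProperNounBufferingFilter properNouns;

  public TokenStream tokenStream(String fieldName, Reader reader) {
    if ("properNouns".equals(fieldName)) {
      // Second field: replay whatever the first pass set aside.  The reader is
      // ignored; this only works if the "body" field was analyzed first.
      final Iterator it = properNouns.getBuffered().iterator();
      return new TokenStream() {
        public Token next() {
          return it.hasNext() ? (Token) it.next() : null;
        }
      };
    }
    // First field: StandardTokenizer -> Proper Noun TF -> LowerCaseTF -> StopTF
    properNouns = new ProperNounBufferingFilter(new StandardTokenizer(reader));
    return new StopFilter(new LowerCaseFilter(properNouns),
                          StopAnalyzer.ENGLISH_STOP_WORDS);
  }
}
{code}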

I do agree somewhat about the pre-analysis approach, except for the case where there is a large
number of tokens in the source field, in which case you are holding them all in memory
(maxFieldLength mitigates this to some extent).  Also, it puts the onus on the application writer
to do it, when it could be pretty straightforward for Lucene to do it without its usual analysis
pipeline.

At any rate, separate from the CollaboratingAnalyzer, I do think the CachedTokenFilter is useful,
especially in supporting the pre-analysis approach.
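
For reference, a minimal sketch of a caching filter along those lines (again assuming the pre-2.9 Token API; this is not the CachedTokenFilter from the patch): the first pass records the tokens it hands out, and after reset() the same tokens are replayed, so a second field can reuse a single round of analysis.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CachingFilterSketch extends TokenFilter {

  private List cache;        // filled during the first pass
  private Iterator replay;   // non-null once reset() has been called

  public CachingFilterSketch(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (replay != null) {
      // Replay mode: serve tokens from the cache.
      return replay.hasNext() ? (Token) replay.next() : null;
    }
    if (cache == null) {
      cache = new ArrayList();
    }
    // First pass: pull from the wrapped stream and remember each token.
    // (The non-reuse next() returns a fresh Token, so caching the reference is safe.)
    Token t = input.next();
    if (t != null) {
      cache.add(t);
    }
    return t;
  }

  /** Switch from caching to replaying the cached tokens. */
  public void reset() throws IOException {
    if (cache != null) {
      replay = cache.iterator();
    }
  }
}
{code}

Of course, this keeps every token in memory, which is exactly the concern maxFieldLength only partially mitigates above.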



> New Analyzer for buffering tokens
> ---------------------------------
>
>                 Key: LUCENE-1058
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1058
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that could siphon off certain tokens and store them in a buffer to be used later in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but all the other analysis is the same, then you could save off the tokens to be output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how it plays with the new reuse API.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



