lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Facets, termvectors, relevancy and Multi word tokenizing
Date Fri, 28 Feb 2014 12:31:32 GMT
Hi,

Let's say you have accomplished what you want: you have a .txt file with the tokens to merge, like
"European" and "Parliament". What is your use case then? What is your high-level goal?

The MappingCharFilter approach is closer to your .txt approach than the PatternReplaceCharFilterFactory
approach is.

By the way, it could also be simulated with ShingleFilterFactory + KeepWordFilterFactory + TypeTokenFilterFactory.
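
To make the shingle idea concrete, here is a minimal plain-Python sketch of one reading of it: build every two-word shingle, then keep only the shingles that appear in a keep-words list. It does not use Solr; the function names and the contents of keep_words are hypothetical, and the step that would select tokens by type is not modelled.

```python
def shingles(tokens, size=2):
    """Return every run of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def keep_listed(candidates, keep_words):
    """Keep only the candidates that appear in keep_words (case-sensitive)."""
    return [c for c in candidates if c in keep_words]


# Hypothetical stand-in for the .txt file of phrases to keep.
keep_words = {"European Parliament"}

tokens = "The European Parliament voted today".split()
all_shingles = shingles(tokens)
kept = keep_listed(all_shingles, keep_words)
print(all_shingles)
print(kept)
```

Only the listed phrase survives the second step, so the unwanted shingles never reach the index.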

Maybe it can be done by firing phrase queries at query time (without interfering with the
index) on the client side? E.g. q="European Parliament"~0
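
As a rough sketch of the client-side option, the following Python builds the URL-encoded q parameter for such a phrase query. The helper name, host, and core name (collection1) are assumptions for illustration only; the snippet just builds a string and does not contact Solr.

```python
from urllib.parse import urlencode, parse_qs


def phrase_query_params(phrase, slop=0):
    """Build a URL-encoded q parameter for an exact phrase query with the given slop."""
    # Escape backslashes and double quotes so the phrase stays inside the quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return urlencode({"q": f'"{escaped}"~{slop}'})


params = phrase_query_params("European Parliament")

# Hypothetical host and core name; adjust to the actual deployment.
url = "http://localhost:8983/solr/collection1/select?" + params
print(url)

# Decoding the parameters gives back the original query string.
decoded = parse_qs(params)["q"][0]
print(decoded)
```

The index stays untouched with this approach; the phrase handling lives entirely in the client that builds the query.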




On Friday, February 28, 2014 11:55 AM, epnRui <rui_bandarra@hotmail.com> wrote:
Hi Ahmet!!

I went ahead and did something I thought was not a clean solution, and
then when I read your post I found we had thought of the same solution,
including the European_Parliament with the _  :)

So I guess there is no way to do this more cleanly, except maybe by
implementing my own Tokenizer and Filters, but I honestly couldn't find a
tutorial for implementing a customized Solr Tokenizer. If I end up needing to
do it I will write a tutorial.

So for now I'm using PatternReplaceCharFilterFactory to replace "European
Parliament" with <MD5Hash>European_Parliament (initially I didn't use the
MD5 hash, only European_Parliament).

Then, after StandardTokenizerFactory has run, I replace it back into
"European Parliament". Well, I guess I just found a way to do a 2-word token
:)
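
The round trip described above can be sketched in plain Python, outside Solr: join the phrase with an underscore before tokenizing, split into word tokens, then restore the space afterwards. The phrase list and function names are hypothetical, the regex tokenizer is only a stand-in for the real tokenizer, and the MD5 hash prefix is left out for brevity.

```python
import re

# Hypothetical list of phrases that must survive tokenization as one token.
PHRASES = ["European Parliament"]


def protect(text, phrases):
    """Join each listed phrase with underscores so the tokenizer keeps it whole."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


def tokenize(text):
    """Stand-in tokenizer: runs of word characters (underscore included)."""
    return re.findall(r"\w+", text)


def restore(tokens, phrases):
    """Turn the underscore placeholders back into the original phrases."""
    placeholders = {p.replace(" ", "_"): p for p in phrases}
    return [placeholders.get(t, t) for t in tokens]


protected = protect("The European Parliament voted.", PHRASES)
tokens = restore(tokenize(protected), PHRASES)
print(tokens)
```

A plain underscore placeholder would also rewrite any text that already contains European_Parliament literally, which is presumably why a unique prefix such as the hash helps.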

I had seen ShingleFilterFactory, but the problem is that I don't need the
whole phrase turned into tokens of 2 words, and I understood that is what it does. Of
course I would need some filter that would handle a .txt with the tokens to
merge, like "European" and "Parliament".

I'm still having another problem now, but maybe I will find a solution after I
read the page you linked, which seems great. Solr is treating #European
as both #European and European, meaning it produces 2 facets for one token. I want it
to treat it only as #European. I ran the analyzer debugger in my Solr
admin console and I don't see how it can be doing that.
Would you know of a reason for this?

Thanks for your reply. The page you linked seems excellent and I'll read
it through.



--
View this message in context: http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4120361.html

Sent from the Solr - User mailing list archive at Nabble.com.

