lucene-solr-user mailing list archives

From Ben <...@autonomic.net>
Subject Re: Excluding characters from a wildcard query
Date Wed, 01 Jul 2009 11:21:11 GMT
I'm not quite sure I understand exactly what you mean.
The string I'm processing could have many tens of thousands of values... 
I hope you aren't implying I'd need to split it into many tens of 
thousands of "columns".

If I understand you correctly, you're saying that I should
leave whitespace between the individual parts of the string, pass
the string into a "multiValued" field and have SOLR internally treat
each "word" as an individual entity?

Thanks for your help with this...

Ben
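
A minimal sketch of the client-side preprocessing Uwe describes in the quoted reply below: split the string on commas before indexing and send each vector as its own value of a multivalued field. The field name "vector" is the one mentioned in the thread; the "id" field, the document id and the helper names are hypothetical. It only builds an XML update message with one repeated <field> element per value and does not talk to a Solr server.

```python
import xml.etree.ElementTree as ET


def split_vectors(raw):
    """Split the comma-separated string into individual vector values."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_add_xml(doc_id, raw):
    """Build an <add><doc>...</doc></add> message with one field per vector.

    "id" is an assumed unique-key field; "vector" is assumed to be
    declared multiValued in the schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for value in split_vectors(raw):
        # Repeating the same field name is how multiple values are sent.
        ET.SubElement(doc, "field", name="vector").text = value
    return ET.tostring(add, encoding="unicode")


print(build_add_xml("doc-1", "A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3"))
```

With this approach the number of values per document can be large without adding any "columns": the schema still has a single field, it just holds many values.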

Uwe Klosa wrote:
> To get the desired effect I described you have to do the split before you
> send the document to solr. I'm not aware of an analyzer that can split one
> field value into several field values. The analyzers and tokenizers do
> create tokens from field values in many different ways.
>
> As I see it you have to do some preprocessing yourself.
>
> Uwe
>
> 2009/7/1 Ben <ben@autonomic.net>
>
>   
>> Is there a way in the Schema to specify that the comma should be used to
>> split the values up? e.g. Can I specify my "vector" field as multivalue and
>> also specify some sort of tokeniser to automatically split on commas?
>>
>> Ben
>>
>>
>>
>> Uwe Klosa wrote:
>>
>>     
>>> You should split the strings at the comma yourself and store the values in
>>> a multivalued field. Then wildcard searches like A1_* are not a problem. I
>>> don't know much about facets, but if they work on multivalued fields, that
>>> should be no problem at all.
>>>
>>> Uwe
>>>
>>> 2009/7/1 Ben <ben@autonomic.net>
>>>
>>>
>>>
>>>       
>>>> Yes, I had done that... however, I'm beginning to see now that what I am
>>>> doing is called a "wildcard query", which goes via Lucene's query parser.
>>>> Lucene's query parser doesn't support the regexp idea of character
>>>> exclusion, i.e. I'm not trying to match "[", I'm trying to express "Match
>>>> as many characters as possible, which are not underscores" with [^_]*
>>>>
>>>> Perhaps I'm going about my whole problem in an ineffective way, but I'm
>>>> not sure how I can sensibly describe what I'm doing without it becoming a
>>>> long document.
>>>>
>>>> The only other approach I can think of is to change what I'm indexing but
>>>> I'm not sure how to achieve that.
>>>> I've tried explaining it once, and obviously failed, so I'll try again.
>>>>
>>>> I'm given a string containing many vectors (where each dimension is
>>>> separated by an underscore, and each vector is separated by a comma) e.g.
>>>>
>>>> A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3
>>>>
>>>> I want my facet query to tell me if, within one of the vectors within
>>>> that string, there is a match for dimensions I'm interested in. Of the
>>>> four dimensions in this example, I may choose to fix an arbitrary number
>>>> of them with values, and the rest with wildcards e.g. I might look for a
>>>> facet containing Ox_*_*_* so one of the vectors in the string must have
>>>> its first dimension matching "Ox" and I don't care about the rest.
>>>>
>>>> ***Is there a way to break down this string on the commas so that I can
>>>> apply a normal wildcard query and SOLR applies it to each vector
>>>> individually?***
>>>> That would solve all my problems, e.g.
>>>> The string is internally represented in lucene/solr as
>>>> A1_B1_C1_D1
>>>> A2_B2_C2_D2
>>>> A3_B3_C3_D3
>>>>
>>>> where it tries to match the wildcard query on each in turn?
>>>>
>>>> Thanks for your help, I'm deeply confused about this at the moment...
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>>         
>>>
>>>       
>>     
>
>   
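
The difference between matching the whole string and matching each vector in turn can be illustrated in plain Python. This is only an emulation of the semantics under discussion, using the standard library's fnmatch module, and not how Solr or Lucene evaluates a wildcard query; the function name is hypothetical. It assumes every vector has the same number of dimensions and that each value is kept as a single unsplit term.

```python
from fnmatch import fnmatchcase


def any_vector_matches(raw, pattern):
    """Return True if at least one comma-separated vector matches pattern."""
    values = [part.strip() for part in raw.split(",") if part.strip()]
    return any(fnmatchcase(value, pattern) for value in values)


raw = "A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3"

# Matched against the unsplit string, "*" can run across the commas, so a
# pattern mixing dimensions from different vectors matches by accident.
print(fnmatchcase(raw, "A1_*_C2_*"))

# Matched per vector, the same pattern has to fit inside a single vector.
print(any_vector_matches(raw, "A1_*_C2_*"))
print(any_vector_matches(raw, "A2_*_*_*"))
```

Once each vector is its own value with a fixed number of dimensions, a plain "*" behaves like the [^_]* that was wanted: a four-part pattern needs all three underscores of the value as separators, so no wildcard has room to swallow one.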

