lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How to index part numbers
Date Fri, 28 Jan 2011 13:51:48 GMT
I wonder if you can define the problem away? It sounds like
you have essentially random input here. That is, users can
type whatever they want, so whatever you do will be wrong
some of the time. Could you sidestep the problem with auto-complete
and prefix queries (essentially appending * to the user's input)?

That way, the user would see the exact text as it was entered in
the documents (A123-56 in your example) and could pick it directly.
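
To make the auto-complete idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: the sorted set is only a stand-in for an index's term dictionary, and the class name PartNumberSuggester and its methods are made up for illustration. In Lucene itself, the prefix lookup would typically be expressed as a PrefixQuery on the part-number field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper: suggests indexed part numbers for a typed prefix.
public class PartNumberSuggester {
    // Sorted set of normalized part numbers (stand-in for a term dictionary).
    private final TreeSet<String> terms = new TreeSet<>();

    // Assumed normalization: drop all whitespace and upper-case the rest.
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
    }

    public void add(String partNumber) {
        terms.add(normalize(partNumber));
    }

    // Returns up to 'limit' stored part numbers that start with the typed text.
    public List<String> suggest(String typed, int limit) {
        String prefix = normalize(typed);
        List<String> out = new ArrayList<>();
        // In sorted order, all terms sharing a prefix form one contiguous run
        // that begins at the first term >= prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || out.size() >= limit) {
                break;
            }
            out.add(term);
        }
        return out;
    }
}
```

With "A123-56", "a 123/4" and "B200" added, typing "a12" would offer the first two exactly as they were written (apart from the normalization), so nobody has to guess what the compressed notation means.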

This assumes there's some kind of GUI front end, so I may
be way off base....

You could still let them search free-form if they really wanted,
but you wouldn't then have to try to figure out what the user
meant when they added A123,5,7....

FWIW
Erick


On Fri, Jan 28, 2011 at 7:45 AM, Wulf Berschin <berschin@dosco.de> wrote:

> Hi Karolina,
>
> yes (of course!) We have an XML element for the part numbers, but up to now
> they are not all tagged, so we need regex matching as well...
>
> On 28.01.2011 13:31, Karolina Bernat wrote:
>
>> Hi Wulf,
>>
>> can I ask whether it is structured documentation (like XML or SGML) you're
>> dealing with? I ask because I also work with technical documentation and we
>> do exactly what you're asking for, but with XML data.
>>
>>
>> On Fri, Jan 28, 2011 at 1:05 PM, Wulf Berschin <berschin@dosco.de> wrote:
>>
>>> Hi,
>>>
>>> I'm poking in the dark and hope someone has some light...
>>>
>>> We need to retrieve part numbers from technical documentation. For now we
>>> have a (long) regular expression to find them in a string. The part numbers
>>> consist of letters, digits and (redundant) whitespace. Furthermore, authors
>>> often used a compressed notation for number ranges with dashes or slashes,
>>> like A123-56 or A123/4.
>>>
>>> When searching for part numbers, users should be able to enter specific
>>> numbers like A126 (then the text "A123-56" should be found too) or wildcard
>>> searches like "A12?" or "A*". This part number search is a separate feature,
>>> apart from the regular full-text search.
>>>
>>> As far as I can see, I have to
>>>
>>> - add an extra field for storing part numbers
>>>
>>> - create a Tokenizer which recognizes just the part numbers and skips all
>>> other text
>>>
>>> - create an Analyzer which expands ranges like A123-56 to A123, A124, ...,
>>> A156 and normalizes numbers by removing whitespace
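
The expansion and normalization step in that list can be sketched in plain Java, independent of any Lucene Analyzer or TokenFilter plumbing. Everything here is an assumption for illustration: the class name PartNumberRangeExpander, the pattern (letters followed by digits), the reading of a dash as an inclusive range and a slash as "this number or that one", and the cap on the number of generated terms. Real part numbers may well need different rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands compressed notations such as "A123-56".
public class PartNumberRangeExpander {
    // Letters, digits, then optionally '-' or '/' plus the replacement digits.
    private static final Pattern COMPRESSED =
        Pattern.compile("([A-Z]+)(\\d{1,9})(?:([-/])(\\d{1,9}))?");
    // Assumed safety cap so a typo cannot generate a huge number of terms.
    private static final int MAX_EXPANSION = 1000;

    public static List<String> expand(String raw) {
        // Normalize: drop whitespace, upper-case.
        String text = raw.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        Matcher m = COMPRESSED.matcher(text);
        String suffix = m.matches() ? m.group(4) : null;
        // Not a compressed notation (or an unusable one): keep the text as is.
        if (suffix == null || suffix.length() > m.group(2).length()) {
            out.add(text);
            return out;
        }
        String prefix = m.group(1);
        String first = m.group(2);
        // The suffix replaces the trailing digits: "123" + "56" gives "156".
        String last = first.substring(0, first.length() - suffix.length()) + suffix;
        if ("/".equals(m.group(3))) {
            // Assumption: a slash names two alternatives, not a range.
            out.add(prefix + first);
            out.add(prefix + last);
            return out;
        }
        int start = Integer.parseInt(first);
        int end = Integer.parseInt(last);
        if (end < start || end - start >= MAX_EXPANSION) {
            out.add(text);
            return out;
        }
        // Keep the original digit width so leading zeros survive.
        String format = "%0" + first.length() + "d";
        for (int n = start; n <= end; n++) {
            out.add(prefix + String.format(Locale.ROOT, format, n));
        }
        return out;
    }
}
```

Under these assumptions, expand("A123-56") yields A123 through A156, so a search for A126 finds the document. For the highlighting wish, a token filter built on this idea would presumably emit every expanded term with the offsets of the original "A123-56" text, which is a design choice to verify against the highlighter in use.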
>>>
>>> With this analyzer I hope to get the highlighting to work too (e.g.
>>> "A123-56" highlighted when "A126" was the search term).
>>>
>>> Is this the right way? What could I use as a starting point? (I found
>>> org.apache.lucene.analysis.miscellaneous.PatternAnalyzer, which does much
>>> more than I need...)
>>>
>>> Thanks for all hints!
>>>
>>> Wulf
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>
> --
>
> Best regards,
>
> Wulf Berschin
>
> --
>
> <!-- *****************************************************************
> * Wulf Berschin                            Telefon: +49 6221 1486 16 *
> * DOSCO Document Systems Consulting GmbH   Telefax: +49 6221 1486 19 *
> * Mannheimer Strasse 1                     E-Mail: berschin@dosco.de *
> * 69115 Heidelberg, Germany                http://www.dosco.de       *
> * Handelsregister: Heidelberg HRB 335122                             *
> * Geschäftsführung: Robert Erfle                                     *
> ****************************************************************** -->
>
