lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Questions about doing a full text search with numeric values
Date Sat, 06 Jul 2013 12:55:01 GMT
I don't know what internal search algorithm something like
Word uses, but it may be really simple.

So, if I type the following into a Word document
trains training trainer train
and search for "trainer", Word does not stop on
"trains"; apparently it hasn't applied stemming.

In the 000012345 case, it looks like Word simply scans
the input stream until the queried digits are seen next
to each other (as characters) and highlights that word.

So you can probably mimic the Word search
experience by just not tokenizing much: try
using WhitespaceTokenizerFactory, maybe
LowercaseFilterFactory, and maybe
something to remove punctuation in Solr/Lucene,
and it might then match users' expectations.
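To see what such a minimal chain does, here's a rough plain-Java sketch of
that behavior (this is an illustration, not the Lucene API; the class name
and regex are just for demonstration): split only on whitespace, lowercase,
and strip surrounding punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough simulation of a WhitespaceTokenizer + LowerCaseFilter chain
// plus punctuation stripping. Not the Lucene classes themselves.
public class WhitespaceLowercase {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // strip leading/trailing punctuation, keep interior chars intact
            String t = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                          .toLowerCase(Locale.ROOT);
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Trains trainer train."));
    }
}
```

Note that with this chain "000012345" survives as one token, which is why a
bare query for "12345" still won't match it without further tricks.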

But then you'll have users who are accustomed
to application X, Y or Z where the search
implementation is different. And _their_
expectations will be violated.

You can do a lot of things (the 000123456 case
can be handled by ngramming, for instance) to try
to get what your users expect.
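To illustrate the ngramming idea: if you index every character n-gram of a
token, a query that is a substring of that token will share grams with it.
A sketch in plain Java (in practice you'd reach for something like Lucene's
NGramTokenFilter rather than hand-rolling this):

```java
import java.util.ArrayList;
import java.util.List;

// Emit every substring of length n of a token -- what a character
// n-gram filter produces. "123456" is then among the 6-grams of
// "000000123456", so a query for it can match.
public class NGrams {
    public static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("000000123456", 6).contains("123456"));
    }
}
```

The trade-off is index size: ngramming multiplies the number of terms per
token, so apply it selectively.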

The root of this issue is that simple search algorithms
take the entire document text and scan it, so
decisions can be made on the fly. That doesn't
scale at all when you are looking at millions of
documents, but it can work on one doc at a time.
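For comparison, the whole Word-style "find" boils down to something like
this linear scan (illustrative only):

```java
// The Word-style "find": a case-insensitive substring scan over the
// whole text, with no tokenization at all. Fine for one open
// document; hopeless across millions of indexed ones.
public class LinearFind {
    public static boolean find(String document, String query) {
        return document.toLowerCase().contains(query.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(find("invoice total $118.30 due", "118"));
    }
}
```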

Hope that helps,
Erick


On Mon, Jul 1, 2013 at 4:53 PM, Todd Hunt <Todd.Hunt@nisc.coop> wrote:

> Erick,
>
> Sorry to be dense, but I'm not following this statement:
>
> " Pesky users. But it's a different problem, albeit one that we all
> struggle with. Just processing a single stream and asking "does this letter
> sequence occur" may not even tokenize the input, just find the character
> sequence."
>
> Are you stating that I *should* get each token that has been indexed to
> see if "this letter sequence occur(s)"  in it?  I'm thinking I'm missing
> something because that seems no different than using wildcards.  Or am I
> missing a subtle difference?
>
> Thank you.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Monday, July 01, 2013 5:23 AM
> To: java-user
> Subject: Re: Questions about doing a full text search with numeric values
>
> WordDelimiterFilter(Factory if you're experimenting with Solr as Jack
> suggests) will fix a number of your cases since it splits on case change
> and numeric/alpha changes. There are a bunch of ways to recombine things so
> be aware that it'll take some fiddling with the parameters. As Jack
> suggests, using Solr is a quick way to understand the effects of changing
> filters and you can move the filter construction back into straight Lucene
> quite easily.
>
> That won't deal with this example though: 000000123456.
> You could consider some custom filter that essentially puts in a synonym
> of 123456 as well as this token.
>
> bq: The expectation from our users is that if they can open the document
> in its default application (Word, Adobe, Notepad,
> etc.) and perform a "find" within that application and find the text, then
> our application based on Lucene should be able to find the same text.
>
> Pesky users. But it's a different problem, albeit one that we all struggle
> with. Just processing a single stream and asking "does this letter sequence
> occur" may not even tokenize the input, just find the character sequence.
>
> And you really don't want to put wildcards in front and back for reasons
> beyond its being a hack; it'll be very expensive...
>
> Best
> Erick
>
>
> On Thu, Jun 27, 2013 at 1:47 PM, Jack Krupansky <jack@basetechnology.com
> >wrote:
>
> > Do continue to experiment with Solr as a "testbed" - all of the
> > analysis filters used by Solr are... part of Lucene, so once you
> > figure things out in Solr (using the Solr Admin UI analysis page), you
> > can mechanically translate to raw Lucene API calls.
> >
> > Look at the standard tokenizer; it should do a better job with
> > punctuation.
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Todd Hunt
> > Sent: Thursday, June 27, 2013 1:14 PM
> > To: java-user@lucene.apache.org
> > Subject: Questions about doing a full text search with numeric values
> >
> >
> > I am working on an application that is using Tika to index text based
> > documents and store the text results in Lucene.  These documents can
> > range anywhere from 1 page to thousands of pages.
> >
> > We are currently using Lucene 3.0.3.  I am currently using the
> > StandardAnalyzer to index and search for the text that is contained in
> > one Lucene document field.
> >
> > For strictly alpha based, English words, the searches return the
> > results as expected.  The problem has to do with searching for numeric
> > values in the indexed documents.  So examples of text in the documents
> > that cannot be found unless wild cards are used are:
> >
> > - 1-800-costumes.com
> >   - 800 does not find the text above
> >
> > - $118.30
> >   - 118 does not find the text above
> >
> > - 3tigers
> >   - 3 does not find the text above
> >
> > - 000000123456
> >   - 123456 does not find the text above
> >
> > - 123,abc,foo,bar,456
> >   - This is in a CSV file
> >   - Neither 123 nor 456 finds the text above
> >   - I realize that it has to do with the text only being separated by
> > commas, so it is treated as one token, but I think the issue is no
> > different from the others
> >
> > The expectation from our users is that if they can open the document
> > in its default application (Word, Adobe, Notepad, etc.) and perform a
> "find"
> > within that application and find the text, then our application based
> > on Lucene should be able to find the same text.
> >
> > It is not reasonable for us to request that our users surround their
> > search with wildcards.  Also, it seems like a kludge to
> > programmatically put wild cards around any numeric values the user may
> enter for searching.
> >
> > Is there some type of numeric parser or filter that would help me out
> > with these scenarios?
> >
> > I've looked at Solr and we already have strong foundation of code
> > utilizing Spring, Hibernate, and Lucene.  Trying to integrate Solr
> > into our application would take too much refactoring and time that
> > isn't available for this release.
> >
> > Also, since these numeric values are embedded within the documents, I
> > don't think storing them as their own field would make sense since I
> > want to maintain the context of the numeric values within the document.
> >
> > Thank you.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
