lucene-java-user mailing list archives

From Shashi Kant <sk...@sloan.mit.edu>
Subject Re: Using Lucene for user query parsing
Date Mon, 09 Mar 2009 12:54:09 GMT
The BoW (bag-of-words) approach is simple and highly effective IMO. If you
want to get a bit fancy, you could also use a multi-field query (e.g.
MultiFieldQueryParser) in the combined index.

Another brute-force approach would be to hit all 3 indexes and see which
ones come back with the highest score(s).
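
To make the two ideas concrete, here is a rough sketch in plain Python. It is not Lucene code and does not use Lucene's scoring: the "score" is just the number of query tokens that appear in a field, and the records, field values and function names are all made up for illustration. It shows the combined bagowords field from Erick's suggestion, plus the brute-force variant that scores each field separately and compares the top scores.

```python
# Toy illustration only: plain-Python token overlap, not Lucene scoring.
# All records, field values and function names are hypothetical.

FIELDS = ("street", "business", "area")

def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def add_bagowords(record):
    """Return a copy of the record with a combined 'bagowords' field."""
    combined = " ".join(record[f] for f in FIELDS)
    return {**record, "bagowords": combined}

def overlap_score(query, field_text):
    """Count how many distinct query tokens appear in the field."""
    return len(set(tokens(query)) & set(tokens(field_text)))

def search_bagowords(query, records):
    """Rank records by overlap between the query and 'bagowords'."""
    scored = [(overlap_score(query, r["bagowords"]), r) for r in records]
    scored = [(s, r) for s, r in scored if s > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def best_per_field(query, records):
    """Brute force: score each field separately, keep the top score per field."""
    return {
        field: max(overlap_score(query, r[field]) for r in records)
        for field in FIELDS
    }

records = [add_bagowords(r) for r in [
    {"street": "woodward street", "business": "mc donalds",
     "area": "kingston precinct"},
    {"street": "woodward street", "business": "marks and spencers",
     "area": "kingston precinct"},
    {"street": "high road", "business": "mc donalds",
     "area": "riverside layout"},
]]

# Word order in the query does not matter for a bag-of-words match.
query = "kingston precinct mc donalds woodward street"
ranked = search_bagowords(query, records)
top_score, top_record = ranked[0]
per_field = best_per_field(query, records)
```

In a real index the overlap count would be replaced by the engine's relevance score, and misspellings or synonyms such as road/street would need analysis or fuzzy matching on top of this.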



On Mon, Mar 9, 2009 at 8:43 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> Sure, Lucene is suited. If....
>
> The central problem here isn't the search engine, IMO, it's
> figuring out what bits of the query are relevant to what
> parts of the data. That is, in some random string, what is
> the street, business name, address, etc.
>
> Lucene has nothing built in that I know of that'll help with
> that part. Once you *have* figured out what parts of the
> query relate to what fields in your index, the rest is easy.
> But you'll have to do the figuring out yourself.
>
> But you might try the bagowords I suggested before as
> a shortcut and see what kind of results you get. Sometimes
> simplistic solutions are "good enough", but that's always
> up to you to decide once you start seeing results.
>
> Best
> Erick
>
> On Mon, Mar 9, 2009 at 4:31 AM, Srinivas Bharghav
> <srini.bharghav@gmail.com> wrote:
>
> > Thanks for all the inputs guys.
> >
> > As Erick said let me elaborate the problem a bit.
> >
> > We are trying to develop a local search application. The user will be
> > able to locate businesses, localities and roads. We have data for all
> > three. We do not want to provide separate boxes for the user to enter
> > data, i.e. there is one common entry box (a la google :)) where the user
> > enters an address (or road name or area name), or all three. From the
> > user query we have to find the best possible match in our data. The data
> > has lots of numbers as well as names with initials and the like. The
> > user may enter the names with a space between the initials, or they
> > might run the initials together. From the user query we have no way to
> > figure out what is what, apart from the obvious cases: if something ends
> > with "road" then it is a road name, or if there is a "layout" in the
> > query then it is an area. Right now we have our own custom framework. I
> > am trying to figure out whether Lucene is suited for this kind of
> > application.
> >
> > Once again thanks for all the inputs.
> >
> > On Fri, Mar 6, 2009 at 7:15 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> > > Whatever you do will be wrong <G>. What you're saying is
> > > that you have structured data that the user wants to search
> > > in an unstructured way, and you want to try to create a
> > > system that intuits what the user meant. Good luck <G>.
> > >
> > > Can you back up a bit and talk about the problem you're
> > > trying to solve? If, for instance, you're trying to find the
> > > best match for a particular business, one approach would
> > > be to create one index where each business had
> > >
> > > street
> > > business
> > > area
> > > bagowords
> > >
> > > where the field bagowords contained a copy of the data
> > > from the other three fields, then search bagowords
> > > for your query terms. It sounds simplistic, but it might be
> > > surprisingly good.
> > >
> > > And if this is out in left field, a higher level statement
> > > of the problem would help get better answers.
> > >
> > > Best
> > > Erick
> > >
> > > On Fri, Mar 6, 2009 at 1:25 AM, Srinivas Bharghav
> > > <srini.bharghav@gmail.com> wrote:
> > >
> > > > I am trying to evaluate whether Lucene is the right candidate for
> > > > the problem at hand.
> > > >
> > > > Say I have 3 indexes:
> > > >
> > > > Index 1 has street names.
> > > > Index 2 has business names.
> > > > Index 3 has area names.
> > > >
> > > > All these names can be single words or a combination of words like
> > > > "woodward street" or "marks and spencers street".
> > > >
> > > > Now the user enters a query saying "mc donalds woodward street
> > > > kingston precinct".
> > > >
> > > > I have to parse this query and come up with the best match possible.
> > > > The problem is, in the query I do not know which part is the business
> > > > name, area name or street name. Also the user may give the query in
> > > > any order; for example, he may give it as "kingston precinct mc
> > > > donalds woodward street". There might be spelling mistakes in the
> > > > query entered by the user. Also he might use "road" for "street" or
> > > > "lane" for "street" and such things. I know that Lucene is the right
> > > > candidate for the synonym and spelling mistakes part, but am a bit
> > > > hazy regarding the user query parsing part, as to in which index to
> > > > search what. Any help is greatly appreciated.
> > > >
> > > > Thanks,
> > > > Srini.
> > > >
> > >
> >
>
