lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: More IP/MAC indexing questions
Date Wed, 01 Aug 2007 18:20:44 GMT
Think of a custom analyzer class rather than a custom query parser. The
QueryParser uses your analyzer, so it all just "comes along".

Here's the approach I'd try first, off the top of my head....

Yes, break the IP (and MAC) addresses up into octets and index them
tokenized.
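
As a rough illustration of the splitting step only (plain Java, no Lucene
involved), something like the following would do. The class name
AddressTokens and the method segments are made up for this sketch, and the
choice of '.', ':' and '-' as separators is an assumption about how the
addresses are written. In a real index the same splitting would live inside
the custom analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: shows the token stream an
// address-aware analyzer would be expected to produce.
final class AddressTokens {

    // Split "192.168.1.100" or "00:1A:2B:3C:4D:5E" into one token per segment.
    // Assumes '.', ':' and '-' are the only separators in use.
    static List<String> segments(String address) {
        List<String> tokens = new ArrayList<>();
        for (String part : address.trim().split("[.:\\-]")) {
            if (!part.isEmpty()) {
                // Lower-case so hex digits in MAC addresses compare equal.
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segments("192.168.1.100"));
        System.out.println(segments("00:1A:2B:3C:4D:5E"));
    }
}
```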

Use a SpanNearQuery with a slop of 0 and specify true for ordering.
What that will do is require that the segments you specify must appear
in order with no gaps. You have to construct this yourself since there's
no support for SpanQueries in the QueryParser yet. This'll avoid having
to deal with Wildcards, which have their own issues (try searching for
the thread "I just don't understand wildcards at all" for an exposition from
"the guys" on this).
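
To make the "in order with no gaps" behaviour concrete, here is a small
plain-Java sketch of what such a match amounts to. It does not use Lucene;
the class OrderedSegmentMatch and its methods are invented for illustration,
and it assumes '.' and ':' are the segment separators. In Lucene itself the
equivalent would be one SpanTermQuery per segment, passed to
new SpanNearQuery(clauses, 0, true).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of slop-0, in-order matching over address segments.
final class OrderedSegmentMatch {

    // One token per segment; assumes '.' or ':' separates the segments.
    static List<String> segments(String address) {
        return Arrays.asList(address.split("[.:]"));
    }

    // True when the query segments occur in the indexed address as a
    // contiguous run, in the same order (no gaps, no reordering).
    static boolean matches(String indexedAddress, String partialAddress) {
        List<String> indexed = segments(indexedAddress);
        List<String> query = segments(partialAddress);
        return Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        String ip = "192.168.1.100";
        System.out.println(matches(ip, "192.168")); // leading segments
        System.out.println(matches(ip, "1.100"));   // trailing segments
        System.out.println(matches(ip, "192.1"));   // gap between segments
        System.out.println(matches(ip, "168.192")); // wrong order
    }
}
```

Note that this matches whole segments only: "10" would not match the
segment "100". That is the trade-off against wildcards, and it also covers
the "piece at the end of the address" case without a leading wildcard.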

Best
Erick

On 8/1/07, Joe Attardi <jattardi@gmail.com> wrote:
>
> Hi Erick,
>
> > First, consider using your own analyzer and/or breaking the IP addresses
> > up by substituting ' ' for '.' upon input.
>
> Do you mean breaking the IP up into one token for each segment, like
> ["192", "168", "1", "100"]?
>
>
>
> > But on to your question. Please post what you mean by
> > "a large number". 10,000? 1,000,000,000? we have no clue
> > from your posts so far...
>
> I apologize for the lack of details. A large part of the data will be
> wireless MAC addresses detected over the air, so it depends on the site.
> But I suppose, worst case, we're looking at thousands or tens of thousands.
> Comparatively speaking, then, I guess it's not such a large number
> compared to some of the other questions discussed on the list.
>
> > That said, efficiency is hugely overrated at this stage of your
> > design. I'd personally use whatever is easiest and run some
> > tests.
> >
> > Just index them as single (unbroken) tokens to start and search
> > your partial address with PrefixQuery.
>
> This is what I was thinking originally, too. Although there could be times
> where they are searching for a piece at the end of the address, which is
> why my original post had me building a WildcardQuery.
>
> The system will be searching log messages, too, and for that I'll use the
> more normal StandardAnalyzer/QueryParser approach.
>
> So what I am thinking of doing going forward is creating a custom query
> parser class that basically has special cases (IP addresses, MAC
> addresses) where the query must be more customized, and in the other cases
> falls through to the standard QueryParser class. Does this sound like a
> good idea?
>
> Thanks again for your continued help!
>
