Message-ID: <20090730103740.W81N9.23516.imail@eastrmwml43>
Date: Thu, 30 Jul 2009 10:37:40 -0400
From:
To: java-user@lucene.apache.org
Subject: Re: How to index IP addresses?
Cc: Matthew Hall
In-Reply-To: <4A71A808.1090808@informatics.jax.org>

Hi Matthew and Narcis,

I think that I found the (original) problem. It looks like all those other terms, which looked to me like the octets, weren't the octets :)... When I was doing the doc.add(), there were some other numbers (not IP addresses) in the String that I was passing to doc.add(...).

BTW, I did try Narcis' suggestion, changing to NOT_ANALYZED, before I found my problem. That looked like it made the entire string that I was passing to doc.add(...) into a single term, so when I searched, I didn't get any results.

So, I think the original ANALYZED is ok.

Sorry about that!!

Jim

---- Matthew Hall wrote:
> I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a
> term, and then also the octets.
>
> Are you adding the "contents" field into the index multiple times,
> possibly with separate analyzers?
>
> Could you possibly try a very simple test case?
>
> Just create an index with a single Lucene document, with that document's
> contents being "aa.bb.cc.dd", and then take a look at the index via Luke
> again.
>
> When you look at the terms section (it's what comes up by default) you
> SHOULD see only "aa", "bb", "cc", and "dd" as the top (and thus the ONLY)
> terms in the index. This could vary depending on your analyzer, as
> some will show an index containing only a single term "aa.bb.cc.dd".
> What I would not expect is an index that would contain both.
>
> Furthermore, by making the field not analyzed you will now have a
> trickier time searching for it.
> You will need to use a keyword
> analyzer or something similar to search, which, if I'm understanding the
> spirit of your problem, isn't really something that you want to do.
>
> So, if you could run that test scenario that I've outlined for you, I
> think you should have a nice test bed to see what effect swapping to
> different analyzers will have on the data that you are trying to index.
> Then, after you have played with that a bit, you should be able to
> re-expand your corpus again, and see if the analyzer you have chosen
> continues to stand up.
>
> I.. had thought that StandardAnalyzer already kept IP addresses together
> as a single token, but maybe it's doing something... special and
> interesting, and thus you are seeing the behavior that you are describing.
>
> Matt
>
> ohaya@cox.net wrote:
> > Hi,
> >
> > Oh. Ok, thanks! I'll give that a try.
> >
> > Jim
> >
> >
> > ---- "Armasu wrote:
> >
> >> Keyword: Field.Index.NOT_ANALYZED
> >>
> >> -----Original Message-----
> >> From: ohaya@cox.net [mailto:ohaya@cox.net]
> >> Sent: Thursday, July 30, 2009 4:36 PM
> >> To: java-user@lucene.apache.org
> >> Subject: How to index IP addresses?
> >>
> >> Hi,
> >>
> >> I am trying to index information in some proprietary-formatted files.
> >>
> >> In particular, these files contain some IP addresses in dotted notation, e.g., aa.bb.cc.dd.
> >>
> >> For my initial test, I have a Document implementation, and after I extract what I need into a String named "Info", I do:
> >>
> >> doc.add(new Field("contents", Info, Field.Store.YES, Field.Index.ANALYZED));
> >>
> >> From looking at the resulting index using Luke, it appears that I am getting terms for the full IP address string (e.g., "aa.bb.cc.dd"), but I am also getting terms for each octet of each IP address string, e.g.:
> >>
> >> aa
> >> bb
> >> cc
> >> dd
> >>
> >> I'm still just getting started with Lucene, but from the research that I've done, it seems like Lucene is treating the "."
> >> in the dotted notation strings as "noise". Is that correct?
> >>
> >> If so, is there a way to get it not to do that?
> >>
> >> Thanks,
> >> Jim
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> > Amazon Development Center (Romania) S.R.L. registered office: 37 Lazar Street, floor 5, Iasi, Iasi County, Iasi 700049, Romania. Registered in Romania. Registration number J40/12967/2005.
>
> --
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mhall@informatics.jax.org
> (207) 288-6012
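
The outcome reported at the top of this thread (the extra terms were other numbers in the indexed string, not IP octets, and NOT_ANALYZED turned the whole string into one term) can be illustrated with a rough plain-Java sketch. It does not use Lucene at all: analyzedTerms and notAnalyzedTerms are hypothetical helpers, the whitespace-only tokenizer is an assumption standing in for whatever analyzer was really in use, and the sample string is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this is NOT Lucene's analysis chain.
class IpTermSketch {

    // Hypothetical stand-in for an ANALYZED field: split on whitespace and
    // lowercase, so a dotted string such as "10.1.2.3" stays one token.
    static List<String> analyzedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Hypothetical stand-in for a NOT_ANALYZED field: the whole input
    // string becomes a single term.
    static List<String> notAnalyzedTerms(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        // Made-up sample: one IP address plus unrelated numbers.
        String info = "host 10.1.2.3 port 8080 retries 3";

        // The stray "3" looks like an octet but comes from "retries 3".
        System.out.println(analyzedTerms(info));

        // A single term equal to the whole string, so an exact-term
        // lookup for "10.1.2.3" finds nothing.
        System.out.println(notAnalyzedTerms(info));
        System.out.println(notAnalyzedTerms(info).contains("10.1.2.3"));
    }
}
```

Under these assumptions, the "3" term in the first list is easy to mistake for an octet of 10.1.2.3, and the last line shows why a search against the not-analyzed field comes back empty.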