lucene-java-user mailing list archives

From: Steven A Rowe <sar...@syr.edu>
Subject: RE: can't find common words -- using Lucene 3.4.0
Date: Mon, 26 Mar 2012 14:58:58 GMT
Hi Ilya,

What analyzers are you using at index-time and query-time?

My guess is that you're using an analyzer that includes punctuation in the tokens it emits,
in which case your index will have things like "sentence." and "sentence?" in it, so querying
for "sentence" will not match.
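
To make that guess concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's API: the class name TokenizationSketch and its methods are hypothetical, and they only imitate two tokenization styles (whitespace-only splitting versus splitting on punctuation plus lowercasing). The sample strings are the "sentence" lines from the three docs quoted below. Whether this is what actually happens in the index depends on the analyzer in use, which the original message does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not Lucene code.
public class TokenizationSketch {

    // Imitates a whitespace-only tokenizer: punctuation and case are kept,
    // so "sentence." and "sentence?" become distinct terms.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Imitates a tokenizer that splits on anything that is not a letter or
    // digit and then lowercases each token.
    static List<String> normalizedTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // An exact term lookup, standing in for a single-term query.
    static boolean matches(List<String> indexedTokens, String queryTerm) {
        return indexedTokens.contains(queryTerm);
    }

    public static void main(String[] args) {
        String[] docs = {
            "Here is sentence.",   // from Doc 1
            "Is here sentence?",   // from Doc 2
            "Sentence is here."    // from Doc 3
        };
        for (String doc : docs) {
            System.out.println(doc
                + " | whitespace-only match: " + matches(whitespaceTokens(doc), "sentence")
                + " | normalized match: " + matches(normalizedTokens(doc), "sentence"));
        }
    }
}
```

In this sketch the whitespace-only tokens never equal the bare term "sentence", while the normalized tokens do in all three strings. Using the same analysis at index time and at query time is what keeps the two sides comparable.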

Luke can tell you what's in your index: <http://code.google.com/p/luke/>

Steve

-----Original Message-----
From: Ilya Zavorin [mailto:izavorin@caci.com] 
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0 

I am writing a Lucene-based indexing and search app and testing it using some simple docs and
queries. I have 3 simple docs, shown at the bottom of this email between pairs
of "==================="s, and about a dozen terms. One of them is "electricity". As you can
see, it appears in all three docs. However, when I search for it, I only get a hit in Doc
2 but not in Doc 1 or Doc 3.

Why is this happening? 

Another query term that appears in all three docs but is found in only some is "sentence". I have a bunch
of other queries whose terms appear in only one of the three docs, and these are all found correctly.


Is this an indication that I have either set parameters incorrectly when indexing or set up
the queries incorrectly (or both)?

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===================
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an NFL career.
(US Presswire)
Inside the Baltimore Ravens' locker room the air is alive. Players argue about a bean-bag
toss game they play after practices, then mock a teammate who has inexplicably decided to
do an interview naked. Music thumps. Giant men laugh, and their laughter rattles off cinder
block walls in the symphony of a football team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to leave. It's a
strange image, loaded with contrasts. He doesn't belong here, not with these men, many of
whom are almost 10 years younger than him. And yet he feels very much at home. He isn't the
star on this team, which is two wins from the Super Bowl. The bulk of the offense is carried
by Ray Rice, an effusive bowling ball of a man who in the spirit of running backs relishes
the chance to run the ball 25 times a game. Williams is an afterthought, a backup who has
carried the ball more than 12 times in only one game this season. Often he might have the
ball in his hands on only four or five plays, and this is fine with him. In fact he prefers
it. His body has absorbed enough beatings for one lifetime. Let someone else get the pain.

electricity


===================

Doc 2:
===================
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed every day by
countless thousands around the globe and yet I have never heard even one remotely legitimate
answer. How much wood would a woodchuck chuck if a woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a woodchuck could
chuck if a woodchuck could chuck wood. Next you'll be wanting to know why she sells seashells
by the seashore.

common term is electricity


===================

Doc 3:
===================
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the right to hold the
earliest presidential primary, fending off bigger states that claimed that the small New England
state was too white to represent the nation's diverse population. Sentence is here.
In its defense, New Hampshire jokingly brags that its voters won't pick a presidential candidate
until they've met at least three times face-to-face _ rather than seeing the person in television
ads or at large events typical of bigger states. New Hampshire voters expect to shake hands
with candidates at coffees that supporters host in their homes or at backyard barbecues.
That tradition paid off in 1976 for a little-known peanut farmer and former Georgia governor.
Jimmy Carter won in New Hampshire and went on to become president.

word Hampshire by itself

this state has electricity

This is a state in the United states of America. Here is one term: United America. And Here's
another one: States america. And here's yet another == UNITED STATES! Here we are dropping
the middle stopword: United States		 America. Finally, we get one word: united. Then the second
one: STates. Then the final one: America.

===================


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



