lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Is there a limit for a field size in Lucene 3.0.2
Date Thu, 21 Feb 2013 19:48:43 GMT
There's an overridable default limit of 10,000 tokens per field; that's the
first place I'd look. I forget just how to set it to a higher value.

Best
Erick.

P.S. Please don't hit reply to a message and change the title, but start an
e-mail fresh. See: http://people.apache.org/~hossman/#threadhijack
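
The effect of such a token cap can be illustrated without Lucene at all. The sketch below is a plain-Java simulation, not Lucene code: `FieldLengthCapDemo` and `indexTerms` are hypothetical names, and whitespace tokenization stands in for a real analyzer. It shows why a term that sits past the cap is not searchable even though the stored text (what Luke displays) is complete. In Lucene 3.0.x the cap itself is, as far as I recall, controlled through `IndexWriter.MaxFieldLength` (`LIMITED` is 10,000, `UNLIMITED` removes it) or `IndexWriter.setMaxFieldLength(int)`, and it takes effect at index time, so existing documents would need to be re-indexed.

```java
import java.util.HashSet;
import java.util.Set;

public class FieldLengthCapDemo {
    /**
     * Simulates a per-field token cap: only the first maxTokens
     * whitespace-separated tokens become searchable terms.
     */
    static Set<String> indexTerms(String text, int maxTokens) {
        Set<String> terms = new HashSet<>();
        int count = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;      // skip the empty token produced by blank input
            if (count++ >= maxTokens) break;    // everything past the cap is silently dropped
            terms.add(token.toLowerCase());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Build a synthetic "long page" of 50,000 distinct tokens.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 50000; i++) {
            page.append("word").append(i).append(' ');
        }
        Set<String> capped = indexTerms(page.toString(), 10000);
        Set<String> uncapped = indexTerms(page.toString(), Integer.MAX_VALUE);

        System.out.println("capped finds word9999:    " + capped.contains("word9999"));
        System.out.println("capped finds word10000:   " + capped.contains("word10000"));
        System.out.println("uncapped finds word49999: " + uncapped.contains("word49999"));
    }
}
```

With a cap of 10,000, only the first fifth of this synthetic page is searchable, which is the same kind of symptom described in the question below.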


On Thu, Feb 21, 2013 at 11:59 AM, Mark Wilson <mw8@sanger.ac.uk> wrote:

> I am having an issue with an old Search Application we are using.
>
> We have a search app (using Lucene 3.0.2) that queries an index generated by
> Nutch 1.3. One really long page (approx. 124 KB) is crawled and inserted
> into the index, but when I search for it (using a web-based application
> based on Lucene 3.0.2), only the top ~20% of the page content is coming
> back with results.
>
> If I open the index up using Luke-1.0.1, I can see all the contents of the
> field, but if I search for a term that I know is in there, and it's not in
> the top ~20% of the page, it comes back blank.
>
> So my question is: is there a size limit for a field in Lucene 3.0.2?
>
> Regards Mark
>
>
> On 21/02/2013 15:14, "Dyer, James" <James.Dyer@ingramcontent.com> wrote:
>
> > Samuel,
> >
> > Do you think you could write a failing unit test and open a JIRA issue?
> > Or, at the least, open a JIRA issue with all the details without a test?
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: Samuel García Martínez [mailto:samuelgmartinez@gmail.com]
> > Sent: Thursday, February 21, 2013 2:33 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: possible bug on Spellchecker
> > Importance: Low
> >
> > I'm using Solr 3.6, and DirectSpellChecker is available only in v4+.
> > Moreover, for "big" indexes I prefer using a sidekick index rather than
> > iterating over the term dictionary.
> >
> >
> > On Thu, Feb 21, 2013 at 8:19 AM, Jack Krupansky
> > <jack@basetechnology.com> wrote:
> >
> >> Any reason that you are not using the DirectSpellChecker?
> >>
> >> See:
> >> http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Samuel García Martínez
> >> Sent: Wednesday, February 20, 2013 3:34 PM
> >> To: java-user@lucene.apache.org
> >> Subject: possible bug on Spellchecker
> >>
> >>
> >> Hi all,
> >>
> >> While debugging the Solr spellchecker (IndexBasedSpellchecker, which
> >> delegates to the Lucene Spellchecker), I think I found a bug when the
> >> input is a 6-letter word:
> >>  - george
> >>  - anthem
> >>  - argued
> >>  - fluent
> >>
> >> Because of getMin() and getMax(), the gram sizes indexed for these terms
> >> are 3 and 4. So the fields would look something like this:
> >>  - for "*george*"
> >>
> >>     - start3: "geo"
> >>     - start4: "geor"
> >>     - end3: "rge"
> >>     - end4: "orge"
> >>     - 3: "geo", "eor", "org", "rge"
> >>     - 4: "geor", "eorg", "orge"
> >>  - for "*anthem*"
> >>
> >>     - start3: "ant"
> >>     - start4: "anth"
> >>     - end3: "hem"
> >>     - end4: "them"
> >>
> >> The problem shows up when the user swaps the 3rd and 4th characters,
> >> misspelling the word like this:
> >>  - geroge
> >>  - anhtem
> >>
> >> The queries generated for these terms are (boolean queries with SHOULD clauses):
> >> - for "*geroge*"
> >>
> >>  - start3: "ger"
> >>  - start4: "gero"
> >>  - end3: "oge"
> >>  - end4: "roge"
> >>  - 3: "ger", "ero", "rog", "oge"
> >>  - 4: "gero", "erog", "roge"
> >> - for "*anhtem*"
> >>
> >>  - start3: "anh"
> >>  - start4: "anht"
> >>  - end3: "tem"
> >>  - end4: "htem"
> >>  - 3: "anh", "nht", "hte", "tem"
> >>  - 4: "anht", "nhte", "htem"
> >>
> >> So, as you can see, this kind of misspelling never matches the suitable
> >> suggestions, even though the string distance score (a similarity, where
> >> 1.0 means identical) is 0.95555556.
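
The 0.95555556 figure above is consistent with a Jaro-Winkler similarity for both word pairs. That the poster's setup actually used Jaro-Winkler is an assumption; the sketch below is a textbook formulation (prefix scale 0.1, prefix capped at 4 characters, boost applied only when the Jaro score is at least 0.7), not Lucene's own implementation, and `JaroWinklerDemo` is a hypothetical name.

```java
public class JaroWinklerDemo {
    /** Plain Jaro similarity in the range [0, 1]. */
    static double jaro(String s, String t) {
        int n = s.length(), m = t.length();
        if (n == 0 && m == 0) return 1.0;
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(n, m) / 2 - 1);
        boolean[] sMatched = new boolean[n];
        boolean[] tMatched = new boolean[m];
        int matches = 0;
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(m - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < n; i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double transpositions = outOfOrder / 2.0;
        return (matches / (double) n
                + matches / (double) m
                + (matches - transpositions) / matches) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters. */
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        if (j < 0.7) return j;
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        // Both pairs score about 0.9556 under this formulation.
        System.out.println(jaroWinkler("george", "geroge"));
        System.out.println(jaroWinkler("anthem", "anhtem"));
    }
}
```

Each pair has six matching characters, one transposition, and a two-character common prefix, which is why the two scores coincide.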
> >>
> >> I think getMin(int l) and getMax(int l) should return 2 and 3,
> >> respectively, for l==6. Debugging other values, I did not find any
> >> problem with any kind of misspelling.
> >>
> >> Any thoughts about this?
> >>
> >> --
> >> Un saludo,
> >> Samuel García
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>
