lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: "Natural sorting" of documents in a Lucene index - possible?
Date Wed, 18 Aug 2010 14:37:24 GMT
> But - to come back to my original question... is there any way to have a
> "natural order" of documents other than the DocId in Lucene?

No.


--
Ian.


On Wed, Aug 18, 2010 at 3:21 PM, Michel Nadeau <akaris@gmail.com> wrote:
> Cool, so I'll try these things -
>
> * Replace timestamps with YYYYMMDD - will minimize the unique term count;
> * Use NumericFields for dates and numbers - will remove all string sorting.
> Thanks guys!
>
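
A minimal sketch of why truncating timestamps to YYYYMMDD shrinks the unique term count. It uses only the JDK, not Lucene; the names DayTruncation, toDayKey and countDistinctDayKeys are made up for illustration, and UTC is an assumed time zone.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

class DayTruncation {
    // Assumption: dates are interpreted in UTC.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    /** Reduces a second-resolution timestamp to a day-resolution key. */
    static String toDayKey(long epochSeconds) {
        return DAY.format(Instant.ofEpochSecond(epochSeconds));
    }

    /** Counts the distinct day keys among count timestamps spaced stepSeconds apart. */
    static int countDistinctDayKeys(long startEpochSeconds, int count, long stepSeconds) {
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < count; i++) {
            keys.add(toDayKey(startEpochSeconds + i * stepSeconds));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        long start = 1282089600L; // 2010-08-18T00:00:00Z
        int n = 100_000;          // one timestamp every 7 seconds, all distinct
        System.out.println("distinct second-resolution values: " + n);
        System.out.println("distinct day keys: " + countDistinctDayKeys(start, n, 7));
    }
}
```

With these inputs the 100,000 distinct timestamps collapse to 9 day keys, which is the kind of reduction in unique terms being discussed.
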
> --
>
> But - to come back to my original question... is there any way to have a
> "natural order" of documents other than the DocId in Lucene? For example, is
> there any way to have an index automatically sorted on a specific field,
> like:
>
> DocId     Count     Data
> -------------------------------------
>     5         1     First test
>     1         3     Otter
>     8         4     Test
>     2         8     Aloha
>    10        11     Zulu
>     9        17     Bingo
>     3        46     Alpha test
>     6       112     Tango
>     4       120     Charlie test
>     7       200     Kiwi
>
> Notice that DocId and Data are in random order, but Count is sorted. That
> would be the 'natural order' in the index, and searching for 'test' would
> return (in that order):
>
> DocId     Count     Data
> -------------------------------------
>     5         1     First test
>     3        46     Alpha test
>     4       120     Charlie test
>
> Already sorted on the Count.
>
> Thanks!
>
> - Mike
> akaris@gmail.com
>
>
> On Tue, Aug 17, 2010 at 4:08 PM, Ian Lea <ian.lea@gmail.com> wrote:
>
>> Using NumericField for dates and other numbers is likely to help a
>> lot, and removes padding problems.  I'd try that first, or just sort
>> the top n hits yourself.
>>
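
One way to read "sort the top n hits yourself": take the n most relevant hits from the search, then reorder only those in application code. The sketch below is plain Java rather than Lucene API; the Hit class and sortTopHitsByCount are hypothetical names, and the sample rows are the three 'test' matches from the example table in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortTopHits {
    /** Hypothetical holder for the stored fields of one search hit. */
    static final class Hit {
        final int docId;
        final int count;
        final String data;

        Hit(int docId, int count, String data) {
            this.docId = docId;
            this.count = count;
            this.data = data;
        }
    }

    /** Keeps at most topN hits (input is assumed to be in relevance order) and sorts them by count. */
    static List<Hit> sortTopHitsByCount(List<Hit> hitsByRelevance, int topN) {
        int n = Math.min(topN, hitsByRelevance.size());
        List<Hit> top = new ArrayList<>(hitsByRelevance.subList(0, n));
        top.sort(Comparator.comparingInt((Hit h) -> h.count));
        return top;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(3, 46, "Alpha test"),
                new Hit(4, 120, "Charlie test"),
                new Hit(5, 1, "First test"));
        for (Hit h : sortTopHitsByCount(hits, 250)) {
            System.out.println(h.docId + "\t" + h.count + "\t" + h.data);
        }
    }
}
```

The cost of this sort depends on n (250 in the proposal), not on how many documents matched the query.
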
>>
>> --
>> Ian.
>>
>>
>> On Tue, Aug 17, 2010 at 8:46 PM, Michel Nadeau <akaris@gmail.com> wrote:
>> > I could at least drop hours/mins/sec, we don't need them, so my timestamp
>> > could become 'YYYYMMDD', that would cut the number of unique terms at
>> > least for dates.
>> >
>> > What about my other question about numbers: *"We do pad our numbers with
>> > zeros though (for example: 10 becomes 00000010, etc.) because we had
>> > trouble with sorting (100 was smaller than 2); is that considered as
>> > "string sorting"? This might explain a part of the problem."*? Thanks.
>> >
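
On the padding question: a zero-padded number is still a string and is still compared character by character; the padding only makes that string order agree with numeric order. A small sketch in plain Java (no Lucene involved; PaddedSort and its methods are illustrative names, the 8-digit width is taken from the 00000010 example, and values are assumed non-negative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PaddedSort {
    /** Returns a copy of the values in plain lexicographic (string) order. */
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    /** Left-pads a non-negative number with zeros to 8 digits. */
    static String pad(int value) {
        return String.format(Locale.ROOT, "%08d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "100" sorts before "2" because '1' < '2'.
        System.out.println(sortAsStrings(Arrays.asList("2", "100", "10")));
        // Padded: string order now matches numeric order.
        System.out.println(sortAsStrings(Arrays.asList(pad(2), pad(100), pad(10))));
    }
}
```
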
>> > - Mike
>> > akaris@gmail.com
>> >
>> >
>> > On Tue, Aug 17, 2010 at 3:40 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >
>> >> Hmmm, I glossed over your comment about sorting the top 250. There's
>> >> no reason that wouldn't work.
>> >>
>> >> Well, one way for, say, dates is to store separate fields: YYYY, MM, DD,
>> >> HH, MM, SS, MS. That gives you, say, 100 year terms + 12 months
>> >> + 31 days + ... for a very small total. You pay the price though by
>> >> having to change your queries and sorts to respect all 7 fields...
>> >>
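
The arithmetic behind "a very small total" can be sketched as follows. This is plain Java with illustrative names (SplitDateTerms, splitFieldTerms, singleFieldTerms); the 100-year span is the figure used above, and both results are upper bounds rather than measurements of any real index.

```java
class SplitDateTerms {
    /** Upper bound on distinct terms summed over the seven separate fields. */
    static int splitFieldTerms(int years) {
        return years   // YYYY
                + 12   // MM (month)
                + 31   // DD
                + 24   // HH
                + 60   // MM (minute)
                + 60   // SS
                + 1000; // MS
    }

    /** Upper bound on distinct values of one millisecond-resolution timestamp field. */
    static long singleFieldTerms(int years) {
        return years * 366L * 24 * 60 * 60 * 1000;
    }

    public static void main(String[] args) {
        System.out.println("split fields:  " + splitFieldTerms(100));
        System.out.println("single field:  " + singleFieldTerms(100));
    }
}
```

In practice the single-field count is capped by the number of documents, but with mostly unique timestamps it still grows with the index, while the split-field total stays fixed.
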
>> >> But I'd only really go there after seeing if other options don't work.
>> >>
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Tue, Aug 17, 2010 at 3:35 PM, Michel Nadeau <akaris@gmail.com> wrote:
>> >>
>> >> > Would our approach to limit the search to the top 250 documents (and
>> >> > then sort these 250 documents) work fine? Or is even 250 unique terms
>> >> > with a lot of users bad on memory when sorting?
>> >> >
>> >> > We didn't look at trie fields - I will do though, thanks for the tip!
>> >> >
>> >> > We do store the original 'Data' field (only the 'SearchableContent'
>> >> > field is analyzed, all other fields are not analyzed), the users mainly
>> >> > sort on numeric values; not a lot on string values (in fact I could
>> >> > completely drop the sort by string feature). We do pad our numbers with
>> >> > zeros though (for example: 10 becomes 00000010, etc.) because we had
>> >> > trouble with sorting (100 was smaller than 2); is that considered as
>> >> > "string sorting"? This might explain a part of the problem.
>> >> >
>> >> > Why/how would I reduce the count of unique terms?
>> >> >
>> >> >
>> >> > - Mike
>> >> > akaris@gmail.com
>> >> >
>> >> >
>> >> > On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >> >
>> >> > > If you have tens of millions of documents, almost all with unique
>> >> > > fields that you're sorting on, you'll chew through memory like
>> >> > > there's no tomorrow.
>> >> > >
>> >> > > Have you looked at trie fields? See:
>> >> > >
>> >> > > http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
>> >> > >
>> >> > > I'm a little concerned that the user can sort on Data. Any field used
>> >> > > for sorting should NOT be analyzed, so unless you are indexing "Data"
>> >> > > unanalyzed, that's a problem. And if you are sorting on strings unique
>> >> > > to each document, that's also a memory hog. Not to mention whether
>> >> > > capitalization counts.
>> >> > >
>> >> > > You might enumerate the terms in your index for each of the sortable
>> >> > > fields to figure out what the total number of unique terms in each is
>> >> > > and use that as a basis for reducing their count....
>> >> > >
>> >> > > HTH
>> >> > > Erick
>> >> > >
>> >> > > On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau <akaris@gmail.com> wrote:
>> >> > >
>> >> > > > Hi Erick,
>> >> > > >
>> >> > > > Here are some more details about our structure. First, here's an
>> >> > > > example of a document in our index:
>> >> > > >
>> >> > > >     PrimaryKey        = SJAsfsf353JHGada66GH6 (it's a hash)
>> >> > > >     DocType           = X
>> >> > > >     Data              = This is the data
>> >> > > >     SearchableContent = This is the data
>> >> > > >     DateCreated       = <timestamp>
>> >> > > >     DateModified      = <timestamp>
>> >> > > >     Counter1          = 17
>> >> > > >     Counter2          = 3
>> >> > > >     Average           = 0.17
>> >> > > >     Cost              = 200
>> >> > > >
>> >> > > > The users are able to sort on almost all fields: Data, DateCreated,
>> >> > > > DateModified, Counter1, Counter2, Average, Cost.
>> >> > > >
>> >> > > > When we search, we always search on the 'SearchableContent' field
>> >> > > > and we have at least one filter on the DocType (because we have
>> >> > > > many document types in the same index). So a common search that
>> >> > > > would find the document above is "data *AND DocType:X*" (we
>> >> > > > automatically add the "*AND DocType:X*" part using Lucene Filters).
>> >> > > >
>> >> > > > I would say that the number of unique terms in the field being
>> >> > > > sorted on is very big - for example timestamps, almost all unique,
>> >> > > > counters, average, cost, data... so if a query finds 10M results,
>> >> > > > it's almost 10M different values to sort. About cache and warm-up
>> >> > > > queries: we don't use warm-up queries -at all- because we have
>> >> > > > absolutely no idea of what users are going to search for (they can
>> >> > > > search for absolutely anything). About "returning 10M" documents:
>> >> > > > right, we don't actually return the 10M documents, we use
>> >> > > > pagination to return documents X to Y of the 10M (and the 10M was
>> >> > > > only an example, it can be anywhere between 1K and 100M results).
>> >> > > > The pagination usually works fine and fast, our problem is really
>> >> > > > sorting.
>> >> > > >
>> >> > > > Our "Lucene Reader" process has 2GB of RAM allowed, here's how I
>> >> > > > start it -
>> >> > > >
>> >> > > >     java -Xmx2048m -jar LuceneReader.jar
>> >> > > >
>> >> > > > The problem really seems to be a RAM problem, but I can't be 100%
>> >> > > > sure (any help about how to be sure is welcome).
>> >> > > >
>> >> > > > Our current idea of a solution would be to get a maximum of 250
>> >> > > > results (the most relevant ones; more results than that is totally
>> >> > > > useless in our system) so the sort should work fine on a small data
>> >> > > > set like that, but we want to make sure we're doing everything
>> >> > > > right before doing that so we don't run into the same problems
>> >> > > > again.
>> >> > > >
>> >> > > > Thank you very much; let me know if you need any more details!
>> >> > > >
>> >> > > > - Mike
>> >> > > > akaris@gmail.com
>> >> > > >
>> >> > > >
>> >> > > > On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >> > > >
>> >> > > > > Let's back up a minute. The number of matched records is not
>> >> > > > > important when sorting, what's important is the number of unique
>> >> > > > > terms in the field being sorted. Do you have any figures on that?
>> >> > > > > One very common sorting issue is sorting on very fine date time
>> >> > > > > resolutions, although your examples don't include that...
>> >> > > > >
>> >> > > > > Now, cache loading is an issue. The very first time you sort on a
>> >> > > > > field, all the values are read into a cache. Is it possible this
>> >> > > > > is the source of your problems? You can cure this with warmup
>> >> > > > > queries. The take-away is that measuring the response time for
>> >> > > > > the first sorted query is very misleading.
>> >> > > > >
>> >> > > > > Although if you're sorting on many, many, many email addresses,
>> >> > > > > that could be "interesting".
>> >> > > > >
>> >> > > > > The comment "returning 10,000,000 documents" is, I hope, a
>> >> > > > > misstatement. If you're trying to *return* 10M docs, sorting
>> >> > > > > is irrelevant compared to assembling that many documents. If
>> >> > > > > you're trying to return the first 100 of 10M documents, it
>> >> > > > > should work.
>> >> > > > >
>> >> > > > > Overall, we need more details on what you're sorting and what
>> >> > > > > you're trying to return as well as how you're measuring before
>> >> > > > > we can say much....
>> >> > > > >
>> >> > > > > Along with how much memory you're giving your JVM to work with,
>> >> > > > > what "exploding" means. Are you CPU bound? IO bound? Swapping?
>> >> > > > > You need to characterize what is going wrong before worrying
>> >> > > > > about solutions......
>> >> > > > >
>> >> > > > > Best
>> >> > > > > Erick
>> >> > > > >
>> >> > > > > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau <akaris@gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Hi,
>> >> > > > > >
>> >> > > > > > we are building an application using Lucene and we have HUGE
>> >> > > > > > data sets (our index contains millions and millions and
>> >> > > > > > millions of documents), which obviously cause us very serious
>> >> > > > > > problems when sorting. In fact, we disabled sorting completely
>> >> > > > > > because the servers were just exploding when trying to sort
>> >> > > > > > results in RAM. The users using the system can search for
>> >> > > > > > whatever they want, so we never know how many results will be
>> >> > > > > > returned - a search can return 10 documents (no problem with
>> >> > > > > > sorting) or 10,000,000 (huge sorting problems).
>> >> > > > > >
>> >> > > > > > I woke up this morning and had a flash: is it possible with
>> >> > > > > > Lucene to have a "natural sorting" of documents? For example,
>> >> > > > > > let's say I have 3 columns I want to be able to sort by: first
>> >> > > > > > name, last name, email; I would have 3 different indexes with
>> >> > > > > > the very same data but with a different primary key for
>> >> > > > > > sorting. I know it's far-fetched, and I have never seen
>> >> > > > > > anything like that since I started using Lucene, but we're
>> >> > > > > > just desperate... how do people manage to have huge data sets,
>> >> > > > > > a lot of users, and sort!?
>> >> > > > > >
>> >> > > > > > Thanks,
>> >> > > > > >
>> >> > > > > > Mike
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


