lucene-solr-user mailing list archives

From Joe Zhang <smartag...@gmail.com>
Subject Re: processing documents in solr
Date Mon, 29 Jul 2013 04:56:50 GMT
Basically, I was thinking about running a range query, as Shawn suggested,
on the tstamp field, but unfortunately that field is not indexed. Range
queries only work on indexed fields, right?
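For reference, making tstamp range-queryable would mean declaring it as indexed in schema.xml and reindexing; a sketch of what that field definition might look like (the type name is illustrative, depending on what the schema already defines):

```xml
<!-- schema.xml: the field must be indexed="true" for range queries to work -->
<field name="tstamp" type="date" indexed="true" stored="true"/>
```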


On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang <smartagent@gmail.com> wrote:

> I've been thinking about the tstamp solution in the past few days, but too
> bad, the field is available but not indexed...
>
> I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the
> counter value. If so, that would be equivalent to an autoincrement id. I'm
> indexing from Nutch, though, and don't know how to feed in such a counter...
>
>
> On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>
>> Why wouldn't a simple timestamp work for the ordering? Although
>> I guess "simple timestamp" isn't really simple if the time settings
>> change.
>>
>> So how about a simple counter field in your documents? Assuming
>> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter
>> desc.
>> Take the counter from the first document returned. Increment for
>> each doc for the life of the indexing run. Now you've got, for all intents
>> and purposes, an identity field albeit manually maintained.
>>
>> Then use your counter field as Shawn suggests for pulling all the
>> data out.
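Erick's scheme amounts to the following bookkeeping (a minimal sketch; the class and the starting value are illustrative stand-ins, not SolrJ API — the seed would come from the `q=*:*&sort=counter desc` query he describes):

```java
// Sketch of a manually maintained identity field, per Erick's suggestion.
// The constructor argument stands in for the highest counter value found
// in the index via q=*:*&sort=counter desc, rows=1.
public class CounterAssigner {
    private long next;

    public CounterAssigner(long maxCounterInIndex) {
        this.next = maxCounterInIndex + 1; // resume just past the last indexed doc
    }

    // Call once per document for the life of the indexing run,
    // storing the returned value in the document's counter field.
    public long nextCounter() {
        return next++;
    }

    public static void main(String[] args) {
        CounterAssigner assigner = new CounterAssigner(20000L);
        System.out.println(assigner.nextCounter()); // 20001
        System.out.println(assigner.nextCounter()); // 20002
    }
}
```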
>>
>> FWIW,
>> Erick
>>
>> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
>> <mcucchiara@apache.org> wrote:
>> > In both cases, for better performance, I'd first load just the IDs,
>> > and then, during processing, load each document.
>> > As for the incremental requirement, it should not be difficult to
>> > write a hash function which maps a non-numeric id to a numeric value.
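Maurizio's mapping can be sketched with a standard checksum (CRC32 here is just one possible choice; note that hashed values are stable but not ordered, so they mark a resume point rather than defining a range):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Map a non-numeric Solr document id (e.g. a URL from Nutch)
// to a stable numeric value.
public class IdHasher {
    public static long hash(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // deterministic: same id always yields the same value
    }

    public static void main(String[] args) {
        System.out.println(hash("http://example.com/doc/1"));
    }
}
```

Because the mapping is deterministic, the same id always hashes to the same number across runs, which is what the divide-and-conquer bookkeeping needs.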
>> >  On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartagent@gmail.com> wrote:
>> >
>> >> Dear list:
>> >>
>> >> I have an ever-growing solr repository, and I need to process every
>> single
>> >> document to extract statistics. What would be a reasonable process that
>> >> satisfies the following properties:
>> >>
>> >> - Exhaustive: I have to traverse every single document
>> >> - Incremental: in other words, it has to allow me to divide and
>> >> conquer: if I have processed the first 20k docs, next time I can
>> >> start with doc 20001.
>> >>
>> >> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
>> >> fact, given that the processing will take very long and the repository
>> >> keeps growing, it is not even clear that exhaustiveness is achieved.
>> >>
>> >> I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability
>> >> yet. But I guess the same issues would still hold even in a SolrCloud
>> >> environment, right, say within each shard?
>> >>
>> >> Any help would be greatly appreciated.
>> >>
>> >> Joe
>> >>
>>
>
>
