lucene-lucene-net-user mailing list archives

From Josh Handel <Josh.Han...@catapultsystems.com>
Subject RE: Best way to store arrays?
Date Thu, 03 Jun 2010 22:10:56 GMT
In some cases I may want just the array length, but the bigger need is for the actual array of values. And you're right that in the case of only wanting the count, I could store the array length.

But it's only a symptom of a larger problem: reading 1 field from 800,000 documents (out of about 50 million) is taking 5 minutes.

And the split has no impact on that (I took it out to see if it was my problem); it still took 5 minutes. In fact, skipping the split was slightly slower, though I imagine that's just due to other random load on my test box.
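
For the count case specifically, here is a rough sketch of how an indexed per-document count field (as suggested below) could be read in bulk instead of one stored-field load per hit. It assumes the Lucene.Net 2.9-era port exposes Java's FieldCache as FieldCache_Fields.DEFAULT, uses a hypothetical single-token NOT_ANALYZED "Publishers_Count" field, and reuses the searcher and gac names from the code further down the thread; it does not help when the full array of values is needed.

    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // One in-memory array lookup per hit instead of one IndexReader.Document()
    // call per hit. Only works for fields indexed as a single un-analyzed token
    // per document (e.g. a precomputed element count), not the delimited arrays.
    IndexReader reader = searcher.GetIndexReader();
    int[] counts = FieldCache_Fields.DEFAULT.GetInts(reader, "Publishers_Count");

    long total = 0;
    foreach (int docID in gac.DocIDs)
    {
        total += counts[docID];   // no per-document disk seek
    }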

Josh

-----Original Message-----
From: Robert Jordan [mailto:robertj@gmx.net] 
Sent: Thursday, June 03, 2010 5:05 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: Best way to store arrays?

If you only want the array length, why are you not storing it as a field along with the array?
This would save you millions of splits, hundreds of thousands of temporarily created arrays, etc.
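
A minimal sketch of that idea, assuming the Lucene 2.9-era Field API; the field names, delimiter handling, and helper shape are illustrative, not the actual schema in this thread:

    using System;
    using Lucene.Net.Documents;

    // Illustrative: store the element count next to the delimited array field so a
    // reader never has to load and split the array just to learn its length.
    // "Publishers" / "Publishers_Count" and the delimiter argument are example names.
    static void AddArrayFields(Document doc, int[] publishers, string delimiter)
    {
        string joined = string.Join(delimiter,
            Array.ConvertAll(publishers, p => p.ToString()));

        // The array itself, stored and left to the custom delimiter tokenizer at index time.
        doc.Add(new Field("Publishers", joined, Field.Store.YES, Field.Index.ANALYZED));

        // The length, as a single un-analyzed token: cheap to read per hit, and usable
        // with FieldCache because there is exactly one token per document.
        doc.Add(new Field("Publishers_Count", publishers.Length.ToString(),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
    }

At search time the count can then be read through a cheap FieldSelector, or loaded in bulk since it is one token per document.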

I believe this is what Digy meant by the XY problem: you're showing us problem Y, which has a trivial solution, while the real problem X is much harder, and we can't help with it because you haven't shown us X.

Robert

On 03.06.2010 22:31, Josh Handel wrote:
> Nope,
>     This is the best-choice option given the client's constraints on our implementation.
>
> Also I can narrow down the problem... The queries take about 1.3 seconds to get all the document IDs (in this case about 700-800 thousand).. it then takes (after moving the field lookup into the collector) about 5 minutes to pull the field for all of those records..
>
> So regardless of the architectural choice, is there any way to speed up reading a field, or does 5 minutes to read 800,000 fields from an index while they are being collected seem correct?
>
>
> I can live with my splitting magic.. It works, and the queries are finding the right documents..
>
> Also, this isn't user-facing; it's a process-to-process call, so pagination isn't really a useful option here.
>
> Josh
> -----Original Message-----
> From: Digy [mailto:digydigy@gmail.com]
> Sent: Thursday, June 03, 2010 3:14 PM
> To: lucene-net-user@lucene.apache.org
> Subject: RE: Best way to store arrays?
>
> Is this an XY-problem?
> http://www.perlmonks.org/index.pl?node_id=542341
>
> DIGY
>
>
> -----Original Message-----
> From: Josh Handel [mailto:Josh.Handel@catapultsystems.com]
> Sent: Thursday, June 03, 2010 8:24 PM
> To: lucene-net-user@lucene.apache.org
> Subject: Best way to store arrays?
>
> Guys,
>     I have the following scenario.. I have several arrays of data that I need stored in Lucene. Currently I store each array in its own field, using a custom tokenizer to split on my delimiter (a random Unicode character).
>
> Then, when people want to get a hit count on an arrayed field, I have to load each hit document one by one, then split and count the results.
>
> Is there a better way to store arrays? Right now the whole thing is MUCH slower than I expected.. taking tens of minutes on hits of 700 or 800 thousand records to then count the elements in the arrays.
>
> I already have a custom collector that gets everything and skips counting (it just gets the docID).. and below is how I am loading up the documents... (also, the initial query is taking tens of seconds; I've included a sample of it as well)..
>
> Document Structure
> Field: ProfileID - a string
> Field (Array): Publishers - a list of ints
> Field (Array): N_PublisherNames (repeated for each publisher in the Publishers field) - a list of strings
> Field (Array): DataPoints - a list of ints
>
> Example Query: (DataPoints:((3 OR 4) AND (54)) OR ((3 OR 4) AND (55) AND (100)) OR ((52))) AND (Publishers:1)
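
For context, a query string in that form can be fed through the stock QueryParser and straight into a collector; a rough sketch, assuming Lucene.Net 2.9-era APIs, with WhitespaceAnalyzer and GetAllCollector standing in for the custom analyzer and collector mentioned in this thread (presumably roughly what IndexHelper.RunQuery wraps):

    using Lucene.Net.Analysis;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // Parse the boolean query above and hand every matching doc ID to the collector.
    string q = "(DataPoints:((3 OR 4) AND (54)) OR ((3 OR 4) AND (55) AND (100)) OR ((52))) AND (Publishers:1)";

    QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29,
                                         "DataPoints", new WhitespaceAnalyzer());
    Query query = parser.Parse(q);

    GetAllCollector gac = new GetAllCollector();   // collects doc IDs, no scoring work
    searcher.Search(query, gac);                   // searcher: an open IndexSearcher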
>
> How I execute the query to get a count (for a field)
>
> m_Log.Debug("Searching for items");
> IndexSearcher searcher = IndexHelper.GetCurrentSearcher(index);
> GetAllCollector gac = IndexHelper.RunQuery<GetAllCollector>(index, query, searcher);
> int count = 0;
> m_Log.Debug("Opening found documents to count fields");
> foreach (int docID in gac.DocIDs)
> {
>     FieldSelector fs = new MapFieldSelector(new string[1] { field });
>     Document doc = searcher.GetIndexReader().Document(docID, fs);
>     foreach (Fieldable f in doc.GetFieldables(field))
>     {
>         if (f.IsStored())
>         {
>             string value = f.StringValue();
>             if (value.Contains(ListField.DELIMETER))
>             {
>                 count = count + value.Split(ListField.DELIMETER.ToCharArray()).Length;
>             }
>             else
>             {
>                 count++;
>             }
>         }
>     }
> }
> m_Log.Debug("Finished counting fields");
>
> Anyways, I am all ears on speeding this up :)
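
One way to take the per-hit Document() calls out of the picture, if the element count is indexed as its own NOT_ANALYZED field as suggested earlier in the thread, is to sum it inside the collector from the field cache. A rough sketch, assuming the Lucene.Net 2.9-era Collector and FieldCache_Fields APIs; the class and field handling are illustrative:

    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Sums a precomputed, single-token integer field while collecting, so no stored
    // fields are loaded and nothing is split after the search finishes.
    public class SumFieldCollector : Collector
    {
        private readonly string countField;
        private int[] counts;     // per-document values for the current segment
        public long Total;

        public SumFieldCollector(string countField)
        {
            this.countField = countField;
        }

        public override void SetNextReader(IndexReader reader, int docBase)
        {
            // One bulk load per segment; Collect() then receives segment-relative doc IDs.
            counts = FieldCache_Fields.DEFAULT.GetInts(reader, countField);
        }

        public override void Collect(int doc)
        {
            Total += counts[doc];
        }

        public override void SetScorer(Scorer scorer)
        {
            // scores are not needed for counting
        }

        public override bool AcceptsDocsOutOfOrder()
        {
            return true;
        }
    }

The cached arrays are built once per segment and reused for later queries against the same reader, so repeat counts avoid the stored-field reads entirely.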
>
> FYI: If this looks like well-structured data that might fit into SQL pretty well, you're right, it could, but for reasons beyond our control we can't, so we are going with the next best approach.
>
> The current index is about 70 GB and 50 million documents.. Eventually we will have 180-200 million documents in here.
>
> Josh Handel
> Senior Lead Consultant
> 512.328.8181 | Main
> 512.328.0584 | Fax
> 512.577-6568 | Cell
> www.catapultsystems.com
>
> CATAPULT SYSTEMS INC.
> ENABLING BUSINESS THROUGH TECHNOLOGY
>
>
>
>



