lucene-lucene-net-user mailing list archives

From "Noel Lysaght" <lysag...@hotmail.com>
Subject Re: Best way to store arrays?
Date Fri, 04 Jun 2010 08:13:30 GMT
Hi Josh,

Why not try this: partition your data into 2 logical segments.

Segment 1 has fields:
Record Identifier
All your standard query fields
Length of Array

Segment 2 has fields (for storing array elements):
Record Identifier
Array Element Number
Array Element Data

Segment 1 and Segment 2 are linked using the Record Identifier


You then need to do 2 searches.
Search 1 is against Segment 1 and will return your main set of data; you can 
retrieve the 700-800 thousand records from that very quickly.

When you need the array values, just pull the data you are interested in by 
searching Segment 2 on the matching Record Identifiers from your first 
search.
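
A minimal sketch of that layout (illustrative field and type names, assuming 
the Lucene.Net 2.9 API; this is not code from the thread):

using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;

static class TwoSegmentIndexer
{
    // Segment 1: one "master" document per record, carrying the standard
    // query fields plus the array length.
    public static void AddMaster(IndexWriter writer, string recordId, int arrayLength)
    {
        Document master = new Document();
        master.Add(new Field("RecordId", recordId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // ... your standard query fields go here ...
        master.Add(new Field("ArrayLength", arrayLength.ToString(), Field.Store.YES, Field.Index.NO));
        writer.AddDocument(master);
    }

    // Segment 2: one document per array element, linked back to the
    // master record through RecordId.
    public static void AddElements(IndexWriter writer, string recordId, IList<string> values)
    {
        for (int i = 0; i < values.Count; i++)
        {
            Document element = new Document();
            element.Add(new Field("RecordId", recordId, Field.Store.YES, Field.Index.NOT_ANALYZED));
            element.Add(new Field("ElementNumber", i.ToString(), Field.Store.YES, Field.Index.NO));
            element.Add(new Field("ElementData", values[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.AddDocument(element);
        }
    }
}

Search 2 is then just a query on the Record Identifiers that search 1 
returned, so you only ever load the array elements you actually need.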


BUT:
One huge question that you don't seem to want to answer is why you need 
potentially millions of pieces of data from a search.
Are you basically trying to use Lucene as a database? Because it ain't 
that....

Kind Regards
Noel



--------------------------------------------------
From: "Josh Handel" <Josh.Handel@catapultsystems.com>
Sent: Thursday, June 03, 2010 11:10 PM
To: <lucene-net-user@lucene.apache.org>; 
<lucene-net-user@incubator.apache.org>
Subject: RE: Best way to store arrays?

> In some cases I may want the array length, but the bigger need will be for 
> the actual array of values.. And in the case of just wanting the count, 
> you're right, I could store the array length..
>
> But it's only a symptom of a larger problem.. That is, reading 1 field from 
> 800,000 documents (out of about 50 million) is taking 5 minutes..
>
> And the split has no impact on that (I took it out to see if that was my 
> problem).. it still took 5 minutes.. In fact, skipping the split was 
> slightly slower, though I imagine that's just due to other random load on 
> my test box.
>
> Josh
>
> -----Original Message-----
> From: Robert Jordan [mailto:robertj@gmx.net]
> Sent: Thursday, June 03, 2010 5:05 PM
> To: lucene-net-user@incubator.apache.org
> Subject: Re: Best way to store arrays?
>
> If you only want the array length, why are you not storing it as a field 
> along with the array? This would save you millions of splits, hundreds of 
> thousands of temporarily created arrays, etc.
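>
> A minimal sketch of that index-time change (illustrative names, assuming 
> the Lucene.Net 2.9 API):
>
> using Lucene.Net.Documents;
>
> static void AddArrayWithLength(Document doc, string name, string[] values, string delimiter)
> {
>     // The delimiter-joined array, exactly as before.
>     doc.Add(new Field(name, string.Join(delimiter, values),
>                       Field.Store.YES, Field.Index.ANALYZED));
>     // The element count, computed once at index time, so a search never
>     // has to load and split the stored string just to count elements.
>     doc.Add(new Field(name + "_Length", values.Length.ToString(),
>                       Field.Store.YES, Field.Index.NO));
> }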
>
> I believe this is what Digy meant by the XY problem: you're showing a 
> problem Y with a trivial solution, but X is actually much harder, and we 
> can't help because you don't show us X.
>
> Robert
>
> On 03.06.2010 22:31, Josh Handel wrote:
>> Nope,
>>     This is the best-choice option given the client's constraints on our 
>> implementation.
>>
>> Also, I can narrow down the problem... The queries take about 1.3 seconds 
>> to get all the documentIDs (in this case about 700~800 thousand).. It then 
>> takes (after moving the field lookup into the collector) about 5 minutes 
>> to pull the field for all of those records..
>>
>> So regardless of the architectural choice, is there any way to speed up 
>> reading a field, or does 5 minutes to read 800,000 fields from an index 
>> while they are being collected seem correct?
>>
>>
>> I can live with my splitting magic.. It works, and the queries are 
>> finding the right documents..
>>
>> Also, this isn't user-facing; this is a process-to-process call, so 
>> pagination isn't really a useful option here.
>>
>> Josh
>> -----Original Message-----
>> From: Digy [mailto:digydigy@gmail.com]
>> Sent: Thursday, June 03, 2010 3:14 PM
>> To: lucene-net-user@lucene.apache.org
>> Subject: RE: Best way to store arrays?
>>
>> Is this an XY-problem?
>> http://www.perlmonks.org/index.pl?node_id=542341
>>
>> DIGY
>>
>>
>> -----Original Message-----
>> From: Josh Handel [mailto:Josh.Handel@catapultsystems.com]
>> Sent: Thursday, June 03, 2010 8:24 PM
>> To: lucene-net-user@lucene.apache.org
>> Subject: Best way to store arrays?
>>
>> Guys,
>>     I have the following scenario.. I have several arrays of data that I 
>> need stored in Lucene. Currently I store each array in its own field, 
>> using a custom tokenizer to split on my delimiter (a random Unicode 
>> character).
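>>
>> Roughly, each array field is built like this (a sketch with illustrative 
>> names; ListField.DELIMETER is the delimiter constant referenced in the 
>> code further down, and the actual character is not given here):
>>
>> using Lucene.Net.Documents;
>>
>> static class ListField
>> {
>>     // Stand-in for the "random Unicode character" used as the delimiter.
>>     public const string DELIMETER = "\u2063";
>>
>>     // Join the array into a single stored field; the custom tokenizer
>>     // (on the writer's analyzer) splits it back apart at analysis time.
>>     public static Field Make(string name, string[] values)
>>     {
>>         return new Field(name, string.Join(DELIMETER, values),
>>                          Field.Store.YES, Field.Index.ANALYZED);
>>     }
>> }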
>>
>> Then when people want a hit count (on an arrayed field), I have to load up 
>> each hit document one by one and split and count the results.
>>
>> Is there a better way to store arrays? Right now the whole thing is MUCH 
>> slower than I expected.. taking tens of minutes to pull hits of 700 or 800 
>> thousand records and then count the elements in the arrays.
>>
>> I already have a custom collector to get all hits and skip counting (it 
>> just gets the docID).. Below is how I am loading up the documents... 
>> (Also, the initial query is taking tens of seconds; I've included a sample 
>> of it as well.)
>>
>> Document Structure
>> Field : ProfileID - a string
>> Field(Array) : Publishers - a list of ints
>> Field(Array) : N_PublisherNames (repeated for each publisher in the Publishers field) - a list of strings
>> Field(Array) : DataPoints - a list of ints
>>
>> Example Query: (DataPoints:((3 OR 4) AND (54)) OR ((3 OR 4) AND (55) AND (100)) OR ((52))) AND (Publishers:1)
>>
>> How I execute the query to get a count (for a field)
>>
>> m_Log.Debug("Searching for items");
>> IndexSearcher searcher = IndexHelper.GetCurrentSearcher(index);
>> GetAllCollector gac = IndexHelper.RunQuery<GetAllCollector>(index, query, searcher);
>> int count = 0;
>>
>> m_Log.Debug("Opening found documents to count fields");
>> // Only load the one field we care about from each hit.
>> FieldSelector fs = new MapFieldSelector(new string[] { field });
>> foreach (int docID in gac.DocIDs)
>> {
>>     Document doc = searcher.GetIndexReader().Document(docID, fs);
>>     foreach (Fieldable f in doc.GetFieldables(field))
>>     {
>>         if (f.IsStored())
>>         {
>>             string value = f.StringValue();
>>             // A delimited value holds several array elements; count them all.
>>             if (value.Contains(ListField.DELIMETER))
>>             {
>>                 count += value.Split(ListField.DELIMETER.ToCharArray()).Length;
>>             }
>>             else
>>             {
>>                 count++;
>>             }
>>         }
>>     }
>> }
>> m_Log.Debug("Finished counting fields");
>>
>> Anyways, I am all ears on speeding this up :)
>>
>> FYI: If this looks like well-structured data that might fit in SQL pretty 
>> well, you're right, it could, but for reasons beyond our control we can't, 
>> so we are going for the next best approach.
>>
>> The current index is about 70 GB and 50 million documents.. Eventually we 
>> will have 180~200 million documents in here.
>>
>> Josh Handel
>> Senior Lead Consultant
>> 512.328.8181 | Main
>> 512.328.0584 | Fax
>> 512.577.6568 | Cell
>> www.catapultsystems.com
>>
>> CATAPULT SYSTEMS INC.
>> ENABLING BUSINESS THROUGH TECHNOLOGY
