lucenenet-user mailing list archives

From Simon Svensson <si...@devhost.se>
Subject Re: Lucene.net Nested Documents support (lucene version 3.4)
Date Wed, 22 Aug 2012 08:47:44 GMT
The index will grow, but not by an extreme amount. If all 
values can be represented as normal 4-byte integers, then a type and two 
users would be 12 bytes [ typeId, firstUserId, secondUserId ]. You could 
choose a more compact encoding based on your internal knowledge of the most 
common types, the size (and generation) of user ids, etc. Perhaps the same 
VInt (variable-length) integers that Lucene uses internally.
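
As a rough sketch of the two layouts mentioned above, the following packs a type id and user ids into a byte array, once as fixed 4-byte integers and once in a VInt-style variable-length form (7 data bits per byte, high bit set while more bytes follow). TagPayloadCodec and its method names are hypothetical helpers made up for illustration, not part of Lucene.Net.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a Lucene.Net type.
public static class TagPayloadCodec
{
    // Fixed-width layout: 4 bytes for the type id, then 4 bytes per user id.
    public static byte[] EncodeFixed(int typeId, params int[] userIds)
    {
        var bytes = new byte[4 * (1 + userIds.Length)];
        Buffer.BlockCopy(BitConverter.GetBytes(typeId), 0, bytes, 0, 4);
        for (var i = 0; i < userIds.Length; i++)
            Buffer.BlockCopy(BitConverter.GetBytes(userIds[i]), 0, bytes, 4 * (i + 1), 4);
        return bytes;
    }

    // Variable-length layout: small values take fewer bytes.
    public static byte[] EncodeVInt(int typeId, params int[] userIds)
    {
        var bytes = new List<byte>();
        WriteVInt(bytes, typeId);
        foreach (var userId in userIds)
            WriteVInt(bytes, userId);
        return bytes.ToArray();
    }

    // Writes 7 bits at a time, low-order group first; the high bit marks
    // "more bytes follow". Negative values always take five bytes.
    private static void WriteVInt(List<byte> target, int value)
    {
        var remaining = (uint)value;
        while (remaining >= 0x80)
        {
            target.Add((byte)((remaining & 0x7F) | 0x80));
            remaining >>= 7;
        }
        target.Add((byte)remaining);
    }
}
```

With small ids the variable-length form is much shorter than 12 bytes, at the cost of a slightly more involved decoder at query time.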

Assuming 12 bytes per document, the index would grow by 
about 12 megabytes per million documents. Lucene can handle far larger 
indexes than that. This is perhaps a workaround to proceed with until true 
nested documents are introduced?

You could use the PositiveScoresOnlyCollector (which wraps another 
collector) to ignore hits with a score of zero.

Perhaps index every tag twice, the second time with type-information 
embedded. That would make it possible to search for tag+type if that's a 
common search, but you would still need to verify permissions.
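
A minimal sketch of the "index every tag twice" idea: for each tag, produce the plain term plus a second term with the type embedded. The TagTerms helper and the "#" separator are assumptions chosen for illustration; any separator that cannot occur in a tag name would do, and the analysis chain must keep the combined term as a single token.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not a Lucene.Net type.
public static class TagTerms
{
    // Assumed separator; pick one that never appears in tag names.
    private const string Separator = "#";

    // Returns the plain tag term followed by the tag+type term.
    public static string[] For(string tagName, int typeId)
    {
        var name = tagName.ToLowerInvariant();
        var typed = name + Separator + typeId.ToString(CultureInfo.InvariantCulture);
        return new[] { name, typed };
    }
}
```

A search for tag X of any type would then use the first term, and a search for X with type 1 would use the second.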

// Simon

On 2012-08-22 10:23, Omri Suissa wrote:
> Thank you,
> I can understand how this can work on a small number of documents, but what
> if I have millions of documents?
> Then there could be a situation where a lot of documents are returned by
> the query and only then do we set the score to 0.
> I would like to find a way so that the documents are not returned from the
> query in the first place... (as far as I understand, this would be
> much more efficient).
>
> Omri
>
> On Wed, Aug 22, 2012 at 10:12 AM, Simon Svensson <sisve@devhost.se> wrote:
>
>> Hi,
>>
>> First off, storing this data in the index would mean that you
>> store the permissions at index time, not query time. Any changed
>> permissions would require a reindexing of the affected documents.
>>
>> You can accomplish this using payloads. I'm not sure of the technical
>> details regarding how they are read into memory, caching and such. I'm
>> using payloads for a small index (a few thousand documents) to have a
>> timestamp on indexed values (a valid-until date) so documents no longer
>> match a specific token after a set date. You could do something
>> similar where type and permission information is encoded as a payload,
>> a byte array, and verified at query time.
>>
>> The score is calculated using a custom similarity, specified with
>> indexSearcher.SetSimilarity(new ValiditySimilarity());
>>
>>      public class ValiditySimilarity : DefaultSimilarity {
>>          public override Single ScorePayload(Int32 docId, String fieldName,
>>              Int32 start, Int32 end, Byte[] payload, Int32 offset, Int32 length) {
>>              var validTo = BitConverter.ToInt64(payload, offset);
>>              if (DateTime.Now.Ticks < validTo)
>>                  return 1;
>>
>>              return 0;
>>          }
>>      }
>>
>> The actual payload is generated by a custom token stream when indexing
>> the document.
>>
>>      document.Add(new Field("FieldName",
>>          GetTokenStream("value1 value2", DateTime.Now.AddDays(1))));
>>
>>      private static TokenStream GetTokenStream(String value, DateTime validTo) {
>>          var valueReader = new StringReader(value);
>>          // Declared as TokenStream (not var) so the filters below can be assigned to it.
>>          TokenStream stream = new StandardTokenizer(V.LUCENE_29, valueReader);
>>          stream = new LowerCaseFilter(stream);
>>          stream = new ValidityPayloadFilter(stream, validTo);
>>          return stream;
>>      }
>>
>>      public class ValidityPayloadFilter : TokenFilter {
>>          private readonly DateTime _validTo;
>>          private readonly PayloadAttribute _payloadAttribute;
>>
>>          public ValidityPayloadFilter(TokenStream stream, DateTime validTo)
>>              : base(stream) {
>>              _validTo = validTo;
>>              _payloadAttribute =
>>                  (PayloadAttribute)AddAttribute(typeof(PayloadAttribute));
>>          }
>>
>>          public override Boolean IncrementToken() {
>>              if (!input.IncrementToken())
>>                  return false;
>>
>>              var bytes = BitConverter.GetBytes(_validTo.Ticks);
>>
>>              var payload = new Payload(bytes);
>>              _payloadAttribute.SetPayload(payload);
>>              return true;
>>          }
>>      }
>>
>> // Simon
>>
>>
>>
>> On 2012-08-22 08:13, Omri Suissa wrote:
>>
>>> Hi Simon,
>>> Thanks for the help.
>>> This is my scenario:
>>> My search application allows users to add manual tags to each document;
>>> each tag has a name, a type and permissions.
>>> When searching I would like to have the following options:
>>> 1) get all the documents that contain a specific tag (with any type) that I
>>> have permission to view
>>> 2) get all the documents that contain a specific tag with a specific type that
>>> I have permission to view
>>>
>>> For example if I have 2 documents:
>>> Doc A with tags:
>>>            X (type 1, permissions: everyone)
>>>            Y (type 1, permissions: User1, User2)
>>>            Z (type 2, permissions: User1)
>>>
>>> Doc B with tags:
>>>            X (type 2, permissions: everyone)
>>>            Y (type 4, permissions: everyone)
>>>            Z (type 2, permissions: User1)
>>>
>>> I'll be able to find A and B when searching for all documents with tag X,
>>> only A if X with type 1, and none of them if tag Z and I'm User2 (and so
>>> on...).
>>>
>>> So nested documents could really help me where each tag is a sub document
>>> (like sql JOIN operation).
>>>
>>> What can I do using the current capabilities?
>>>
>>> Thank you for the help,
>>> Omri
>>>
>>> On Tue, Aug 21, 2012 at 8:02 PM, Simon Svensson <sisve@devhost.se> wrote:
>>>
>>>> Hi,
>>>> I do not have an answer to your explicit question, but this mail group
>>>> could perhaps help you with workarounds using the current functionality.
>>>> Are you after the search functionality (field1:a and field2:b) with child
>>>> documents? Or grouping of the results (the sql equivalent of group by)?
>>>> Return the first 5 entries of every group (like a Google search does per
>>>> site)?
>>>>
>>>> // Simon
>>>>
>>>>
>>>> On 2012-08-21 16:00, Omri Suissa wrote:
>>>>
>>>>> Hi everyone,
>>>>> We are currently implementing Lucene.Net in our solution and we need to
>>>>> use the Lucene Nested Documents support that was introduced in Lucene
>>>>> version 3.4.
>>>>> If I understand correctly, the current version of Lucene.Net does not
>>>>> support this feature (and other 3.4 features). Is there a timeline for
>>>>> the 3.4 port to .NET?
>>>>>
>>>>> Thank you,
>>>>> Omri
>>>>>
>>>>>
>>>>>
>>
>>
>>

