Mahalanobis distance as defined really depends on a good estimate of the
covariance, and that is a fairly expensive computation.
Generally, you need to restrict the form of the covariance or put a strong
prior distribution on it in order to get a reasonable estimate in the kind
of ill-posed problem that you are describing. One possibility is to assume
that the covariance is sparse, with only a few non-diagonal terms. If you
have some kind of heuristic for deciding where you will allow non-zeros,
then you can build an estimation technique. Not surprisingly, I recommend
LLR (log-likelihood ratio scoring) for finding interesting points.
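For concreteness, here is a rough sketch of the restricted-covariance idea in plain Java (no Mahout classes; the class name, the string-keyed cell map, and the example pair set are all made up for illustration). It estimates sample covariance only at positions you have decided to allow, e.g. positions pre-selected by an LLR-style interestingness test:

```java
import java.util.*;

public class SparseCovariance {
    // Estimate covariance only at an allowed set of (i,j) cells, keyed "i,j".
    // Everything outside the allowed set is implicitly zero.
    public static Map<String, Double> estimate(double[][] rows, Set<String> allowed) {
        int n = rows.length, d = rows[0].length;
        double[] mean = new double[d];
        for (double[] row : rows)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        Map<String, Double> cov = new HashMap<>();
        for (String key : allowed) {
            String[] ij = key.split(",");
            int i = Integer.parseInt(ij[0]), j = Integer.parseInt(ij[1]);
            double s = 0;
            for (double[] row : rows) s += (row[i] - mean[i]) * (row[j] - mean[j]);
            cov.put(key, s / (n - 1));   // unbiased sample covariance at this cell
        }
        return cov;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2}, {3, 4}};
        Set<String> allowed = new HashSet<>(Arrays.asList("0,0", "1,1", "0,1"));
        System.out.println(estimate(rows, allowed));
    }
}
```

Storage and time then scale with the number of allowed cells rather than with d², which is the whole point when d is the vocabulary size.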
Another option is to look at rank-limited approximations like the SVD,
which are not sparse, but do require much less storage than fully dense
representations. Both of these can be viewed as decompositions, but that
viewpoint probably only really helps in the SVD case.
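As a sketch of how the rank-limited route avoids a dense inverse: if you model the covariance as a ridge term plus k orthonormal eigenpairs, C = λI + Σᵢ sᵢ uᵢuᵢᵀ, the Woodbury identity gives C⁻¹x = (1/λ)(x − Σᵢ (sᵢ/(λ+sᵢ))(uᵢ·x) uᵢ), so squared Mahalanobis distance costs only k dot products. Plain Java below; the class name is made up and the eigenpairs are assumed to come from some truncated SVD/eigensolver:

```java
public class LowRankMahalanobis {
    // Squared Mahalanobis distance under C = lambda*I + sum_i s[i]*u[i]u[i]^T,
    // with orthonormal rows u[i]. Never forms C or C^-1 explicitly.
    public static double distSq(double[] x, double[][] u, double[] s, double lambda) {
        double acc = 0;
        for (int i = 0; i < s.length; i++) {
            double p = dot(u[i], x);                  // projection onto i-th eigenvector
            acc += (s[i] / (lambda + s[i])) * p * p;  // Woodbury correction term
        }
        return (dot(x, x) - acc) / lambda;
    }

    private static double dot(double[] a, double[] b) {
        double r = 0;
        for (int i = 0; i < a.length; i++) r += a[i] * b[i];
        return r;
    }

    public static void main(String[] args) {
        // C = diag(4, 1): lambda = 1 plus one eigenpair u = (1,0) with s = 3.
        double[] x = {2, 2};
        // Direct check: x^T C^-1 x = 4/4 + 4/1 = 5
        System.out.println(distSq(x, new double[][]{{1, 0}}, new double[]{3}, 1.0));
    }
}
```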
If I were working on this problem, though, I would strongly question the
need for Mahalanobis distance in the first place. It can be lovely, but it
really comes from an earlier time and I question the applicability to large
sparse matrices of counts.
On Wed, Jul 20, 2011 at 10:28 AM, Marco Turchi <marco.turchi@gmail.com> wrote:
> ...
> I have the last question:
> I need to compute the covariance matrix of some input data to compute the
> Mahalanobis distance. In my data, the number of features is bigger than the
> number of docs, and these data are very sparse. I cannot compute the
> covariance matrix using the classic approach, because it becomes very
> expensive in both time and memory. Is there any implementation of an
> approximate way to compute it in Mahout? I have had a look in the library,
> but I cannot find one.
>
> thanks for your help
> Marco
>
>
> On 20 Jul 2011, at 19:15, Ted Dunning wrote:
>
>> Well, that is hard to understand without a complete test case. In the
>> first case you have a vector with one element constructed, while in the
>> second case you wind up with two elements and don't show the constructor.
>>
>> Write up a JIRA and attach a unit test.
>>
>> My strong expectation is that this is a case of refrigerator blindness.
>>
>> On Wed, Jul 20, 2011 at 9:22 AM, Marco Turchi <marco.turchi@gmail.com>
>> wrote:
>>
>>> Hi,
>>> you are right about the sparse vector issue... but I'm constructing distV
>>> in the same way, changing only + to - for the variable mean. In both
>>> cases, I should have the same number of entries in the final vector.
>>>
>>> Thanks a lot for your help
>>> Marco
>>>
>>>
>>> On 20 Jul 2011, at 17:42, Ted Dunning wrote:
>>>
>>>> You constructed the first vector with a dimension of 1. It looks like
>>>> you constructed the second one with a larger dimension of 2.
>>>>
>>>> When you offset a sparse vector, all of the zeros become non-zero and
>>>> the vector becomes dense. This results in a bunch of cells being
>>>> created.
>>>>
>>>> On Wed, Jul 20, 2011 at 6:28 AM, marco turchi <marco.turchi@gmail.com>
>>>> wrote:
>>>>> Dear All,
>>>>> I have a strange behaviour when I use the method Plus for Vector.
>>>>>
>>>>> I have a RandomAccessSparseVector vector; if I add a positive number,
>>>>> I get a new Vector where each element is the sum of the old value plus
>>>>> the positive number, while if I add a negative number, the new vector
>>>>> has one more entry:
>>>>>
>>>>>
>>>>> RandomAccessSparseVector distV = new RandomAccessSparseVector(1);
>>>>> distV.setQuick(0,1);
>>>>> double mean = 1;
>>>>> RandomAccessSparseVector app =
>>>>> (RandomAccessSparseVector)(distV.plus(mean));
>>>>>
>>>>> the output is
>>>>> {0:2.0}
>>>>>
>>>>> if I have
>>>>> double mean = -1;
>>>>> RandomAccessSparseVector app =
>>>>> (RandomAccessSparseVector)(distV.plus(mean));
>>>>>
>>>>> the output is
>>>>> {1:1.0,0:1.0}
>>>>>
>>>>> For sure I'm doing something wrong. Do you have any ideas where the
>>>>> problem is?
>>>>>
>>>>> Thanks a lot in advance
>>>>> Marco
>>>>>
>>>>>
>>>>>
>>>
>
