I guess the best way to reply to your message is with the help of the
following quote from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.6035
"Since the existence of an item in a transaction is indicated by a
probability, an advantage of the existential uncertain data model is that it
allows more information to be captured by the dataset. Consider again the
example patient dataset. If we adopt a binary data model, then each
symptom/illness can either be present (1) or absent (0) in a patient record.
Under the binary model, data analysts will be forced to set a threshold
value for each symptom/illness to quantize the probabilities into either 1
or 0. In other words, information about those (marginally) low values is
discarded. The uncertain data model, however, allows such information be
retained and be available for analysis. The disadvantage of retaining such
information is that the size of the dataset would be much larger than that
under the quantized binary model. This is particularly true if most
of the existential probabilities are very small. Consequently, mining
algorithms will run a lot slower on such large datasets."
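To make the quoted point concrete, here is a tiny toy sketch (my own made-up example, not from the paper) of what threshold quantization discards:

```python
# Uncertain "patient record": symptom -> existential probability.
uncertain_record = {"fever": 0.9, "cough": 0.4, "fatigue": 0.05}

def quantize(record, threshold=0.5):
    """Binary model: keep an item iff its probability clears the threshold."""
    return {item for item, p in record.items() if p >= threshold}

binary_record = quantize(uncertain_record)
print(binary_record)     # {'fever'} -- the 0.4 and the 0.05 are discarded
print(uncertain_record)  # the uncertain model keeps all three values
```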
Setting aside the definitions of uncertain data, I get the feeling that you
guys are not too excited about the whole uncertain data mining algorithm idea ;)
On Thu, Mar 24, 2011 at 4:07 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> That still doesn't sound tenable.
>
> The problem is that you have an observable or you don't. In your example,
> you have an observable "doctor says that patient has a disease". You don't
> have access to the internal state of mind of the doctor, only what they say
> or do. The thing that is uncertain is itself unobservable.
>
> I say this as a pretty committed Bayesian who thinks that probability is
> the best way for us to describe our uncertain state of knowledge about the
> world. I am thus pretty sympathetic to uncertain inferences. But
> observations are not uncertain, they are what they are. They may be the
> noisy transduction of some unobserved real thing where the noise is due to
> an uncertain measurement process. But that is just another case of
> observations being certain and the unobservable thing being uncertain. As
> such, the data are not uncertain, just our interpretation.
>
> Given that, the right thing to do is actually model a posterior
> distribution of the unobserved uncertain cause. Frequent itemsets are all
> about counting of observations, not inference. We may do inference from
> those counts, but that doesn't impinge on the counting process.
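A toy illustration of this split, with made-up numbers: the observed counts are certain; only the inferred quantity is uncertain, and that uncertainty lives in a posterior (here a conjugate Beta over an unknown rate).

```python
def beta_posterior(successes, failures, a_prior=1.0, b_prior=1.0):
    """Beta(a, b) posterior over an unobserved rate, after certain,
    observed counts, starting from a Beta(a_prior, b_prior) prior."""
    return a_prior + successes, b_prior + failures

# The counts (9 "yes", 1 "no") are hard observations; the rate is not.
a, b = beta_posterior(successes=9, failures=1)
posterior_mean = a / (a + b)
print((a, b), posterior_mean)  # (10.0, 2.0), posterior mean ~0.83
```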
>
>
> On Thu, Mar 24, 2011 at 1:11 PM, Yarco Hayduk <yarcoh@gmail.com> wrote:
>
>> In uncertain FPM we suspect but cannot guarantee the presence or absence
>> of an item. Say, a doctor is 90% sure that a patient suffers from a
>> particular kind of disease. This uncertainty can be expressed in terms of
>> existential probability. More precisely, uncertain data in this case means
>> existentially uncertain data.
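One common way to score itemsets over existentially uncertain data is expected support: sum, over transactions, the product of the itemset's existential probabilities (assuming item independence within a transaction). A sketch with made-up transactions:

```python
# Hypothetical uncertain transactions: each maps item -> existential probability.
transactions = [
    {"flu": 0.9, "cough": 0.8},
    {"flu": 0.5, "cough": 0.7},
    {"cough": 1.0},
]

def expected_support(itemset, db):
    """Expected support: sum over transactions of the product of the
    items' existential probabilities (items assumed independent)."""
    total = 0.0
    for t in db:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)
        total += p
    return total

print(expected_support({"flu"}, transactions))           # 0.9 + 0.5 = 1.4
print(expected_support({"flu", "cough"}, transactions))  # 0.9*0.8 + 0.5*0.7 = 1.07
```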
>>
>> This paper's introduction has a great explanation of uncertain data.
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.4656&rep=rep1&type=pdf
>>
>> In the previous email, when I was talking about the sampling approach, I
>> meant to post this link:
>> http://wwwis.win.tue.nl/~tcalders/pubs/CALDERSPAKDD10.pdf
>>
>> sorry for the confusion
>>
>> yarco;)
>>
>>
>> On Thu, Mar 24, 2011 at 1:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> I don't see the value here. This paper doesn't make their definitions
>>> clear and I find it particularly
>>> troubling that they seem to assume that the observations being analyzed
>>> involve the observation of
>>> probabilities.
>>>
>>> In every case that I know of (and I mean every, not most) the
>>> observations are not uncertain. They
>>> may be the result of an underlying stochastic process and I think it very
>>> important to incorporate that
>>> assumption into any analysis, but the data themselves do not involve any
>>> observed probabilities. In
>>> some sense, the data that we observe are already the result of sampling
>>> from the probabilities that
>>> this paper refers to.
>>>
>>> As such, I really don't see what value this work has.
>>>
>>> I might agree with the authors that a Bayesian approach is required in
>>> which we consider our task to
>>> be the estimation of these underlying probabilities rather than the
>>> rather pedestrian way that most
>>> frequent item set algorithms simply assume that counts are the end of the
>>> story.
>>>
>>> But that doesn't sound like what the paper is saying. Maybe I am
>>> misreading it.
>>>
>>>
>>> On Thu, Mar 24, 2011 at 10:53 AM, Yarco Hayduk <yarcoh@gmail.com> wrote:
>>>
>>>> What Mahout does not have at this point is the ability to mine frequent
>>>> patterns from uncertain data.
>>>>
>>>> You could implement this feature in two ways:
>>>>
>>>> 1. Using sampling (http://www.usukita.org/papers/4931/freq.pdf)
>>>> 2. Rewriting the FPM algorithms so that they can work with uncertain
>>>> data (http://www.usukita.org/papers/4931/freq.pdf)
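My reading of way (1), as a rough single-process sketch (not Mahout code): instantiate certain "possible worlds" by sampling each item independently with its existential probability, then feed each world to an ordinary frequent-pattern miner and average the supports.

```python
import random

def sample_world(uncertain_db, rng):
    """One certain instantiation of an uncertain database."""
    return [{item for item, p in t.items() if rng.random() < p}
            for t in uncertain_db]

def estimated_support(itemset, uncertain_db, n_worlds=1000, seed=42):
    """Estimate an itemset's support by averaging over sampled worlds."""
    rng = random.Random(seed)
    itemset = set(itemset)
    total = 0
    for _ in range(n_worlds):
        world = sample_world(uncertain_db, rng)
        total += sum(itemset <= t for t in world)
    return total / n_worlds

db = [{"flu": 0.9, "cough": 0.8}, {"flu": 0.5}]
print(estimated_support({"flu"}, db))  # ~1.4 (= 0.9 + 0.5)
```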
>>>>
>>>> I could implement the UHMine algorithm that would work with uncertain
>>>> data. It's described in (2).
>>>> yarco;)
>>>>
>>>>
>>>> > On Thu, Mar 24, 2011 at 12:27 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> >>
>>>> >> Moderately good as a project.
>>>> >> I think that Mahout would benefit more from new capabilities than
>>>> >> from a reimplementation of old capabilities with a new algorithm.
>>>> >> Item set mining is not particularly subject to radical improvement
>>>> >> in terms of the results. Speed improvements would be nice, but are
>>>> >> not a major point of pain with the current implementation.
>>>> >> What do you see about this project that would be good for Mahout?
>>>> >> Are there related things that you need that Mahout doesn't do?
>>>> >>
>>>> >> On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <yarcoh@gmail.com> wrote:
>>>> >>>
>>>> >>> So does this look like a valid project idea?
>>>> >>>
>>>> >>> Thanks
>>>> >>>
>>>> >>> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <yarcoh@gmail.com> wrote:
>>>> >>>
>>>> >>> > I believe that Section 3.1 of the aforementioned paper talks
>>>> >>> > about the parallel version of HMine.
>>>> >>> >
>>>> >>> > You are right: HMine has a backtracking step, which adds nodes to
>>>> >>> > the next node and introduces a dependency, i.e. you cannot start
>>>> >>> > working on the (i+1)-th header element before the i-th element
>>>> >>> > has been mined.
>>>> >>> >
>>>> >>> > The HMine algorithm works as follows:
>>>> >>> >
>>>> >>> > 1. prunes the initial DB such that all singleton infrequent items
>>>> >>> > are removed,
>>>> >>> > 2. divides the pruned DB into equal chunks,
>>>> >>> > 3. mines these chunks separately using the HMine(mem) algorithm,
>>>> >>> > 4. joins the results, and
>>>> >>> > 5. scans the pruned DB once again to remove false positives and
>>>> >>> > obtain the actual counts.
>>>> >>> >
>>>> >>> > I'm pretty sure that I can map these steps to MapReduce.
>>>> >>> >
>>>> >>> > Having said that, I am not sure that this approach would work
>>>> >>> > better than Parallel FPGrowth.
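The five steps read like a classic partition-and-validate scheme. A toy single-process sketch (my own illustration; a naive enumerator stands in for HMine(mem), and a real run would apply a scaled per-chunk threshold):

```python
from collections import Counter
from itertools import combinations

def mine_chunk(chunk, k=2):
    """Placeholder local miner: enumerates itemsets up to size k seen in
    one chunk and returns them all as candidates (stand-in for HMine(mem))."""
    counts = Counter()
    for t in chunk:
        for size in range(1, k + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return set(counts)

def partition_and_validate(db, min_support, n_chunks=2):
    # 1. prune singleton-infrequent items from the DB
    item_counts = Counter(i for t in db for i in t)
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    pruned = [set(t) & frequent_items for t in db]
    pruned = [t for t in pruned if t]
    # 2. divide into equal chunks; 3. mine each chunk (the "map" step)
    chunks = [pruned[i::n_chunks] for i in range(n_chunks)]
    # 4. join (union) the candidate sets
    candidates = set().union(*(mine_chunk(c) for c in chunks))
    # 5. rescan the pruned DB to drop false positives and get true counts
    final = {}
    for cand in candidates:
        count = sum(set(cand) <= t for t in pruned)
        if count >= min_support:
            final[cand] = count
    return final

db = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
print(partition_and_validate(db, min_support=2))
# e.g. ('a','b') and ('b','c') both survive with support 2
```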
>>>> >>> >
>>>> >>> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <robin.anil@gmail.com> wrote:
>>>> >>> >
>>>> >>> >> We have a Parallel FPGrowth implementation. I have read that by
>>>> >>> >> itself HMine is faster in its sequential version, but it may not
>>>> >>> >> be of use to Mahout in that form. If you are able to propose a
>>>> >>> >> parallel implementation of HMine using MapReduce, it will be of
>>>> >>> >> great interest to Mahout.
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <yarcoh@gmail.com> wrote:
>>>> >>> >>
>>>> >>> >>> Hi,
>>>> >>> >>>
>>>> >>> >>> For *Google Summer of Code 2011*, would you be interested if I
>>>> >>> >>> implemented the HMine algorithm?
>>>> >>> >>>
>>>> >>> >>> http://www.cs.sfu.ca/~jpei/publications/Hminejn.pdf
>>>> >>> >>>
>>>> >>> >>> Thank you,
>>>> >>> >>> yarco;)
>>>> >>> >>>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >
>>>> >>
>>>> >
>>>>
>>>>
>>>
>>
>
