Probabilistic Models


One important use of probabilistic models arises when we try to predict some phenomenon and we don't know all the factors that can influence it, or we do know them, but they are much too complicated to build a precise model of.

Exactly the situation with cards (for most of us)!

A Set of Outcomes

We consider drawing cards from a deck of ordinary playing cards.

Let:

  • CARDS = { A♠, 2♠, 3♠, ... K♠, A♣, 2♣, 3♣, ... K♣, A♦, 2♦, 3♦, ... K♦, A♥, 2♥, 3♥, ... K♥ }
  • B = Black Cards
  • N = Non Black (Red) cards
  • F = Face cards (K, Q, J all suits)
  • X = non-face cards (A, 2,... 10, all suits)

We define |S| as the number of elements in the set S (also called the cardinality of the set). For example:

    |B| = the number of black cards = 26
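
To make this concrete, here is a minimal Python sketch (the code and its names are ours, but the sets follow the definitions above) that builds the deck and computes cardinalities with len:

    # Build the 52-card deck as a set of rank+suit strings.
    RANKS = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
    SUITS = '♠♣♦♥'
    CARDS = {r + s for r in RANKS for s in SUITS}

    B = {c for c in CARDS if c[-1] in '♠♣'}    # black cards
    N = CARDS - B                              # non-black (red) cards
    F = {c for c in CARDS if c[0] in 'KQJ'}    # face cards
    X = CARDS - F                              # non-face cards

    print(len(CARDS), len(B), len(F), len(X))  # 52 26 12 40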
Sample Spaces

The mathematical set of possible outcomes of the sort we care about, often called Ω.

When we are talking ONLY about black versus non-black (B, N) there are two outcomes. The sample space is

    Ω = {B, N}
When we are talking about tossing a coin the set of outcomes is Head or Tails (H or T):
    Ω = { H, T}
When we are talking about sequences of 3 coin tosses, there are 8 possible outcomes:
    Ω = {HHH,HHT, HTH, HTT,THH, THT, TTH, TTT}
When we are talking about color outcomes for 3 card choices, the sample space is:
    Ω = {BBB,BBN, BNB, BNN,NBB, NBN, NNB, NNN}
Where B represents a black outcome and N represents a non-Black outcome.

On the other hand we may want to be very fine-grained and care about which individual card was chosen. Then the sample space is the set CARDS that we introduced above

CARDS = { A♠, 2♠, 3♠, ... K♠, A♣, 2♣, 3♣, ... K♣, A♦, 2♦, 3♦, ... K♦, A♥, 2♥, 3♥, ... K♥ }
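
These product spaces can be enumerated mechanically. A sketch using itertools.product over the single-trial outcomes:

    from itertools import product

    # Sample space for 3 coin tosses: all length-3 sequences over {H, T}.
    coin_space = {''.join(seq) for seq in product('HT', repeat=3)}
    print(sorted(coin_space))   # ['HHH', 'HHT', 'HTH', ..., 'TTT']: 8 outcomes

    # Likewise for the color outcomes of 3 card choices over {B, N}.
    color_space = {''.join(seq) for seq in product('BN', repeat=3)}
    print(len(color_space))     # 8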

Samples

Contrast the sample space with a sample: some actual sequence of outcomes gotten by performing experiments or by sorting through a corpus.

Say we draw 1000 cards from the deck, each time returning the drawn card to the deck. That's a sample of size 1000.

Samples are sequences of trials. Each trial yields an element of the sample space:

                    Coins                        Cards
    Sample          ⟨HH, HT, HH, TH, HH, ...⟩    ⟨NB, BB, NN, BN, NB, ...⟩
    Sample space    {HT, HH, TH, TT}             {NB, NN, BN, BB}
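
Such a sample is easy to simulate. A sketch (random.choices draws with replacement, matching the card-returning procedure above; the variable names are ours):

    import random

    # Record only the color (B or N) of each of 1000 draws from a fair deck.
    deck = ['B'] * 26 + ['N'] * 26
    sample = random.choices(deck, k=1000)      # draws WITH replacement
    print(sample[:5])                          # e.g. ['N', 'B', 'B', 'N', 'B']
    print(sample.count('B') / len(sample))     # near 0.5 if the deck is fair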

With any real sample we can ask the question: Is it biased? That is, has it been chosen in some way that changes the probabilistic properties of the space? Is that coin that gave us two heads in a row weighted? Did we get our card deck from a pinochle player?

It looks as if the coin may be biased, and the card sample not.

Why?

Events  

We will associate probabilities with sets of outcomes which we call events.

Consider the sample space CARDS

Let's identify some possible events in CARDS:

    B: Black cards = { A♠, 2♠, 3♠, ... K♠, A♣, 2♣, ... }
    N: Non-black cards = { A♥, 2♥, 3♥, ... K♥, A♦, 2♦, ... }

    F: Face cards = { K♠, Q♠, J♠, K♣, Q♣, J♣, K♥, Q♥, J♥, K♦, Q♦, J♦ }
    X: Non-face cards = { A♠, 2♠, 3♠, ... 10♠, A♣, 2♣, 3♣, ... 10♣, A♥, 2♥, 3♥, ... 10♥, A♦, 2♦, 3♦, ... 10♦ }


    B ∪ F: Black card or Face card
    B ∪ X: Black card or non-Face card
    F ∪ X: Face card or Non-Face card
    F ∪ N: Face card or non-Black card
    F ∪ N ∪ X: Face card or non-Black card or non-Face card
    {   } : The IMPOSSIBLE EVENT
The events described by a set of outcomes can always be thought of as disjunctions. That is, B is A♠ or A♣ or 2♠ or 2♣ ...
Probability Distribution

Prob is called a probability distribution function. It assigns a number between 0 and 1 to every set of outcomes (every event) in the sample space.

We require two properties of prob:

    I. The sum of the probs of disjoint events is the prob of their union:
      Prob(A ∪ B) = Prob(A) + Prob(B) [A and B disjoint]
    II. Prob(Ω) = 1
It follows directly from I and II that
    Prob({ } ) = 0
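
A sketch of one such function: the uniform distribution over CARDS, where an event's probability is |E| ÷ |Ω|. Both requirements can be checked directly (sets as in the first sketch; exact rationals avoid floating-point noise):

    from fractions import Fraction

    RANKS = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
    CARDS = {r + s for r in RANKS for s in '♠♣♦♥'}
    B = {c for c in CARDS if c[-1] in '♠♣'}
    N = CARDS - B

    def prob(event):
        # Uniform distribution: Prob(E) = |E| / |Ω|, as an exact rational.
        return Fraction(len(event), len(CARDS))

    assert prob(CARDS) == 1                    # II: Prob(Ω) = 1
    assert prob(set()) == 0                    # hence Prob({}) = 0
    assert prob(B | N) == prob(B) + prob(N)    # I: disjoint events add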
Conditional Probability

Let Chosen be a sample, a sequence of cards; let Blk be the subset of those draws that are black and let Fc be the subset that are Face cards.

We define P(B | F), the conditional probability of B given F as follows:

  • Prob(B | F) = Prob(B, F) ÷ Prob(F)
This is the probability of a card being black when drawn from a set of randomly chosen face cards.

Conditional probabilities should be thought of as probabilities relativized to a subset of the sample space. That is, Prob(_ | F) is itself a probability distribution over the events of Ω. By Axiom I:

    ∑x ∈ F Prob({x}) = Prob(F)

By the definition of conditional probability (noting that Prob({x} | F) = 0 for x ∉ F):

    ∑x ∈ Ω Prob({x} | F) = (∑x ∈ F Prob({x})) ÷ Prob(F) = Prob(F) ÷ Prob(F) = 1
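
Counting under the uniform distribution makes the definition concrete. A sketch computing Prob(B | F), reusing prob, CARDS, and B from the sketch above:

    F = {c for c in CARDS if c[0] in 'KQJ'}    # face cards

    # Prob(B | F) = Prob(B, F) ÷ Prob(F); the joint event is the intersection.
    print(prob(B & F) / prob(F))               # 1/2: half the 12 face cards are black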
Chain Rule

From the definition of Conditional probability we immediately get two equivalent formulations of the chain rule.
  1. Prob(B | F) = Prob(B, F) ÷ Prob(F)
  2. Prob(B,F) = Prob(B | F) * Prob(F)
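
For example, with the card events above: Prob(B, F) = Prob(B | F) * Prob(F) = (1/2) * (12/52) = 6/52, which indeed counts the 6 black face cards out of 52.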

Independence

An immediate consequence of the chain rule is an account of a special case. Suppose
  • Prob(B | F) = Prob(B).
In this very special case, we call the features (or events in probability talk) F and B independent. In this very special case (and ONLY then):
  • Prob(B,F) = Prob(B) * Prob(F)
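
B and F are in fact independent in this deck: half of all cards are black, and half of the face cards are black. A quick check, continuing the sketch above:

    assert prob(B & F) / prob(F) == prob(B)    # Prob(B | F) = Prob(B) = 1/2
    assert prob(B & F) == prob(B) * prob(F)    # so the joint probability factors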
Bayes' Rule

The chain rule is symmetric:

  1. Prob(B,F) = Prob(B | F) * Prob(F)
  2. Prob(B,F) = Prob(F | B) * Prob(B)

So it follows that

Prob(B | F) * Prob(F) = Prob(F | B) * Prob(B)

This is called Bayes' Rule.

Bayes' Rule is often written in this form:

  • Prob(A | B) = Prob(B | A) * Prob(A) ÷ Prob(B)
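
Continuing the same card sketch, we can verify Bayes' Rule by computing Prob(F | B) two ways:

    direct = prob(F & B) / prob(B)                        # by definition
    bayes  = (prob(B & F) / prob(F)) * prob(F) / prob(B)  # Prob(B|F) * Prob(F) ÷ Prob(B)
    assert direct == bayes == Fraction(3, 13)             # 6 black face cards / 26 black cards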
Maximum Likelihood Estimate of Probability

Prob(A♥) is a number x between 0 and 1. It represents the actual probability that a card will be the A♥ when randomly selected from a representative set of cards.

Claim: As you draw more and more cards (each time returning the drawn card to the deck), the ratio of the number of A♥ outcomes to the total number of outcomes will tend to approach x [the actual probability!].

Let Chosen be the total set of outcomes, and A♥ the set of outcomes that were the A♥. We estimate the probability of drawing an A♥ as follows:

    Prob(A♥) ≅ |A♥| ÷ |Chosen|
Where ≅ means "approximately equals".

This estimate of the probability is called a maximum likelihood estimate. We now justify this name.

We can think of drawing a card as an experiment with 2 outcomes, A♥ or not A♥, and call those draws that yield the A♥ successes. Given the true probability p of a success, we can compute the probability of k successes for any sequence of N independent experiments with something called the Binomial Distribution (Binomial(k | N, p)):

    Binomial(k | N, p) = C(N, k) * p^k * (1 - p)^(N - k)
                       = the probability of k successes out of N trials when the probability of a single success is p

Where C(N, k) is the number of ways of choosing which k of the N trials are the successes.

When we do a maximum likelihood estimate of p, we ask what p maximizes the probability of our sample. That is, fixing the facts of our sample (N, k), we ask what p_emp is such that

    p_emp = argmax_p Binomial(k | N, p)

Let's say in 1000 trials we get 19 draws with A♥. We look for our max:

    [Plot: Binomial(19 | 1000, p) as a function of p; the curve peaks at p = 19/1000.]
And it turns out that:
    p_emp = |A♥| ÷ |Chosen| = 19/1000 = .019 ≅ 1/52 (= .0192)
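
A sketch that confirms this numerically: scan a grid of candidate values of p and keep the one that maximizes the binomial likelihood of the observed sample (the helper names are ours):

    from math import comb

    def binomial(k, n, p):
        # Probability of k successes in n independent trials, success prob p.
        return comb(n, k) * p**k * (1 - p)**(n - k)

    N, k = 1000, 19
    grid = [i / 100000 for i in range(1, 100000)]       # candidate values of p
    p_emp = max(grid, key=lambda p: binomial(k, N, p))
    print(p_emp)                                        # 0.019 = k/N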
Maximum Likelihood Estimate of Conditional Probability

Estimating each probability by counting in a sample Chosen:

    Prob(B | F) = Prob(B, F) ÷ Prob(F)
                ≅ (|B ∩ F| ÷ |Chosen|) ÷ (|F| ÷ |Chosen|)
                = (|B ∩ F| ÷ |Chosen|) * (|Chosen| ÷ |F|)
                = |B ∩ F| ÷ |F|
So this is consistent with the idea that Prob(_ | F) is just a probability distribution with F as the sample space.
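
A sketch of the counting estimate in action: simulate draws with replacement, then estimate Prob(B | F) as the fraction of the sampled face cards that are black (names are ours):

    import random

    RANKS = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
    DECK = [r + s for r in RANKS for s in '♠♣♦♥']
    chosen = random.choices(DECK, k=10000)          # the sample Chosen

    fc  = [c for c in chosen if c[0] in 'KQJ']      # face cards in the sample
    blk = [c for c in fc if c[-1] in '♠♣']          # ... that are also black
    print(len(blk) / len(fc))                       # ≅ 0.5 = Prob(B | F)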


Bayes' Rule: An example

Let's imagine we have a contagious disease and a test for the disease (D = has the disease, H = healthy, P = positive test result). The facts are the following:

  • Everyone who has the disease tests positive (no false negatives): Prob(P | D) = 1.0
  • 5% of healthy subjects nevertheless test positive (false positives): Prob(P | H) = .05
  • 1% of the population has the disease: Prob(D) = .01

Our question is a policy question. Do we quarantine subjects who have a positive test result?

We're interested in the probability that a subject has the disease GIVEN a positive result. We use Bayes' rule:

    Prob(D | P) = Prob(P | D) * Prob(D) ÷ Prob(P)

The facts of the problem directly give us several of the numbers on the right hand side:

  1. Prob(P |D) = 1.0
  2. Prob(D) = .01

We can compute Prob(P) as follows, since a positive result comes either from a diseased subject or from a healthy one:

    Prob(P) = Prob(P | D) * Prob(D) + Prob(P | H) * Prob(H)
            = 1.0 * .01 + .05 * .99
            = .01 + .0495 = .0595

We now have all the numbers to plug into Bayes' rule:

    Prob(D | P) = 1.0 * .01 ÷ .0595 ≅ .168

Conclusion: A positive test result gives us a little better than a one in six chance that the subject has the disease!

Summarizing what we know:

    • Prob(P | D) = 1.0
    • Prob(P | H) = .05
    • Prob(D) = .01
    • Prob(P) = .0595

    • Prob(D | P) = .168

What's going on? Rounding off some: since Prob(P | D) = 1, Prob(D | P) = Prob(D) ÷ Prob(P) ≅ .01 ÷ .06 = 1/6. So the ratio of the disease rate to the positive rate is what shrinks our result. The large size of Prob(P) is in turn mostly due to the false positive rate. Conclusion: This test has WAY too large a false positive rate for a disease this rare.
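
The whole calculation fits in a few lines. A sketch, assuming only the three given numbers (variable names are ours):

    # The given facts.
    p_pos_given_d = 1.00    # Prob(P | D): the diseased always test positive
    p_pos_given_h = 0.05    # Prob(P | H): false positive rate among the healthy
    p_d = 0.01              # Prob(D): disease rate in the population
    p_h = 1 - p_d           # Prob(H)

    # Total probability of a positive test, then Bayes' Rule.
    p_pos = p_pos_given_d * p_d + p_pos_given_h * p_h   # 0.0595
    p_d_given_pos = p_pos_given_d * p_d / p_pos
    print(round(p_d_given_pos, 3))                      # 0.168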

 
A Similar Test: Population of 300

[Figure: a 300-cell grid of subjects. D = bonafide positive (diseased, tests positive P); H = false positive (healthy, tests positive); everyone else tests negative (N). Of the 300 subjects, 3 are diseased and all 3 test positive; 15 of the 297 healthy subjects also test positive, so 18 test positive in all.]

    False positive rate: 15/297 ≅ 5%
    Prob(H | P) = 15/18 = 83 1/3 %
    Prob(D | P) = 3/18 = 16 2/3 %