Announcements‎ > ‎

Information Theory

posted Aug 8, 2010, 4:09 AM by Olaf Bochmann   [ updated Nov 10, 2010, 9:22 AM ]

Entropy and Information

Information is reduction in uncertainty and has nothing to do with knowledge. Imagine you are about to observe the outcome of a coin flip and the outcome of a die roll. Which event contains more information? 

After Abramson, the information contained in the outcome of an event E with probability P(E) is equal to log 1/P(E) bits of information. For the unit bits we use log base 2. The result of a fair coin flip we get (log 2 = 1 bit) and for the die roll (log 6 2.585 bits).


Now imagine a zero-memory information source X. The source emits symbols from an alphabet {x_1, x_2, . . . , x_k} with probabilities {p_1, p_2, . . . , p_k}. The symbols emitted are statistically independent. What is the average amount of information in observing the output of the source X?

Shannon formulated the most fundamental notion in information theory for a discrete random variable, taking values from $\mathcal{X}$. The entropy of X is

Alternative explanations of entropy can be the average amount of info provided per symbol, average amount of surprise when observing a symbol, uncertainty an observer has before seeing the symbol, avg # of bits needed to communicate each symbol. 


  • H[X]>=0, and = 0 only when Pr (X = x) = 1 for some x
  • H[X] is maximal when all X are equally probable, and then H[X] = log2 #\mathcal{X}H[f (X)] 
  • H[X], equality if and only if f is 1-1


H[X] measures:

  • How random X is
  • How variable X is
  • How uncertain we should be about X

“paleface” problem
consistent resolution leads to a completely subjective probability theory

So, a sequence of (independent) 0’s-and-1’s can provide up to 1 bit of information per digit, provided the 0’s and 1’s are equally likely at any point. If they are not equally likely, the sequence provides less information and can be compressed.

Description Length

H[X] = how concisely can we describe X?

Imagine X as text message:

  • in Reno
  • in Reno send money 
  • in Reno divorce final 
  • marry me? 
  • in Reno send lawyers guns and money kthxbai 

Known and finite number of possible messages (#X). I know what X is but won’t show it to you. You can guess it by asking yes/no (binary) questions

First goal: ask as few questions as possible

  • Starting with “is it y?” is optimal iff X = y
  • Can always achieve no worse than log2 #X questions

New goal: minimize the mean number of questions

  • Ask about more probable messages first
  • Still takes -log2 Pr (X = x) questions to reach x
  • Mean is then H[X]

Theorem: H[X] is the minimum mean number of binary distinctions needed to describe X. (Units of H[X] are bits)

Multiple Variables

Joint entropy of two variables X and Y:

Entropy of joint distribution: This is the minimum mean length to describe both X and Y

  • H[X;Y] >=. H[X]
  • H[X;Y] .<= H[X] + H[Y]
  • H[f (X);X] = H[X]

Entropy and Ergodicity

(Dynamical systems as information sources, long-run randomness)

Relative Entropy and Statistics

(The connection to statistics)


Cover and Thomas (1991) is the best single book on information theory.



Research Groups

Olaf Bochmann,
Nov 10, 2010, 8:08 AM