Chapter 7
Information Theory

7.1  Redundancy

Information for this chapter was taken from Steven Roman's book Coding and information theory, Springer, 1992, pp.1-4.

A common model for a communications channel is the binary symmetric channel. This is defined as follows: the input data running through the binary symmetric channel is a string of bits (zeros and ones), and the channel is `noisy', that is, there is a non-zero constant probability p < 1/2 that a zero is changed to a one, or vice versa. We say that a channel error or bit error occurs with probability p, no matter what the input bit is.

If the receiver simply decides to accept a received bit as correct, then the probability of making a decision error on each bit of the input data is p.

We want a scheme to enable us to detect and possibly correct such errors. This is achieved by introducing redundancy into the input data. One way is to use a repetition code:

Suppose that for each 0 in the original input string we transmit 000, and for each 1 in the input string we transmit 111. If, say, a triple 010 is received, the decoder must make a decision on whether 000 or 111 has been sent. Suppose the decoder uses the `majority decision rule', that is, the decoder decides that the original bit was a 0 because there are more zeros than ones in the received triple 010.

This decision will be correct if the original triple was 000 and the second bit was changed, from 0 to 1, but will be incorrect if the original triple was 111 and the first and third bits were changed from 1 to 0.

Thus the `majority decision rule' (and indeed any decision rule) involves a decision error.

If we assume that errors occur independently (which is a convenient assumption, but not always reasonable) then the probability perror of making a decision error with the procedure described above is the probability that 2 or 3 bits in a triple have changed, and this equals
perror = æ
è
3
2
ö
ø
p2(1-p) + æ
è
3
3
ö
ø
p3 = 3p2 -2p3.
Note that perror is always less than p because p < 1/2.

Conclusion 1. Encoding the original source data by introducing redundancy may reduce the probability of decision error.

A general repetition code involves transmitting a string 00... 0 of length 2n+1 for each input bit 0, and transmitting a string 11... 1 of length 2n+1 for each input bit 1. We use an odd number 2n+1 to enable us to use the `majority decoding rule': if a received string of odd length 2n+1 contains more zeros than ones then we decode it as 0, and vice versa.

Thus the probability perror of making a decision error is the probability that there are at least n+1 bit errors in the received string. The number of bit errors is a binomial random variable with parameters (2n+1, p) and the theory of binomial random variables tells us


    (i) that the expected number of errors is (2n+1)p, which is less than n+1/2 since p < 1/2, and
    (ii) that perror ® 0 as n®¥.

However, using 2n+1 units of channel time to send one source bit is very expensive. One measure of this transmission cost is
rate of transmision =  length of source message

length of encoded message
For the general repetition code, the rate of source transmission is 1/(2n+1).

Conclusion 2. The probability of decision error can be made arbitrarily small at the expense of lowering the rate of transmission.

However the rate of transmission might become unacceptably small.

Basic Idea. We usually group the source data, one or several bits at a time, and encode each group of bits into a codeword by an encoding scheme (such as a repetition code) which assigns a codeword to each group from the source message. This encoding adds redundancy to enable detection of errors, and possibly also correction of errors.

Source     ®1011    encoder    ®1011010     channel (may introduce errors)  ®1010010    decoder (corrects errors)    ®1011 receiver.

There is a remarkable result due to Shannon (1948), and father of Information Theory. Claude Shannon was the mathematician who laid the foundations of Information Theory while working at Bell Labs in the US. He died this year (2001) at the age of 84. You can find an obituary for him at the

Bell Labs web site.

Shannon's Noisy Channel Theorem For every channel there is a number C called the capacity of the channel such that, if it is acceptable to have a rate of transmission less than C, then there exists an encoding scheme that will reduce the probability of decision error to any desired (low) level.

It turns out that, for the binary symmetric channel,
C=1+plog2p+(1-p)log2(1-p)
where p is the probability of bit error. (Here log2 p, the `log to base 2 of p', is the real number a such that p=2a.)

For example, if p=0.01 then C ~ 0.919.

Problems 1. Shannon's Noisy Channel Theorem does not give a construction for these good encoding schemes. The proof only shows that they exist. No one has yet discovered how to construct them.

2. There are other desirable qualities for encoding schemes that compete with the quality of having a very small probability of decision error. For example, we would like to have easy encoding and decoding procedures. Also we want the codewords to have moderate length (and as we saw with repetition codes, sometimes a small decision error probability requires very long codewords).

7.2  Information and uncertainty

Shannon(1948) introduced a function to measure the uncertainty involved in making random selections or sampling from a finite set, and he wanted also to measure the information one could gain from such sampling.

Suppose we will make a random selection from a finite set S={x1,...,xn} according to a probability distribution P ={p1,...,pn}, that is the probability of selecting xi on a single random selection is p(xi)=pi for each i (so 0 £ pi £ 1 and åipi=1).

Before making the random selection there is possibly some uncertainty about the outcome. After making the selection there is possibly some gain in information. Shannon wanted to formalise these intuitive notions.

One extreme case: p1=1, and all other pi=0. Here there is no uncertainty and no gain in information from the outcome.

Another almost extreme case: `Most' pi=0. Here we have a `little' uncertainty and we gain a `little' information from the outcome.

We have maximum uncertainty if p1=... = pn=1/n, and we gain maximum information from the outcome.

Shannon wanted an `uncertainty function' H(P)=H(p1,...,pn) that would give some measure of the uncertainty associated with random selection. It should depend only on the probability distribution P, and it should have the following properties.

Property 1. H is continuous.

Reason: Small changes in P should only change the value of H slightly.

Property 2. H([ 1/n],...,[ 1/n]) < H([ 1/(n+1)],...,[ 1/(n+1)]).

Reason: When all the pi are equal then increasing the number of outcomes increases the uncertainty.

Property 3. Whenever n=b1+...+bk with each bi a positive integer,
H(  1

n
,...,  1

n
) = H(  b1

n
,...,  bk

n
)+ k
å
i=1 
 bi

n
H(  1

bi
,...,  1

bi
)
.

Reason: Making a random selection by the following two step process should not change the uncertainty. Partition the set S={x1,...,xn} of size n into k blocks S={B1|B2|...|Bk} with |Bi|=bi. First choose one of the k blocks in such a way that the probability of choosing Bi is bi/n (i.e. proportional to |Bi|) - the first summand on the right hand side gives the uncertainty for this choice. Next choose an element from the selected Bi such that each of the bi elements of Bi has probability 1/bi of being chosen. The uncertainty of this second choice is H([ 1/(bi)],...,[ 1/(bi)]) and the second term on the right hand side is the average uncertainty of this second step over all the possible outcomes of the first step, i.e. åi=1kP(Bi)×(Uncertainty inchoosing from Bi).

Shannon's Uncertainty Theorem. A function H satisfies properties 1, 2 and 3 if and only if H=Hb for some b > 1, where
Hb(p1,...,pn)=- n
å
i=1 
pilogb pi= n
å
i=1 
pilogb  1

pi
and we set plogbp=0 for p=0.

The Proof of this theorem is not required for this unit. For anyone interested a proof can be found, for example, in Steven Roman's book Coding and information theory, Springer, 1992, pp.13-15.

Entropy The quantity Hb(p1,...,pn) is called the base b entropy of the distribution P ={p1,...,pn}. When b=2 we often say that H2(p1,...,pn) is the binary entropy of P.

Note the entropy measures both the amount of uncertainty before sampling, and the amount of information obtained by sampling.

Recall how logs work: for a positive real number b and a non-zero real number x, logb x is defined as the real number a such that x=ba. This is related to the natural log of x, that is log to base e, as follows: x=ba  Þ logex=loge(ba)=alogeb = logb x·loge b  Þ 


logb x = logex/loge b.
Don't forget that -logb x = logb(1/x) for all b and x.

Units of Entropy: If S={0,1,...,k-1}, then sampling from S with equal probabilities p1=p2=... = pk=[ 1/k] gives an amount of information that we will take to be `one k-ary unit of information'. Since
Hk(  1

k
,...,  1

k
)=- k
å
i=1 
 1

k
logk  1

k
=1
the base k entropy function Hk measures the number of k-ary units of information. In particular the binary entropy function H2 measures information in binary units or bits. If we sampled from S={0,1} with equal probability then one random selection would give one bit of information.

Note. In many books the term `binary units' is abbreviated to `binits' or `bits'. However this use of the word `bit' is different from our common use of `bit' for a zero or 1 in a binary string.

Example 1 Sampling from the set S={x1,x2,x3} with probabilities p1=p2=p3 = [ 1/3] gives
H2(  1

3
,  1

3
,  1

3
)=  1

3
log23+  1

3
log23+  1

3
log23=log23 » 1.585 bits
while sampling from S={x1,x2,x3} with probabilities p1=p2=[ 1/4] and p3=[ 1/2] gives
H2(  1

4
,  1

4
,  1

2
)=  1

4
log24+  1

4
log24+  1

2
log22 = 1.5 bits.
With the second sampling we are more certain about the outcome (why?) and so it makes sense that the information obtained is less.

Example 2 In Table 2.6 we gave theprobabilities of occurrence of each of the letters A to Z in English text. If we take these probabilities pA,...,pZ as the probability distribution P for random sampling from a piece of English text, and compute H2(pA,...,pZ) then the result is roughly 4. Thus on the average we get 4 bits of information by sampling a single letter from English text.

7.3  Entropy

`The' entropy function. The important entropy function
H2(p,1-p)
=
-plog2p -(1-p)log2(1-p)
=
plog2  1

p
+(1-p)log2  1

1-p
is called the entropy function and is often written as H(p). It turns out that this is the entropy function associated with a binary symmetric communication channel.

Uncertainty in transmitting one bit. For each bit transmitted by a binary symmetric channel there is a bit error probability p that an error will occur, that is, if 0 is transmitted then with probability p a 1 is received, and vice versa. We can think of each received bit being a random sample from the set S={ error, no error }, where the probabilty of error is p. The probability distribution corresponding to this is P ={p,1-p}, and its entropy is H2(P)=H(p).

Example If p=0.01 then H(0.01)=-0.01log20.01 -0.99log20.99 ~ 0.081. In fact for every value of p such that 0 < p < 1/2 we get a value H(p) strictly between 0 and 1 (see Exercise Set 7.4, question 6).

Information carried by one bit. Suppose a computer is waiting to receive a bit sent along a binary symmetric channel with bit error probability p. Before the bit is received the computer is ready to receive either a 0 or a 1, with equal probability. So the uncertainty before the bit is received is
Hbefore = H(  1

2
) = H2(  1

2
,  1

2
)=-  1

2
log2  1

2
- (1-  1

2
)log2(1-  1

2
) = -log2  1

2
= 1.
The uncertainty associated with receiving the bit is Hafter=H(p). The reason we may still have some uncertainty is that the received bit may not be equal to the bit that was sent. This uncertainty H(p) measures the amount of `noise' in the channel.

We call the difference Hbefore-Hafter=1-H(p) the information obtained from receiving the bit. If there was no noise, that is, if p=0, then H(p)=0 and the information obtained is precisely 1-0=1 binary unit of information. On the other hand if, say, p=0.01 then we obtain 1-H(0.01)=1-0.081=0.919 binary units of information for each bit transmitted.

Channel Capacity. We say that a binary symmetric channel with bit error probability p has channel capacity C=1-H(p) binary units per bit. This measures the information obtained by receipt of one bit over the channel. [The definition of capacity for other channels is more complicated than this.] The channel capacity is the constant occurring in Shannon's Noisy Channel Theorem.

7.4  Exercise Set: Information Theory


1. What are the codewords in a repetition code where each input bit is encoded as a string of length 5? Compute the probability perror of making a decision error where the bit error is p=1/4.


2. Compute (a) H2(1/8,1/8,3/4), (b) H2(1/3,2/3), (c) H2(1,0,0). (Recall the convention that plogbp=0 for p=0.)


3. Suppose we toss a fair coin, and if the outcome is `heads' then we toss it again. How much uncertainty is there in the outcome? (Use the binary entropy function.)


4. Find a relationship between Hb(P) and Hc(P).


5. Prove that `the entropy function' H(x) = -xlog2x-(1-x)log2(1-x) (0 < x < 1) is symmetric about x=1/2, that is, H(1/2-x)=H(1/2+x) for 0 < x < 1/2.


6. (a) Compute the derivative of `the entropy function' H(x)=-xlog2x-(1-x)log2(1-x), for 0 < x < 1. (b) Prove that the maximum value for H(x) on the closed interval [0,1] occurs at x=1/2. Sketch a graph of H(x) on this interval. (c) Prove that, for each p such that 0 < p < 1/2, we have 0 < H(p) < 1. [Recall that log2x=logex/loge2; also recall the convention that plog2p=0 when p=0.]


7. [From Roman, p20] The accuracy of a certain radio station's weather man at predicting rain is given by the following chart.

Actual rainActual no rain
Predicts rain1/121/6
Predicts no rain1/122/3

For example, 1/12 of the time the weatherman predicts rain when in fact it does rain. Notice that the weatherman is correct 3/4 of the time. An uninformed listener observes that he could be correct 5/6 of the time by simply always predicting no rain. He applies for the weatherman's job. However the station manager declines to hire the listener. Why?

7.5  Exercise Solutions: Information Theory


1. The codewords are 00000 and 11111; perror is the probability of making at least three bit errors. So
perror
=
æ
è
5
3
ö
ø
æ
è
 1

4
ö
ø
3
 
æ
è
 3

4
ö
ø
2
 
+ æ
è
5
4
ö
ø
æ
è
 1

4
ö
ø
4
 
æ
è
 3

4
ö
ø
+ æ
è
5
5
ö
ø
æ
è
 1

4
ö
ø
5
 
=
 1

45
æ
è
10·32+5·3+1 ö
ø
+  106

45
=  53

128
=0.414


2. (a)
H2(  1

8
,  1

8
,  3

4
) = -  1

8
log2  1

8
-  1

8
log2  1

8
-  3

4
log2  3

4
» 1.06
bits. Note -log2[ 1/8]=log28=log2 23=3, etc.

(b)
H2(  1

3
,  2

3
)=-  1

3
log2  1

3
-  2

3
log2  2

3
» 0.918 bits.
(c) H2(1,0,0)=-1·log21-0-0=0.


3. The possible outcomes are (i) tails on the first toss with probability p1=1/2, (ii) heads on the first toss followed by tails when we toss again, with probability p2=(1/2)×(1/2)=1/4, and (iii) heads on the first toss followed by heads when we toss again, with probability p3=(1/2)×(1/2)=1/4. The uncertainty is H2(1/2,1/4,1/4) which we computed as 1.5 bits in Example 7.2.1.


4. Now Hb(P)=-åi pilogb pi and Hc(P)=-åi pilogc pi. Since logb pi=loge pi/loge b = (loge pi/loge c)·(loge c/logeb)=logc pi·(logec/logeb), for each i, we get Hb(P)=Hc(P)·(loge c/logeb).


5. H(1/2+x)=-(1/2+x)log2(1/2+x)-(1-(1/2+x))log2(1-(1/2+x)) = -(1-(1/2-x))log2(1-(1/2-x)) -(1/2-x)log1(1/2-x)=H(1/2-x).


6. Since log2x=clogx where c=1/log2 and we write logx=logex, we have H(x)=-cxlogx -c(1-x)logx. For 0 < x < 1, using the product rule for differentiation,
H¢(x) = -clogx -cx·  1

x
+clog(1-x) -c(1-x)·  1

1-x
·(-1) = clog(  1-x

x
).

Now H(x) is differentiable on (0,1), and H¢(x)=0  Û log([(1-x)/x]) = 0  Û [(1-x)/x]=1  Û x=1/2. Thus there is a local extremum at x=1/2. Note that 0 < x < 1/2  Þ [(1-x)/x] = [ 1/x]-1 > 2-1=1  Þ log([(1-x)/x]) > log1=0  Þ H¢(x) > 0  Þ H(x) is increasing. By question 5, H(x) is symmetric about x=1/2 so H(x) is decreasing for 1/2 < x < 1. Hence H(x) has a local maximum at x=1/2. The values at the endpoints are H(0)=H(1)=0 so H(x) has a global maximum at x=1/2 and H(1/2) = -[ 1/2]log2[ 1/2]-[ 1/2]log2[ 1/2] = -log2[ 1/2] = 1. Thus for 0 < p < 1/2, we have 0 < H(p) < 1.


7. The station manager made his decision based on the relative amounts of information that the weatherman and the listener could provide. Let's model this by assuming that there are four possible events: RR=`actual rain and rain predicted'; RN=`actual rain and no rain predicted'; NR=`actual no rain and rain predicted'; NN=`actual no rain and no rain predicted'. From the data given the probability of rain is 1/6, and of no rain is 5/6.

For the weatherman the probability distribution PW to be used is pRR=1/12, pRN=1/12, pNR=1/6, pNN=2/3 (from the table given). The information in binary units conveyed by a single announcement is therefore
H2(PW) = -  1

12
log2(  1

12
)-  1

12
log2(  1

12
)-  1

6
log2(  1

6
)-  2

3
log2(  2

3
) » 1.4183.

For the uninformed listener the probability distribution PL to be used is pRR=0, pRN=1/6, pNR=0, pNN=5/6. The information in binary units conveyed by a single announcement is therefore
H2(PL) = -  1

6
log2(  1

6
)-  5

6
log2(  5

6
) » 0.6500.
Since the weatherman was able to give greater information from his forecasts, the station manager chose to retain him in the job.


Cheryl E. Praeger
File translated from TEX by TTH,version 3.
On Sat Oct 27 09:46:32 2001.

Contents

1  Introduction
    1.1  CODES
    1.2  CIPHERS
2  Classical cryptography
    2.1  The basic idea
    2.2  A simple cryptosystem
    2.3  Shift ciphers
    2.4  Using a Shift Cipher
    2.5  Substitution ciphers
    2.6  Polyalphabetic, block and stream ciphers
    2.7  Cryptanalysis
    2.8  Exercise Set: Classical Ciphers
    2.9  Exercise Solutions: Classical Ciphers
3  Modular Arithmetic
    3.1  Arithmetic mod n
    3.2  GCD's
    3.3  Inverses and Euler's phi function
    3.4  Polynomials
    3.5  Finite fields
    3.6  Fermat and Euler
    3.7  Vector spaces
    3.8  Exercise Set: Modular arithmetic
    3.9  Solutions: ModArith
4  Linear Feedback Shift Registers
    4.1  Keystreams
    4.2  Primitive polynomials
    4.3  Randomness
    4.4  Exercise Set: LFSR's
    4.5  Solutions: LFSR's
5  Data Encryption Standard
    5.1  Secret key cryptography
    5.2  DES
    5.3  Modes of operation
    5.4  The DES Controversy
    5.5  Uses of DES
    5.6  Modern developments
6  Public Key Cryptosystems
    6.1  Public keys
    6.2  One-way functions
    6.3  The RSA cryptosystem
    6.4  Primes and factorisation
    6.5  Some related topics
    6.6  Exercise Set: Public Key Cryptography
    6.7  Exercise Solutions: Public Key Cryptography
7  Information Theory
    7.1  Redundancy
    7.2  Information and uncertainty
    7.3  Entropy
    7.4  Exercise Set: Information Theory
    7.5  Exercise Solutions: Information Theory
8  Codes: an introduction
    8.1  Block codes
    8.2  ISBN Numbers
    8.3  ASCII code
    8.4  Hamming distance
    8.5  Maximum likelihood
    8.6  Exercise Set: Codes I
    8.7  Exercise Solutions: Codes I
9  Linear Codes
    9.1  Linear (n,k) codes
    9.2  Dual Codes
    9.3  Parity check matrices
    9.4  Information about Hamming distance
    9.5  Exercise Set: Codes II
    9.6  Ex. Solns: Codes II
10  Families of codes and perfect codes
    10.1  Hamming codes
    10.2  Perfect codes and the Sphere Packing Bound
    10.3  Golay codes
    10.4  The (r+1,r) parity check code.
    10.5  Repetition codes
    10.6  Hadamard codes
11  Decoding with parity check matrices
    11.1  The Set-up
    11.2  Cosets of codes
    11.3  Syndromes
12  In the News
    12.1  E-books
    12.2  The Cayley-Purser algorithm
13  Assignments and Test
    13.1  Assignment 1: Cryptography
    13.2  Assignment 1 SOLUTIONS
    13.3  Assignment 2: Public key Cryptography and Codes
    13.4  Assignment 2 SOLUTIONS
    13.5  Test: 3CC
    13.6  Test Solutions
    13.7  2001 Sample Exam