Chapter 7
|
|
Conclusion 1. Encoding the original source data by introducing
redundancy may reduce the probability of decision error.
A general repetition code involves transmitting a string 00... 0
of length 2n+1 for each input bit 0, and transmitting a string 11... 1 of
length 2n+1 for each input bit 1. We use an odd number 2n+1 to enable us to
use the `majority decoding rule': if a received string of odd length 2n+1
contains more zeros than ones then we decode it as 0, and vice versa.
Thus the probability perror of making a decision error is the probability that there are at least n+1 bit errors in the received string. The number of bit errors is a binomial random variable with parameters (2n+1, p) and the theory of binomial random variables tells us
However, using 2n+1 units of channel time to send one source bit is very expensive.
One measure of this transmission cost is
|
Conclusion 2. The probability of decision error can be made arbitrarily
small at the expense of lowering the rate of transmission.
However the rate of transmission might become unacceptably small.
Basic Idea. We usually group the source data, one or several bits
at a time, and
encode each group of bits into a codeword by an encoding scheme
(such as a repetition code) which assigns a codeword to each group from the
source message. This encoding adds redundancy to enable
detection of errors, and possibly also correction of errors.
Source ®1011 encoder ®1011010
channel (may introduce errors) ®1010010
decoder (corrects errors) ®1011 receiver.
There is a remarkable result due to Shannon (1948), and father of Information Theory. Claude Shannon was the mathematician who laid the foundations of Information Theory while working at Bell Labs in the US. He died this year (2001) at the age of 84. You can find an obituary for him at the
Shannon's Noisy Channel Theorem For every channel there is a number
C called the capacity of the channel such that, if it is
acceptable to have a rate of transmission less than C, then there exists an
encoding scheme that will reduce the probability of decision error to any
desired (low) level.
It turns out that, for the binary symmetric channel,
|
For example, if p=0.01 then C ~ 0.919.
Problems 1. Shannon's Noisy Channel Theorem does not give a
construction for these good encoding schemes. The proof only shows that they
exist. No one has yet discovered how to construct them.
2. There are other desirable qualities for encoding schemes that compete with the quality of having a very small probability of decision error. For example, we would like to have easy encoding and decoding procedures. Also we want the codewords to have moderate length (and as we saw with repetition codes, sometimes a small decision error probability requires very long codewords).
Shannon(1948) introduced a function to measure the uncertainty involved in
making random selections or sampling from a finite set, and he wanted also to
measure the information one could gain from such sampling.
Suppose we will make a random selection from a finite set S={x1,...,xn}
according to a probability distribution P
={p1,...,pn}, that is the
probability of selecting xi on a single random selection is
p(xi)=pi for each i (so 0 £ pi £ 1 and åipi=1).
Before making the random selection there is possibly some uncertainty about the
outcome. After making the selection there is possibly some gain in information.
Shannon wanted to formalise these intuitive notions.
One extreme case: p1=1, and all other pi=0. Here there is no uncertainty
and no gain in information from the outcome.
Another almost extreme case: `Most' pi=0. Here we have a `little'
uncertainty and we gain a `little' information from the outcome.
We have maximum uncertainty if p1=... = pn=1/n, and we gain maximum
information from the outcome.
Shannon wanted an `uncertainty function' H(P)=H(p1,...,pn) that would
give some measure of the uncertainty associated with random selection.
It should depend only on the probability distribution P, and
it should have the following properties.
Property 1. H is continuous.
Reason: Small changes in P should only change the
value of H slightly.
Property 2.
H([ 1/n],...,[ 1/n]) < H([ 1/(n+1)],...,[ 1/(n+1)]).
Reason: When all the pi are equal then increasing the
number of outcomes increases the uncertainty.
Property 3. Whenever n=b1+...+bk with each bi a positive integer,
.
H(
1
,...,
1
) = H(
b1
,...,
bk
)+
k
å
i=1
bi
H(
1
,...,
1
)
Reason: Making a random selection by the following two step process
should not change the uncertainty. Partition the set S={x1,...,xn} of
size n into k blocks S={B1|B2|...|Bk} with |Bi|=bi. First
choose one of the k blocks in such a way that the probability of choosing
Bi is bi/n (i.e. proportional to |Bi|) - the first summand on the
right hand side gives the uncertainty for this choice. Next choose an element
from the selected Bi such that each of the bi elements of Bi has
probability 1/bi of being chosen. The uncertainty of this second choice is
H([ 1/(bi)],...,[ 1/(bi)]) and the second term on the right hand
side is the average uncertainty of this second step over all the possible
outcomes of the first step, i.e. åi=1kP(Bi)×(Uncertainty inchoosing from Bi).
Shannon's Uncertainty Theorem. A function H satisfies properties
1, 2 and 3 if and only if H=Hb for some b > 1, where
|
The Proof of this theorem is not required for this unit. For anyone
interested a proof
can be found, for example, in Steven Roman's book Coding and
information theory, Springer, 1992, pp.13-15.
Entropy The quantity Hb(p1,...,pn) is called the
base b entropy of the distribution P
={p1,...,pn}. When b=2 we
often say that H2(p1,...,pn) is the binary entropy of
P.
Note the entropy measures both the amount of uncertainty before
sampling, and the amount of information obtained by sampling.
Recall how logs work: for a positive real number b and a non-zero real number x, logb x is defined as the real number a such that x=ba. This is related to the natural log of x, that is log to base e, as follows: x=ba Þ logex=loge(ba)=alogeb = logb x·loge b Þ
|
Units of Entropy: If S={0,1,...,k-1}, then sampling
from S with equal probabilities p1=p2=... = pk=[ 1/k] gives an
amount of information that we will take to be `one k-ary unit of information'.
Since
|
Note. In many books the term `binary units' is abbreviated to
`binits' or `bits'. However this use of the word `bit' is different from
our common use of `bit' for a zero or 1 in a binary string.
Example 1
Sampling from the set S={x1,x2,x3} with probabilities p1=p2=p3 = [ 1/3] gives
while sampling from S={x1,x2,x3} with probabilities p1=p2=[ 1/4]
and p3=[ 1/2] gives
H2(
1
,
1
,
1
)=
1
log23+
1
log23+
1
log23=log23 » 1.585 bits
With the second sampling we are more certain about the outcome (why?) and
so it makes sense that the information obtained is less.
H2(
1
,
1
,
1
)=
1
log24+
1
log24+
1
log22 = 1.5 bits.
`The' entropy function. The important entropy function
|
Uncertainty in transmitting one bit. For each bit transmitted by a
binary symmetric channel there is a bit error probability p that an error will
occur, that is, if 0 is transmitted then with probability p a 1 is received,
and vice versa. We can think of each received bit being a random sample from
the set S={ error, no error }, where the probabilty of error is p. The
probability distribution corresponding to this is P
={p,1-p}, and its
entropy is H2(P)=H(p).
Example If p=0.01 then H(0.01)=-0.01log20.01 -0.99log20.99 ~ 0.081. In fact for every value of p such that 0 < p < 1/2 we get a value
H(p) strictly between 0 and 1 (see Exercise Set 7.4, question 6).
Information carried by one bit. Suppose a computer
is waiting to receive a bit sent along a binary symmetric channel with
bit error probability p. Before the bit is received the computer is ready to
receive either a 0 or a 1, with equal probability. So the uncertainty before the
bit is received is
|
We call the difference Hbefore-Hafter=1-H(p) the
information obtained from receiving the bit.
If there was no noise, that is, if p=0, then H(p)=0 and the information
obtained is precisely 1-0=1 binary unit of
information. On the other hand if, say, p=0.01 then we obtain
1-H(0.01)=1-0.081=0.919 binary units of information for each bit transmitted.
Channel Capacity.
We say that a binary symmetric channel with bit error probability p has
channel capacity C=1-H(p) binary units per bit. This measures
the information obtained by receipt of one bit over the channel. [The definition of
capacity for other channels is more complicated than this.]
The channel capacity is the constant occurring in Shannon's Noisy Channel
Theorem.
1. What are the codewords in a repetition code where
each input bit is encoded as a string of length 5? Compute the probability
perror of making a decision error where the bit error is p=1/4.
2. Compute (a) H2(1/8,1/8,3/4), (b) H2(1/3,2/3),
(c) H2(1,0,0). (Recall the convention that plogbp=0 for p=0.)
3. Suppose we toss a fair coin, and if the outcome is
`heads' then we toss it again. How much uncertainty is there in the outcome?
(Use the binary entropy function.)
4. Find a relationship between Hb(P) and
Hc(P).
5. Prove that `the entropy function' H(x) = -xlog2x-(1-x)log2(1-x) (0 < x < 1) is symmetric about x=1/2, that is,
H(1/2-x)=H(1/2+x) for 0 < x < 1/2.
6. (a) Compute the derivative of `the entropy function'
H(x)=-xlog2x-(1-x)log2(1-x), for 0 < x < 1. (b) Prove that the maximum
value for H(x) on the closed interval [0,1] occurs at x=1/2. Sketch a
graph of H(x) on this interval. (c) Prove that, for each p such that
0 < p < 1/2, we have 0 < H(p) < 1. [Recall that log2x=logex/loge2; also
recall the convention that plog2p=0 when p=0.]
7. [From Roman, p20] The accuracy of a certain radio
station's weather man at predicting rain is given by the following chart.
| Actual rain | Actual no rain | |
| Predicts rain | 1/12 | 1/6 |
| Predicts no rain | 1/12 | 2/3 |
For example, 1/12 of the time the weatherman predicts rain when in fact it does rain. Notice that the weatherman is correct 3/4 of the time. An uninformed listener observes that he could be correct 5/6 of the time by simply always predicting no rain. He applies for the weatherman's job. However the station manager declines to hire the listener. Why?
1. The codewords are 00000 and 11111; perror is
the probability of making at least three bit errors. So
|
2. (a)
|
(b)
|
3. The possible outcomes are (i) tails on the first toss
with probability p1=1/2, (ii) heads on the first toss followed by tails when
we toss again, with probability p2=(1/2)×(1/2)=1/4, and (iii) heads on
the first toss followed by heads when we toss again, with probability
p3=(1/2)×(1/2)=1/4. The uncertainty is H2(1/2,1/4,1/4) which we
computed as 1.5 bits in Example 7.2.1.
4. Now Hb(P)=-åi pilogb pi and
Hc(P)=-åi pilogc pi. Since logb pi=loge pi/loge b = (loge pi/loge c)·(loge c/logeb)=logc pi·(logec/logeb), for each i, we get Hb(P)=Hc(P)·(loge c/logeb).
5. H(1/2+x)=-(1/2+x)log2(1/2+x)-(1-(1/2+x))log2(1-(1/2+x)) = -(1-(1/2-x))log2(1-(1/2-x)) -(1/2-x)log1(1/2-x)=H(1/2-x).
6. Since log2x=clogx where c=1/log2 and
we write logx=logex, we have H(x)=-cxlogx -c(1-x)logx.
For 0 < x < 1, using the product rule for differentiation,
|
Now H(x) is differentiable on (0,1), and H¢(x)=0 Û log([(1-x)/x]) = 0 Û [(1-x)/x]=1 Û x=1/2. Thus there is a local extremum at x=1/2. Note that 0 < x < 1/2 Þ [(1-x)/x] = [ 1/x]-1 > 2-1=1 Þ log([(1-x)/x]) > log1=0 Þ H¢(x) > 0 Þ H(x) is increasing. By question 5, H(x) is symmetric about x=1/2 so H(x) is decreasing for 1/2 < x < 1. Hence H(x) has a local maximum at x=1/2. The values at the endpoints are H(0)=H(1)=0 so H(x) has a global maximum at x=1/2 and H(1/2) = -[ 1/2]log2[ 1/2]-[ 1/2]log2[ 1/2] = -log2[ 1/2] = 1. Thus for 0 < p < 1/2, we have 0 < H(p) < 1.
7. The station manager made his decision based on the
relative amounts of information that the weatherman and the listener could
provide. Let's model this by assuming that there are four possible events:
RR=`actual rain and rain predicted'; RN=`actual rain and no rain predicted';
NR=`actual no rain and rain predicted'; NN=`actual no rain and no rain
predicted'. From the data given the probability of rain is 1/6, and of no
rain is 5/6.
For the weatherman the probability distribution PW to be
used is pRR=1/12, pRN=1/12, pNR=1/6, pNN=2/3 (from the
table given). The information
in binary units conveyed by a single announcement is therefore
|
For the uninformed listener the probability distribution PL to be
used is pRR=0, pRN=1/6, pNR=0, pNN=5/6. The information
in binary units conveyed by a single announcement is therefore
|