Information theory is based on probability and statistics.
Major application/goal:
A communication system: (symbolically represented as)
message - signal - (received) signal - message
coding process - noise - decoding process
information source - transmitter - channel - receiver - destination
Note: A coding process is the action of the transmitter in changing the message into the signal.
Figure: Shannon's diagram of a general communications system
Questions concerning this communication system:
Information is a measure of one’s freedom of choice when one selects a message. If one is confronted with a very elementary situation where he has to choose one of two alternative messages, then it is arbitrarily said that the information, associated with this situation, is unity.
The real reason that Level A (technical problem) analysis deals with a concept of information which characterizes the whole statistical nature of the information source, and is not concerned with the individual messages (and not at all directly concerned with the meaning of the individual messages) is that from the point of view of engineering, a communication system must face the problem of handling any message that the source can produce. … This sort of consideration leads at once to the necessity of characterizing the statistical nature of the whole ensemble of messages which a given kind of source can and will produce.
The ratio of the actual to the maximum entropy is called the relative entropy of the source.
One minus the relative entropy is called the redundancy. This is the fraction of the structure of the message which is determined not by the free choice of the sender, but rather by the accepted statistical rules governing the use of the symbols in question. [Just like the grammar to a language.]
The concept of information developed in this theory at first seems disappointing and bizarre -- disappointing because it has nothing to do with meaning, and bizarre because it deals not with a single message but rather with the statistical character of a whole ensemble of messages, bizarre also because in these statistical terms the two words information and uncertainty find themselves to be partners.
The capacity of a channel is to be described in terms of its ability to transmit what is produced out of source of a given information.
For a noiseless channel transmitting discrete symbols, the capacity of the communication channel is C bit/s, the source of entropy (or information) produces signals H bit/s, then by devising proper coding procedures for the transmitter it is possible to transmit symbols over the channel at an average rate which is nearly $\frac{C}{H}$, but which can never be made to exceed $\frac{C}{H}$.
The best transmitter, in fact, is that which codes the message in such a way that the signal has just those optimum statistical characteristics which are best suited to the channel to be used -- which in fact maximize the signal entropy and make it equal to the capacity C of the channel.
It is generally true that when there is noise, the received signal exhibits greater information -- or better, the received signal is selected out of a more varied set than is the transmitted signal.
Uncertainty which arises by virtue of freedom of choice on the part of the sender is desirable uncertainty. Uncertainty which arises because of errors or because of the influence of noise is undesirable uncertainty.
Equivocation is the entropy of the message relative to the signal, which measures the average uncertainty in the message when the signal is known.
The capacity of a noisy channel is defined to be the maximum rate (in bits per second) at which useful information (total uncertainty minus noise uncertainty) can be transmitted over the channel.
Suppose that a noisy channel has a capacity C, suppose it is accepting from an information source characterized by an entropy of $H(x)$ bit/s, the entropy of the received signals being $H(y)$ bit/s. Then,
For a channel of frequency bandwidth W, when the average power used in transmitting is P, the channel being subject to a noise of power N, this noise being “white thermal noise” of a special kind which Shannon defines. Then it is possible, by the best coding, to transmit binary digits at the rate of (maximum capacity) $$C = W \log \frac{P+N}{N}$$ and have an arbitrarily low frequency of error.
The most important ones:
Note on each axioms:
Entropy of a discrete random variable X is the expectation of its self-information $I(X)$.
$$H(X) = \operatorname{E}_X [I(X)] = \operatorname{E}_X [-\log_2 P(X)]$$
Self-information $I(X)$ is a random variable that indicates the entropy contribution of an individual message X.
$$I(X) = -\log_2 P(X)$$
The unit of information is bit, while 'nat' and 'hartley' may also be used in non-standard definitions.
Note:
Joint entropy of two discrete r.v.’s X and Y is the entropy of r.v. $(X, Y)$.
$$H(X,Y) = \operatorname{E}_{X,Y} [-\log_2 P(X,Y)]$$
It can be shown that $H(X,Y) \leq H(X) + H(Y)$ with equality if and only if X and Y are independent.
Conditional entropy of r.v. X given r.v. Y is the average conditional entropy of X given Y.
$$H(X|Y) = \operatorname{E}_Y [H(X|Y=y)]$$
It can be shown that $$H(X|Y) = H(X,Y) - H(Y)$$
Mutual information $I(X;Y)$ of X relative to Y is the amount of information that can be obtained about one random variable by observing another, in other words, the information conveyed about X by Y.
$$I(X;Y) = H(X) - H(X|Y)$$
Note:
Figure: Individual, joint, conditional and mutual entropies for a pair of correlated subsystems X,Y
Source coding is the process of taking the source data and making it smaller. While the idea of Channel coding is to find codes that transmit quickly, contain many valid code words and can correct or at least detect many errors. (noise & redundancy)
Information rate is the average entropy per symbol.
Channel capacity is $$C = \sup_{P_X} I(X;Y)$$
Let a source have entropy H (bits per symbol) and a channel have a capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate $\frac{C}{H} - e$ symbols per second over the channel where e is arbitrarily small. It is not possible to transmit at an average rate greater than $\frac{C}{H}$.
Note: The essential content of this theorem is that the average number of “yes or no” questions needed to specify the value of X can never be less than the uncertainty (measure) of X.
Let a discrete channel have the capacity C and a discrete source the entropy per second H. If $H \leq C$ there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors (or an arbitrarily small equivocation). If $H>C$ it is possible to encode the source so that the equivocation is less than $H-C+e$ where e is arbitrarily small. There is no method of encoding which gives an equivocation less than $H-C$.
In the discrete case using base two logarithms, the reduced Gibbs entropy is equal to the minimum number of yes/no questions needed to be answered in order to fully specify the microstate, given that we know the macrostate.
In fact one can generalise: any information that has a physical representation must somehow be embedded in the statistical mechanical degrees of freedom of a physical system.
When information is physical, all processing of its representations, i.e. generation, encoding, transmission, decoding and interpretation, are natural processes where entropy increases by consumption of free energy.