
    Information Entropy

    In information theory, the Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable.

Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must be sent to communicate the true value of the random variable to a recipient. Intuitively, it measures how many yes/no questions must be answered, on average, to communicate each new outcome of the random variable. This represents a fundamental mathematical limit on the best possible lossless compression of any communication: considering a message as a series of realisations of a random variable (where the random variable can be "the next letter", or "the next word", or any other subdivision of the text), the minimum number of bits necessary to transmit the message is the Shannon entropy of the corresponding random variable, multiplied by the number of realisations in the original text.

ASCII characters, if chosen uniformly at random, have an entropy of exactly 7 bits per character. However, if one knows that some characters are chosen more frequently than others, then the uncertainty is lower (if asked to bet on the next outcome, we would bet preferentially on the most frequent characters), and thus the Shannon entropy is lower. A long string of repeating characters has an entropy of 0, since every character is predictable. The entropy of English text is between 1.0 and 1.5 bits per letter.
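
As a rough sanity check of these numbers, here is a small Python sketch (the distributions are made up for illustration) that evaluates H = -sum(p * log2(p)) for a few character distributions:

from math import log2

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)); zero-probability outcomes contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1 / 128] * 128))    # 128 equally likely ASCII characters -> exactly 7.0 bits
print(entropy([1.0]))              # one character repeated forever      -> 0.0 bits
print(entropy([0.5, 0.25, 0.25]))  # a skewed 3-character alphabet       -> 1.5 bits, less than log2(3)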

Equivalently, the Shannon entropy is a measure of the average information content the recipient is missing when they do not know the value of the random variable.

The concept was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".

/*

In information science, entropy represents the amount of information: the more information there is, the larger the entropy. Take the question "who will win the World Cup?" as an example. If all 32 teams are equally likely to win, the entropy of this piece of information is 5: the champion can be encoded with 5 binary digits, or equivalently it takes 5 yes/no questions to get the answer. For instance: "Is it one of the first 16 teams?" Yes. "Is it one of the first 8 of those 16?" No. And so on. Seen as binary numbers, the 32 teams are labelled 00000 through 11111, which takes 5 bits. Of course, if the teams are not equally likely to win, the entropy will be less than 5: by asking about the strong teams first, one can usually get the answer in fewer than 5 questions, and, taking a hint from Huffman coding, the 32 teams can be represented with an average code length of less than 5 bits.

 

From another angle, entropy as a measure of the amount of information is also a measure of expressive power, or of uncertainty and disorder. Why use a logarithm to define entropy? If X has expressive power H(X) and Y has expressive power H(Y), then the number of outcomes of the pair grows as the product of the individual counts, while the information they carry should add. The logarithm is exactly the function that turns products into sums, which gives H(X, Y) = H(X) + H(Y) when X and Y are independent.

*/
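
To make the World Cup example concrete, the sketch below (the skewed probabilities are invented for illustration) compares the entropy of a uniform 32-team distribution with that of a skewed one, and uses a simple Huffman construction to show that the average code length can indeed drop below 5 bits:

import heapq
from math import log2

def entropy(probs):
    # H = -sum(p * log2(p)) in bits
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_code_lengths(probs):
    # Build a Huffman tree and return the code length assigned to each symbol.
    # Heap entries are (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:              # merging pushes every symbol one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

uniform = [1 / 32] * 32                                                    # 32 equally likely teams
skewed = [0.2, 0.15, 0.12, 0.1, 0.08, 0.05, 0.05, 0.05] + [0.2 / 24] * 24  # a few favourites
avg_len = sum(p * l for p, l in zip(skewed, huffman_code_lengths(skewed)))

print(entropy(uniform))   # exactly 5.0 bits -> five yes/no questions
print(entropy(skewed))    # about 3.9 bits
print(avg_len)            # entropy(skewed) <= average Huffman length < 5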

     Definition

The information entropy of a discrete random variable X, which can take on the possible values {x1, ..., xn}, is

H(X) = E[I(X)] = - sum over i = 1, ..., n of p(xi) log2 p(xi)

where

I(X) is the information content or self-information of X, which is itself a random variable; and
p(xi) = Pr(X = xi) is the probability mass function of X.
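
A minimal Python sketch of this definition (the coin example is mine, not from the text): the self-information I(x) = -log2 p(x) is the "surprise" of a single outcome, and the entropy is its expected value under p.

from math import log2

def self_information(p):
    # I(x) = -log2 Pr(X = x): how surprising an outcome with probability p is
    return -log2(p)

def entropy(pmf):
    # H(X) = E[I(X)] = -sum_i p(xi) * log2 p(xi), in bits
    return sum(p * self_information(p) for p in pmf.values() if p > 0)

print(entropy({"heads": 0.5, "tails": 0.5}))  # fair coin: 1.0 bit, the maximum for two outcomes
print(entropy({"heads": 0.9, "tails": 0.1}))  # biased coin: about 0.47 bits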