How to calculate the entropy of a file?

0 votes
asked Jun 13, 2009 by ivan-ivanovich-ivano

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My idea is the following:

  • Create an array of 256 integers (all zeros).
  • Traverse through the file and for each of its bytes,
    increment the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array's entries:
    add the entry's difference to "average" to the counter.

Well, now I'm stuck. How to "project" the counter result in such a way that all results would lie between 0.0 and 1.0? But I'm sure the idea is inconsistent anyway...

I hope someone has better and simpler solutions?

Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)

11 Answers

0 votes
answered Jan 13, 2009 by bayer

If you use information-theoretic entropy, bear in mind that it might make sense not to use it on bytes. Say, if your data consists of floats, you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.

Or, if the contents of the file are Unicode characters, you should use those, etc.
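
For instance, here is a minimal Python sketch of that idea (the function name, bin count, and binning scheme are my own assumptions, not something stated in this answer): bucket the floats into a histogram and take the Shannon entropy of the bucket frequencies.

import math
from collections import Counter

def float_entropy_bits(values, bins=32):
    # Bin the floats into a fixed number of buckets and treat the bucket
    # frequencies as an estimated probability distribution.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # all-equal values: avoid dividing by zero
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    total = len(values)
    # Shannon entropy (in bits) of the estimated distribution.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())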

0 votes
answered Jan 13, 2009 by adam-rosenfield

There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).

In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.
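
As a tiny illustration of that point (my own snippet, not part of the answer): a distribution with a single certain outcome has zero entropy, while a uniform distribution over all 256 byte values has 8 bits.

import math

def shannon_entropy_bits(probs):
    # Shannon entropy, in bits, of a discrete probability distribution.
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy_bits([1.0]))          # 0.0 -- one outcome, probability 1
print(shannon_entropy_bits([1/256] * 256))  # 8.0 -- uniform over all byte values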

0 votes
answered Jun 13, 2009 by igor-krivokon

A simpler solution: gzip the file. Use the ratio of file sizes, (size-of-gzipped)/(size-of-original), as a measure of randomness (i.e. entropy).

This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare the entropy of different sources.
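
A rough Python sketch of this approach (the function name is mine; note that gzip's header overhead means the ratio can slightly exceed 1.0 for incompressible data):

import gzip

def gzip_entropy_estimate(path):
    # Compressed size divided by original size: roughly 0 (very redundant)
    # to 1 (close to random), plus a little gzip header overhead.
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        return 0.0
    return len(gzip.compress(data)) / len(data)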

0 votes
answered Jun 13, 2009 by nick-dandoulakis
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array's entries: add the entry's difference to "average" to the counter.

With some modifications you can get Shannon's entropy:

rename "average" to "entropy"

float entropy = 0;
for (int i = 0; i < 256; i++)
{
    float p = (float) Counts[i] / filesize;
    if (p > 0) entropy = entropy - p * log2(p); // log2 is the logarithm with base 2
}

Edit: As Wesley mentioned, we must divide the entropy by 8 in order to bring it into the range 0..1 (or, alternatively, we can use logarithm base 256).

0 votes
answered Jun 13, 2009 by wesley

To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)

The following variables are assumed to already exist:

  • byte_counts is a 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.

  • total is the total number of bytes in your file.

I'll write the following code in Python, but it should be obvious what's going on.

import math

entropy = 0

for count in byte_counts:
    # If no bytes of this value were seen in the file, it doesn't affect
    # the entropy of the file.
    if count == 0:
        continue
    # p is the probability of seeing this byte in the file, as a floating-
    # point number
    p = 1.0 * count / total
    entropy -= p * math.log(p, 256)

There are several things that are important to note.

  • The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.

  • The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.

The resulting value will range from 0 (every single byte in the file is the same) to 1 (the bytes are evenly divided among every possible byte value).


An explanation for the use of log base 256

It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list in which every entry is the same, e.g. all 1s, all 2s, or all 100s. When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.

It is possible to use other logarithm bases. Using b=2 allows a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal digits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.

Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.
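
A quick sanity check of that identity (my own snippet, not part of the original answer):

import math

p = 0.125
print(math.log(p, 2) / 8)   # -0.375
print(math.log(p, 256))     # -0.375 (up to floating-point rounding)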

In summary:

  • You can use various units to express entropy
  • Most people express entropy in bits (b=2)
    • For a collection of bytes, this gives a maximum entropy of 8 bits
    • Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
  • The algorithm above calculates entropy in bytes (b=256)
    • This is equivalent to (entropy in bits) / 8
    • This already gives a value between 0 and 1
0 votes
answered Jun 13, 2009 by peter-kovacs

Is this something that ent could handle? (Or perhaps it's not available on your platform.)

$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...

As a counterexample, here is a file with no entropy.

$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
0 votes
answered Jun 21, 2011 by jeff-atwood

For what it's worth, here's the traditional (bits of entropy) calculation represented in C#:

/// <summary>
/// returns bits of entropy represented in a given string, per 
/// http://en.wikipedia.org/wiki/Entropy_(information_theory) 
/// </summary>
public static double ShannonEntropy(string s)
{
    var map = new Dictionary<char, int>();
    foreach (char c in s)
    {
        if (!map.ContainsKey(c))
            map.Add(c, 1);
        else
            map[c] += 1;
    }

    double result = 0.0;
    int len = s.Length;
    foreach (var item in map)
    {
        var frequency = (double)item.Value / len;
        result -= frequency * (Math.Log(frequency) / Math.Log(2));
    }

    return result;
}
0 votes
answered Jan 31, 2013 by user2622016

Without any additional information, the entropy of a file is (by definition) equal to its size * 8 bits. The entropy of a text file is roughly size * 6.6 bits, given that:

  • each character is equally probable
  • there are 95 printable characters in a byte
  • log(95)/log(2) = 6.6

The entropy of English text is estimated to be around 0.6 to 1.3 bits per character (as explained here).
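
The log(95)/log(2) figure is easy to check (my own one-liner):

import math
print(math.log2(95))   # ~6.57 bits per printable character, rounded to 6.6 above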

In general, you cannot talk about the entropy of a given file. Entropy is a property of a set of files.

If you need an entropy estimate (or entropy per byte, to be exact), the best way is to compress the file using gzip, bz2, rar or any other strong compression, and then divide the compressed size by the uncompressed size. That would be a great estimate of entropy.

Calculating entropy byte by byte, as Nick Dandoulakis suggested, gives a very poor estimate, because it assumes every byte is independent. In text files, for example, it is much more probable to have a lowercase letter after a letter than whitespace or punctuation after a letter, since words are typically longer than two characters. So the probability of the next character being in the a-z range is correlated with the value of the previous character. Don't use Nick's rough estimate for any real data; use the gzip compression ratio instead.
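
One way to see the difference (an illustrative snippet of my own, not from this answer): on highly repetitive text the byte-wise estimate stays high because it ignores the dependence between neighbouring bytes, while the gzip ratio drops sharply.

import gzip
import math
from collections import Counter

data = b"the quick brown fox " * 500   # very repetitive input

counts = Counter(data)
bytewise = -sum(c / len(data) * math.log2(c / len(data)) for c in counts.values())
gzip_ratio = len(gzip.compress(data)) / len(data)

print(bytewise / 8)   # roughly 0.5: the byte-wise estimate misses the repetition
print(gzip_ratio)     # far smaller: gzip exploits the repeated structure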

0 votes
answered Jan 10, 2014 by iggy-pop

Calculates the entropy of any string of unsigned chars of size "length". This is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy. I use this for a 64-bit IV generator that creates a container of 100000000 IVs with no dupes and an average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html

#include <cmath>
#include <map>

typedef unsigned char uint8;

// Shannon entropy (bits per byte) of a buffer of unsigned chars.
double Calculate(const uint8 *input, int length)
{
    // Count how often each byte value occurs.
    std::map<uint8, int> frequencies;
    for (int i = 0; i < length; ++i)
        frequencies[input[i]]++;

    // Sum -p * log2(p) over the observed byte values.
    double infocontent = 0;
    for (const auto &p : frequencies)
    {
        double freq = static_cast<double>(p.second) / length;
        infocontent -= freq * log2(freq);
    }
    return infocontent;
}
0 votes
answered Jan 11, 2016 by nealmcb

Re: I need the whole thing to make assumptions on the file's contents: (plaintext, markup, compressed or some binary, ...)

As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.

jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That links eventually to libdisorder (a C library for measuring byte entropy). That approach would seem to give you lots more information to work with, since it shows how the metric entropy varies in different parts of the file. See e.g. this graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and end the entropy is lower, as it is part-way in, but it is about 7 bits per byte for most of the file.

Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]
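
A small Python sketch of that kind of scan (my own, not taken from libdisorder; it steps through the file in fixed 256-byte blocks rather than sliding byte by byte):

import math
from collections import Counter

def block_entropies(path, block_size=256):
    # Yield (offset, bits-per-byte entropy) for each consecutive block of the file.
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts = Counter(block)
            n = len(block)
            entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
            yield offset, entropy
            offset += n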

More interesting are the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY

Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.

This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer
