MD5 Length Extension Attack Revisited

2 years ago, Thai Duong and Juliano Rizzo  published a paper about a security hole of Flickr's API which later become one of 10 best web hacking technique of 2009

I am very interested in their research, however I don't have time to read them until recently.

To summarize, the authors smartly use the widely known (among cryptographer) attack on MD5 namely "Length-Extension Attack" to forge a signature of Flickr without knowing the secret key.

The signature was vulnerable because It was computed as :

H = md5( SECRET_KEY | DATA )

This is known that given H, without any knowledge of [SECRET-KEY], one can compute H' :

H' = md5( SECRET_KEY | DATA | DATA') 

I have not read the literature of this attack, but by studying how a MD5 string is compute, It's not difficult to reproduce. Given a string M=SECRET_KEY | DATA, the hash function produce md5(M) as follows:

  1. Calculate L = Length(M) in bits
  2. Append 1 then Append 0 until length(M) in bits = 448 (modulo 512)
  3. Append L as 64-bits to the message so that length(M) is multiple of 512
  4. Break M into k chunk of 512 bit or 64-bytes: m[1]..m[k]
  5. Compute H[i=1..k] = F( H[i-1], m[i] ) . H[0] = IV (initial vector - some magic number), F is some black box function that we don't need to care about.
  6. Output Finalize(H[k])

The Finalize() function is just to output H in a readable manner. So suppose we can reverse Finalize(H[k]) into H[k] and somehow add m[k+1],m[k+2] to the message, we can compute H[k+1], H[k+2] (which is equivalent to creating md5(M|m[k+1]|m[k+2])) easily.

So first, let's see how finalize(H[k]) is:

h0,h1,h2,h3 = init_vector
for each 512-bit chunk of message
    var int a := h0
    var int b := h1
    var int c := h2
    var int d := h3
    (a,b,c,d) = F(a,b,c,d, CHUNK)...
    h0 := h0 + a
    h1 := h1 + b
    h2 := h2 + c
    h3 := h3 + d
end
var char digest[16] := h0 append h1 append h2 append h3

The last line is finalize, .. So it just combines h0,h1,h2,h3... which is the output of F() or H[k].
So given an MD5 string, we can reverse it to raw state using following python code

def reverse(md5string):
    h0 = struct.unpack("I",md5string[0:8].decode('hex')) [0]
    h1 = struct.unpack("I",md5string[8:16].decode('hex')) [0]
    h2 = struct.unpack("I",md5string[16:24].decode('hex')) [0]
    h3 = struct.unpack("I",md5string[24:32].decode('hex')) [0]

Now to add m[k+1], m[k+2], we have to pad the message so that the result up to the last round i.e. H[k] shall be the same. Our final message will be as following:

M|1..000|length(M)|DATA'
length(M|1..000|length(M)) mod 512 == 0

To compute the md5 of the message above, we just have to use h0,h1,h2,h3 we just computed as our IV and compute md5 of DATA' based on that IV. To do so, one can rewrite any implementation of md5 to receive a customized initial vectors, and added length(M|1..000|length(M)) (remember length as number of bits) to the length when computing the padding for DATA', let it be md5_special:

To summarize, given H=md5(M=SECRET_KEY | DATA), we can compute:

H'= md5(M|1..000|length(M)|DATA')
= md5_special(reverse(H), length(M|1..000|length(M)), DATA')

In flirkr, this attack is successful because one knows the length of the secret key. In practice, if the length is unknown, one can also brute force by sending multiple messages with different assumptions to see how is the response.

By the time I write this blog article, the bug from Flirkr had been fixed long ago. But there is no evidence that It has been fixed by every application because knowledge may not be known and ignorance are so common.

Update: I decide to publish my POC. You can find it here: https://github.com/danghvu/md5pad