13
\$\begingroup\$

Here is an implementation of the cryptographic hash function SHA1 written in Python. It does not use any external libraries, only built-in functions. I know that it would be faster to use an external library, but my code is for learning purposes.

I want to know if I am properly implementing the algorithm. Am I taking any unnecessary steps? What changes can I make to speed up the code?

The code works correctly. I've compared the result to a trusted implementation.

def sha1(data):
    bytes = ""
    h0 = 0x67452301
    h1 = 0xEFCDAB89
    h2 = 0x98BADCFE
    h3 = 0x10325476
    h4 = 0xC3D2E1F0
    for n in range(len(data)):
        bytes += '{0:08b}'.format(ord(data[n]))
    bits = bytes + "1"
    pBits = bits
    #pad until length equals 448 mod 512
    while len(pBits) % 512 != 448:
        pBits += "0"
    #append the original length
    pBits += '{0:064b}'.format(len(bits) - 1)

    def chunks(l, n):
        return [l[i:i+n] for i in range(0, len(l), n)]

    def rol(n, b):
        return ((n << b) | (n >> (32 - b))) & 0xffffffff

    for c in chunks(pBits, 512):
        words = chunks(c, 32)
        w = [0] * 80
        for n in range(0, 16):
            w[n] = int(words[n], 2)
        for i in range(16, 80):
            w[i] = rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1)
        a = h0
        b = h1
        c = h2
        d = h3
        e = h4
        #Main loop
        for i in range(0, 80):
            if 0 <= i <= 19:
                f = (b & c) | ((~b) & d)
                k = 0x5A827999
            elif 20 <= i <= 39:
                f = b ^ c ^ d
                k = 0x6ED9EBA1
            elif 40 <= i <= 59:
                f = (b & c) | (b & d) | (c & d)
                k = 0x8F1BBCDC
            elif 60 <= i <= 79:
                f = b ^ c ^ d
                k = 0xCA62C1D6
            temp = rol(a, 5) + f + e + k + w[i] & 0xffffffff
            e = d
            d = c
            c = rol(b, 30)
            b = a
            a = temp
        h0 = h0 + a & 0xffffffff
        h1 = h1 + b & 0xffffffff
        h2 = h2 + c & 0xffffffff
        h3 = h3 + d & 0xffffffff
        h4 = h4 + e & 0xffffffff
    return '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)

print sha1("hello world")
\$\endgroup\$
4
  • 2
    \$\begingroup\$docs.python.org/2/library/hashlib.html\$\endgroup\$ Commented Dec 18, 2013 at 2:47
  • 1
    \$\begingroup\$@AlexBuchanan I said that my code was not for practical use.\$\endgroup\$
    – kyle k
    Commented Dec 18, 2013 at 5:09
  • 1
    \$\begingroup\$That's what the reinventing-the-wheel tag is for. I've added it for you.\$\endgroup\$ Commented Dec 18, 2013 at 9:51
  • \$\begingroup\$Ah, my bad, I must have skimmed over that line.\$\endgroup\$ Commented Dec 18, 2013 at 17:26

3 Answers

11
\$\begingroup\$

I'm no expert on SHA1, and I see mostly small things, but one big thing gave me pause. As an overall observation, you seem to value readability over memory usage. Since hashes are often applied to large files, this is a risky tradeoff, and Python provides good ways to handle it. More on this in the third observation below. On to the observations:

  • pBits being represented as a string with one character per bit seems unusual for such low-level work. I would typically expect a hexadecimal string, or even a number. Still, aside from possible difficulties comparing it to other implementations that favor a more compact representation, this does make things easy to read.
  • Filling out pBits to a length congruent to 448 mod 512 could be done in a single step: figure out how many bits of padding are necessary and do a single pBits += "0" * padding; there's no need for a while loop.
  • This one's important: chunks() should probably be a generator so that you don't keep three copies of your data around (the original, one expanded to a character per bit, and one in chunked strings); correspondingly, it could also serve out the trailing zero and length bits. Let's examine what this would look like, since three copies of large data get really large. Removing the sliced copies and creating only one chunk at a time:

    def iterchunks(bits, chunkbits=512):
        for i in range(0, len(bits), chunkbits):
            yield bits[i:i + chunkbits]

    Note that if the pBits generation happened here (checking i + chunkbits against len(bits)), along with the bytes-to-bits conversion, you could process a file-like stream instead of a string. That takes you further down, from the original three copies (two of them eight times larger than the raw data) to never holding a full copy of your data in memory at one time.

  • Back to small ideas. Your bit string could be more literally a bit string, storing \0 and \1 characters. This would let you simplify setting up w into w = map(ord, words). (Of course with the appropriate helper, you could do that with the 0 and 1 characters of your current string.)
  • I'm not a big fan of the for i in range(0, 80): if 0 <= i <= 19: ... section. It feels like it might read better with a series of for loops that do their different thing then call a helper:

    for i in range(0, 20):
        f = (b & c) | ((~b) & d)
        k = 0x5A827999
        a, b, c, d, e = update(a, b, c, d, e, f, k, w[i])
    for i in range(20, 40):
        f = b ^ c ^ d
        k = 0x6ED9EBA1
        a, b, c, d, e = update(a, b, c, d, e, f, k, w[i])
    ...

    but that seems pretty ridiculous, and would probably hurt performance to boot. After looking more closely at the obvious alternative, I'm going to withdraw this complaint.
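As a sketch of the streaming idea from the third observation (helper name is made up; written for Python 3 bytes and a binary file-like object, unlike the Python 2 code under review), a generator can do the padding in one step and never hold more than one block plus the read buffer in memory:

```python
import struct

def padded_blocks(stream, blocksize=64):
    """Yield 64-byte SHA-1 message blocks from a binary stream,
    appending the 0x80 marker byte, the zero padding and the
    64-bit big-endian bit length, without buffering the whole input."""
    length = 0
    while True:
        chunk = stream.read(blocksize)
        length += len(chunk)
        if len(chunk) < blocksize:
            break
        yield chunk
    tail = chunk + b'\x80'
    if len(tail) > 56:  # no room left for the length field in this block
        yield tail + b'\x00' * (64 - len(tail))
        tail = b''
    yield tail + b'\x00' * (56 - len(tail)) + struct.pack('>Q', length * 8)
```

For example, `list(padded_blocks(io.BytesIO(b'abc')))` yields a single 64-byte block that starts with `b'abc\x80'` and ends with the bit length 24.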

\$\endgroup\$
    7
    \$\begingroup\$

    Serializing the data into a string of '1' and '0' characters, then parsing the string back into numbers, seems incredibly wasteful. You should aim to do the whole thing without stringifying the numbers. I've done it using struct.pack() and struct.unpack(), though you could also use basic functions like ord() with more bit manipulation.

    I've kept everything starting from the main loop about the same. The minor changes I've made are to use parallel assignments and inclusive-exclusive ranges, both of which are in the spirit of Python.

    from struct import pack, unpack

    def sha1(data):
        """ Returns the SHA1 sum as a 40-character hex string """
        h0 = 0x67452301
        h1 = 0xEFCDAB89
        h2 = 0x98BADCFE
        h3 = 0x10325476
        h4 = 0xC3D2E1F0

        def rol(n, b):
            return ((n << b) | (n >> (32 - b))) & 0xffffffff

        # After the data, append a '1' bit, then pad data to a multiple of 64 bytes
        # (512 bits). The last 64 bits must contain the length of the original
        # string in bits, so leave room for that (adding a whole padding block if
        # necessary).
        padding = chr(128) + chr(0) * (55 - len(data) % 64)
        if len(data) % 64 > 55:
            padding += chr(0) * (64 + 55 - len(data) % 64)
        padded_data = data + padding + pack('>Q', 8 * len(data))

        thunks = [padded_data[i:i+64] for i in range(0, len(padded_data), 64)]
        for thunk in thunks:
            w = list(unpack('>16L', thunk)) + [0] * 64
            for i in range(16, 80):
                w[i] = rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1)
            a, b, c, d, e = h0, h1, h2, h3, h4
            # Main loop
            for i in range(0, 80):
                if 0 <= i < 20:
                    f = (b & c) | ((~b) & d)
                    k = 0x5A827999
                elif 20 <= i < 40:
                    f = b ^ c ^ d
                    k = 0x6ED9EBA1
                elif 40 <= i < 60:
                    f = (b & c) | (b & d) | (c & d)
                    k = 0x8F1BBCDC
                elif 60 <= i < 80:
                    f = b ^ c ^ d
                    k = 0xCA62C1D6
                a, b, c, d, e = rol(a, 5) + f + e + k + w[i] & 0xffffffff, \
                                a, rol(b, 30), c, d
            h0 = h0 + a & 0xffffffff
            h1 = h1 + b & 0xffffffff
            h2 = h2 + c & 0xffffffff
            h3 = h3 + d & 0xffffffff
            h4 = h4 + e & 0xffffffff
        return '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)
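    The ord()-with-bit-manipulation alternative mentioned above might look like this (a sketch with a made-up helper name; shown via bytearray, which yields byte values as integers in both Python 2 and 3 — in Python 2 you could equally apply ord() to each character):

```python
def be_words(chunk):
    """Split a 64-byte chunk into 16 big-endian 32-bit words
    without using the struct module."""
    bs = bytearray(chunk)  # byte values as ints
    return [(bs[i] << 24) | (bs[i+1] << 16) | (bs[i+2] << 8) | bs[i+3]
            for i in range(0, 64, 4)]
```

    This produces the same 16 words as `unpack('>16L', chunk)`; struct is simply less code and faster.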
    \$\endgroup\$
    2
    • 3
      \$\begingroup\$You could avoid the if in the calculation of the padding if you wrote 63 - (len(data) + 8) % 64.\$\endgroup\$ Commented Dec 18, 2013 at 12:14
    • \$\begingroup\$FYI same algorithm using the hashlib.sha1 streaming interface (.update and .digest methods): github.com/pts/tinyveracrypt/blob/…\$\endgroup\$
      – pts
      Commented Oct 31, 2019 at 20:06
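The branch-free expression from the first comment can be checked against the if-based padding computation in the answer (a quick sketch; the helper names are purely illustrative):

```python
def padding_len_branchy(n):
    # zero bytes between the 0x80 marker and the 8-byte length field,
    # mirroring chr(0) * (55 - n % 64) (a negative count gives '')
    pad = max(0, 55 - n % 64)
    if n % 64 > 55:
        pad += 64 + 55 - n % 64
    return pad

def padding_len_closed(n):
    return 63 - (n + 8) % 64

# identical for every message length, and the padded total
# (data + 0x80 + zeros + 8-byte length) is always a multiple of 64
assert all(padding_len_branchy(n) == padding_len_closed(n) for n in range(1000))
assert all((n + 1 + padding_len_closed(n) + 8) % 64 == 0 for n in range(1000))
```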
    6
    \$\begingroup\$
    1. There's no documentation. What does this code do and how am I supposed to use it?

    2. There are no test cases. Since the hashlib module has a sha1 function, it would be easy to write a test case that generates some random data and hashes it with both algorithms. For example:

      import hashlib
      from random import randrange
      from unittest import TestCase

      class TestSha1(TestCase):
          def test_sha1(self):
              for l in range(129):
                  data = bytearray(randrange(256) for _ in range(l))
                  self.assertEqual(hashlib.sha1(data).hexdigest(), sha1(data))
    3. You start by converting the message to a string containing its representation in binary. But then later on you convert it back to integers again. It would be simpler and faster to work with bytes throughout. Python provides the bytearray type for manipulating sequences of bytes and the struct module for interpreting sequences of bytes as binary data. So you could initialize and pad the message like this:

      message = bytearray(data)
      # Append a 1-bit.
      message.append(0x80)
      # Pad with zeroes until message length in bytes is 56 (mod 64).
      message.extend([0] * (63 - (len(message) + 7) % 64))
      # Append the original length (big-endian, 64 bits).
      message.extend(struct.pack('>Q', len(data) * 8))

      and then convert the chunks to 32-bit integers using struct.unpack like this:

      for chunk in range(0, len(message), 64):
          w = list(struct.unpack('>16L', message[chunk: chunk+64]))
          for i in range(16, 80):
              w.append(rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1))

      This runs substantially faster than your code.
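      To convince yourself the two approaches agree (and to time them with timeit, if you like), here is a quick sketch — the function names are made up for illustration:

```python
import struct

def words_via_bitstring(msg):
    # the original approach: serialize to '0'/'1' characters, then re-parse
    bits = ''.join('{0:08b}'.format(b) for b in bytearray(msg))
    return [int(bits[i:i+32], 2) for i in range(0, len(bits), 32)]

def words_via_struct(msg):
    # unpack big-endian 32-bit words directly from the bytes
    return list(struct.unpack('>%dL' % (len(msg) // 4), msg))

sample = bytes(range(64))
assert words_via_bitstring(sample) == words_via_struct(sample)
```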

    4. I would write the conditions in the main loop as 0 <= i < 20, 20 <= i < 40 and so on, to match the half-open intervals that Python uses elsewhere (in list slices, range and so on).

    5. I would replace:

      elif 60 <= i <= 79: 

      by

      else:
          assert 60 <= i < 80

      to make it clear that this case must match.

    6. You can avoid the need for temp by using a tuple assignment:

      a, b, c, d, e = ((rol(a, 5) + f + e + k + w[i]) & 0xffffffff,
                       a, rol(b, 30), c, d)
    \$\endgroup\$
