
I wrote a Python (2.7) script that compares two files byte by byte. filecmp was not suitable for this application because it compares metadata as well. How can I improve this code?

```python
import os

def byte_comp(fname, fname2):
    """
    Compares two files byte by byte
    Returns True if the files match, False otherwise
    """
    read_size = 1048576  # Determined experimentally
    if os.path.getsize(fname) != os.path.getsize(fname2):
        # If two files do not have the same size, they cannot be the same
        return False
    with open(fname, "rb") as f:
        with open(fname2, "rb") as f2:
            count = 0  # Counts how many bytes have been read
            while count <= os.path.getsize(fname):
                # Loops until the whole file has been read
                if f.read(read_size) != f2.read(read_size):
                    # If a chunk of the file is not the same, return False
                    return False
                count += read_size
    return True  # If the files are the same, return True
```

I would also appreciate help on how to make the function faster and less CPU intensive.

  • Is this Python 2 or Python 3? I'm asking because they offer different (buffered) byte I/O interfaces.
    Commented Jul 24, 2017 at 8:21
  • @DavidFoerster Python 2
    – jkd
    Commented Jul 24, 2017 at 8:24

2 Answers


Suggestions:

  • Run this through a PEP8 linter. One thing that sticks out: PEP8 wants two spaces before inline comments.
  • Combine your withs: with open(fname, 'rb') as f, open(fname2, 'rb') as f2
  • The count/increment/compare approach is awkward and a C smell. Instead of that, simply read until you get an empty bytes object, which indicates EOF.
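Putting those three suggestions together, the loop might look something like the following. This is a sketch, not the poster's code; the demo file names and the temp directory are made up for illustration:

```python
import os
import tempfile

def byte_comp(fname, fname2, read_size=1048576):
    """Return True if the two files have identical contents."""
    if os.path.getsize(fname) != os.path.getsize(fname2):
        return False  # Files of different sizes cannot match
    with open(fname, "rb") as f, open(fname2, "rb") as f2:
        while True:
            chunk = f.read(read_size)
            if chunk != f2.read(read_size):
                return False
            if not chunk:  # read() returned b"": EOF reached on both files
                return True

# Throwaway demo files in a temp directory (names are illustrative).
tmp = tempfile.mkdtemp()
a, b, c = (os.path.join(tmp, name) for name in ("a.bin", "b.bin", "c.bin"))
with open(a, "wb") as fh:
    fh.write(b"x" * 2500000)
with open(b, "wb") as fh:
    fh.write(b"x" * 2500000)
with open(c, "wb") as fh:
    fh.write(b"x" * 2499999 + b"y")  # same size, last byte differs

same = byte_comp(a, b)
different = byte_comp(a, c)
```

Note there is no count to maintain: the empty-chunk check ends the loop exactly when both files are exhausted.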

"Resource-intensive" depends on which resource. If you decrease your read_size you might save a little memory, but on most systems, save for embedded ones, the saving is negligible and your buffer size is fine.

If your constraining resource is time/CPU occupation, you could potentially parallelize this by reading both files at once and periodically joining to compare buffers. This might actually hurt performance on a hard disk, since seeking back and forth between the files is slow, but it might help if you have an SSD. Testing is the only good way to know for sure.
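One way to sketch that idea uses concurrent.futures, which is standard library in Python 3 and available to 2.7 via the futures backport. The function name and demo paths here are my own invention, not from the post:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def byte_comp_parallel(fname, fname2, read_size=1048576):
    """Read both files at once: one on a worker thread, one on this
    thread, then join to compare each pair of chunks."""
    if os.path.getsize(fname) != os.path.getsize(fname2):
        return False
    with open(fname, "rb") as f, open(fname2, "rb") as f2, \
            ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            pending = pool.submit(f2.read, read_size)  # second file, background
            chunk = f.read(read_size)                  # first file, this thread
            if chunk != pending.result():              # join, then compare
                return False
            if not chunk:  # EOF on both files
                return True

# Demo on throwaway files in a temp directory (paths are illustrative).
tmp = tempfile.mkdtemp()
p1, p2 = os.path.join(tmp, "left"), os.path.join(tmp, "right")
with open(p1, "wb") as fh:
    fh.write(b"spam" * 500000)
with open(p2, "wb") as fh:
    fh.write(b"spam" * 500000)

result = byte_comp_parallel(p1, p2)
```

Whether the background read overlaps usefully with the foreground one depends entirely on the storage device, which is why measuring is essential.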

Depending on a few factors, it might be faster not to do simultaneous I/O at all, and instead to compute a checksum or hash for the first file, then the second, and compare the two. This has the advantage of greatly reducing the number of actual comparisons, and doing I/O in a more serial manner, but has the disadvantage of not being able to fail early in the middle of a file when a difference is found.
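A sketch of the checksum approach using hashlib from the standard library; the function names and demo files are illustrative, not from the post:

```python
import hashlib
import os
import tempfile

def file_digest(fname, read_size=1048576):
    """SHA-256 of a file, computed chunk by chunk so the whole
    file is never held in memory."""
    digest = hashlib.sha256()
    with open(fname, "rb") as f:
        while True:
            chunk = f.read(read_size)
            if not chunk:
                return digest.digest()
            digest.update(chunk)

def files_match(fname, fname2):
    """Cheap size check first, then one digest per file."""
    if os.path.getsize(fname) != os.path.getsize(fname2):
        return False
    return file_digest(fname) == file_digest(fname2)

# Demo files in a temp directory (names are illustrative).
tmp = tempfile.mkdtemp()
a, b = os.path.join(tmp, "a"), os.path.join(tmp, "b")
with open(a, "wb") as fh:
    fh.write(b"hello world\n" * 100000)
with open(b, "wb") as fh:
    fh.write(b"hello world\n" * 100000)

result = files_match(a, b)
```

Hashing also pays off when the same file must be compared against many others: each file is read once and its digest reused, instead of re-reading it for every pairwise comparison.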

All of this aside: unless your files are on the order of GB, this is likely premature optimization and isn't worth the effort.

  • Thanks. I meant CPU-intensive as I have more than enough memory. I have an HDD so I won't parallelize it. While the files aren't large, there are a lot of files to compare.
    – jkd
    Commented Jul 24, 2017 at 4:41
  1. The code in the post calls os.path.getsize for each block read from the two files. This could be avoided by remembering the value in a local variable.

  2. The code accumulates count in order to detect the end of the files being compared. But the read method will return an empty bytes object when the end of file is reached, so there is no need to maintain a count.

    If you look at the implementation of the filecmp module, you'll see that file comparison is implemented like this:

    ```python
    def _do_cmp(f1, f2):
        bufsize = BUFSIZE
        with open(f1, 'rb') as fp1, open(f2, 'rb') as fp2:
            while True:
                b1 = fp1.read(bufsize)
                b2 = fp2.read(bufsize)
                if b1 != b2:
                    return False
                if not b1:
                    return True
    ```
  • Thank you. I didn't realize that was how filecmp implemented it. The documentation said something about checking os.stat and I didn't want that. I had problems with it always returning False even when I set shallow=False. I'll try using filecmp again.
    – jkd
    Commented Jul 24, 2017 at 7:23
  • The documentation is right — filecmp.cmp calls os.stat on the two files, and if the results are equal it then calls _do_cmp.
    Commented Jul 24, 2017 at 7:24
  • In fact the file may have no size at all (yet still contain a finite byte stream) if it happens to be a FIFO or a special character device.
    Commented Jul 24, 2017 at 8:17
  • @jakekimdsΨ Comparing stat output is much faster than resorting to byte-by-byte comparison. The faster a difference can be found between the files, the better.
    – Alexander
    Commented Jul 24, 2017 at 16:44
  • @jakekimdsΨ Not all stat metadata is compared. Total size is the most important thing to check. If the sizes are different, the byte-for-byte comparison can be short-circuited entirely. Creation/modification dates and other stuff like that are ignored.
    – Alexander
    Commented Jul 24, 2017 at 17:02
