
I am looking for feedback on the function below, which loads a PNG image stored as bytes in MongoDB into NumPy arrays.

    import io

    import numpy as np
    from PIL import Image

    def bytes_to_matricies(image_bytes):
        """Convert image bytes into a Pillow image object and numpy arrays.

        image_bytes: image bytes accessed from MongoDB
        """
        raw_image = Image.open(io.BytesIO(image_bytes))
        greyscale_matrix = np.array(raw_image.convert("L"))
        color_matrix = np.array(raw_image.convert("RGB"))
        n = greyscale_matrix.shape[0]
        m = greyscale_matrix.shape[1]
        return greyscale_matrix, color_matrix, n, m

I have profiled my code with cProfile and found this function to be a big bottleneck. Any suggestions for optimising it would be great. Note that I have compiled most of the project with Cython, which is why you'll see .pyx files; this hasn't made much difference.

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           72  331.537    4.605  338.226    4.698 cleaner.pyx:154(clean_image)
            1  139.401  139.401  139.401  139.401 {built-in method builtins.input}
          356   31.144    0.087   31.144    0.087 {method 'recv_into' of '_socket.socket' objects}
        11253   15.421    0.001   15.421    0.001 {method 'encode' of 'ImagingEncoder' objects}
          706   10.561    0.015   10.561    0.015 {method 'decode' of 'ImagingDecoder' objects}
           72    5.044    0.070    5.047    0.070 {built-in method scipy.ndimage._ni_label._label}
         7853    0.881    0.000    0.881    0.000 cleaner.pyx:216(is_period)
           72    0.844    0.012    1.266    0.018 cleaner.pyx:349(get_binarized_matrix)
           72    0.802    0.011    0.802    0.011 {method 'convert' of 'ImagingCore' objects}
           72    0.786    0.011   13.167    0.183 cleaner.pyx:57(bytes_to_matricies)

If you are wondering how the images are encoded before being written into MongoDB, here is that code:

    def get_encoded_image(filename: str):
        """Binary encodes image."""
        image = filesystem_io.read_as_pillow(filename)  # Just reads file on disk into a Pillow Image object
        stream = io.BytesIO()
        image.save(stream, format='PNG')
        encoded_string = stream.getvalue()
        return encoded_string  # This will be written to MongoDB

Things I have tried:

  1. As mentioned above, I tried compiling with Cython.
  2. I have tried to use the lycon library, but could not see how to load from bytes.
  3. I have tried using Pillow-SIMD. It made things slower.
  4. I am able to use multiprocessing, but I want to optimise the function before I parallelise it.

Thank you!

UPDATE: Answer to questions from Reinderien: The images are photographs of documents. They will eventually be OCR'd. I'm not sure how lossy compression would affect the OCR quality. DPI is 320. Size on disk is ~800 KB each.

  • What is the nature of your input image - size in DB, pixel dimensions, content? Is it something like a graph (lots of continuous colour regions) or a photograph? What is the reason to encode it in PNG? Does your image strictly need to be lossless?
    – Reinderien
    Commented Aug 21, 2021 at 18:00
  • @Reinderien I have updated the question with answers to the above.
    – Neil
    Commented Aug 21, 2021 at 19:45
  • One improvement opportunity I see is to use the YCbCr format rather than RGB. It will save you the cost of the conversion to grey, because the Y channel can directly give you the grey image (sketched below these comments).
    – nkvns
    Commented Aug 22, 2021 at 7:12
  • I removed my previous answer because the results were misleading. After running your code step by step using the codetiming lib, it seems that the bottleneck lies in the grayscale conversion (raw_image.convert("L")). But I believe that test is itself misleading - it is as if subsequent calls to .convert in Pillow were benefiting from some sort of cache. It's also possible the picture I used does not yield representative results. NumPy might be used for grayscale conversion instead of PIL (also sketched below) - see: e2eml.school/convert_rgb_to_grayscale.html
    – Kate
    Commented Aug 22, 2021 at 15:42
  • @Neil For us to test your code (as you have it), we would need some example images. Do you have the code plus sample images, perhaps on GitHub? After we have that, we can run tests in an attempt to optimise. If the images are confidential, see if you can find others similar in form.
    – C. Harley
    Commented Sep 7, 2021 at 23:04
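
A minimal sketch of both conversion ideas from the comments above, assuming Pillow and NumPy; the input file name is hypothetical:

    import numpy as np
    from PIL import Image

    raw_image = Image.open("document.png")  # hypothetical input file

    # nkvns's idea: one YCbCr conversion yields the grey image for free,
    # because the Y (luma) channel is the greyscale representation.
    ycbcr = np.array(raw_image.convert("YCbCr"))
    greyscale_matrix = ycbcr[:, :, 0]

    # Kate's idea: compute luma directly in NumPy from the RGB array
    # using the ITU-R 601 weights, instead of a second Pillow convert().
    rgb = np.array(raw_image.convert("RGB"), dtype=np.float32)
    grey = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1]
            + 0.114 * rgb[..., 2]).astype(np.uint8)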

1 Answer


matricies is not a word; the plural of matrix is matrices.

A crucial step in your pipeline, and one you have only implied, is the actual blob-loading from MongoDB. Let's assume that you use pymongo. You should be using bson.binary and not some intermediate representation like base64. The binary subtype should probably be byte.
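
For example, a minimal sketch assuming pymongo; the client setup, database, and collection names are hypothetical:

    import pymongo
    from bson.binary import Binary

    client = pymongo.MongoClient()  # assumes a local MongoDB instance
    images = client["mydb"]["images"]

    def store_image(name: str, png_bytes: bytes) -> None:
        # Store the raw PNG bytes as a BSON binary blob (subtype 0, the
        # generic byte subtype) - no base64 or other re-encoding.
        images.insert_one({"name": name, "image": Binary(png_bytes, subtype=0)})

    def load_image_bytes(name: str) -> bytes:
        # Binary is a bytes subclass, so the result can be fed straight
        # to io.BytesIO for decoding.
        return bytes(images.find_one({"name": name})["image"])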

To make a reference image, I copy-pasted several screenshots of your question text into GIMP and exported the result as a PNG with these settings:

[Image: PNG export settings]

At your stated 320 DPI, and assuming 8.5"x11", this produces

    $ exiftool document.png
    ExifTool Version Number         : 12.40
    File Name                       : document.png
    Directory                       : .
    File Size                       : 718 KiB
    File Modification Date/Time     : 2024:11:24 11:16:27-05:00
    File Access Date/Time           : 2024:11:24 11:17:06-05:00
    File Inode Change Date/Time     : 2024:11:24 11:16:27-05:00
    File Permissions                : -rw-rw-r--
    File Type                       : PNG
    File Type Extension             : png
    MIME Type                       : image/png
    Image Width                     : 2720
    Image Height                    : 3520
    Bit Depth                       : 8
    Color Type                      : RGB
    Compression                     : Deflate/Inflate
    Filter                          : Adaptive
    Interlace                       : Noninterlaced
    Background Color                : 0 0 0
    Pixels Per Unit X               : 12598
    Pixels Per Unit Y               : 12598
    Pixel Units                     : meters
    Image Size                      : 2720x3520
    Megapixels                      : 9.6

with a similar size to your ~800 KiB.

Since you care about mode L, the first and most obvious optimisation is to actually use that in your database. Again using the reference image I made, and switching to these settings:

[Image: greyscale export settings]

we get

    $ exiftool document.png
    ExifTool Version Number         : 12.40
    File Name                       : document.png
    Directory                       : .
    File Size                       : 291 KiB
    File Modification Date/Time     : 2024:11:24 11:25:41-05:00
    File Access Date/Time           : 2024:11:24 11:17:06-05:00
    File Inode Change Date/Time     : 2024:11:24 11:25:41-05:00
    File Permissions                : -rw-rw-r--
    File Type                       : PNG
    File Type Extension             : png
    MIME Type                       : image/png
    Image Width                     : 2720
    Image Height                    : 3520
    Bit Depth                       : 8
    Color Type                      : Grayscale
    Compression                     : Deflate/Inflate
    Filter                          : Adaptive
    Interlace                       : Noninterlaced
    Background Color                : 0
    Pixels Per Unit X               : 12598
    Pixels Per Unit Y               : 12598
    Pixel Units                     : meters
    Image Size                      : 2720x3520
    Megapixels                      : 9.6

This should fully obviate the first call to convert() and takes 59% less space.
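
A minimal sketch of that write-time conversion, building on the question's get_encoded_image (filesystem_io is the question's own helper):

    import io

    def get_encoded_greyscale_image(filename: str) -> bytes:
        """Encode as a single-channel ("L") PNG so the database already
        holds greyscale data and no convert("L") is needed at read time."""
        image = filesystem_io.read_as_pillow(filename)
        stream = io.BytesIO()
        image.convert("L").save(stream, format="PNG")
        return stream.getvalue()  # written to MongoDB as before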

I find it strange that your bytes_to_matrices returns both colour and greyscale images. If you really, really need the colour image as well (whether OCR benefits from it is dubious) - and if the conversion is the bottleneck - then you can pursue a similar strategy and save a second copy of the image in RGB8 format. The benefit may be blunted if e.g. there's a network hop to your database or your database hard drive is slow.
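
If you do go that route, here is a sketch of the two-copy approach, reusing the hypothetical names from the snippets above:

    import io
    from bson.binary import Binary

    def encode_image_variants(filename: str) -> dict:
        """Encode greyscale and RGB blobs side by side so that neither
        convert() call is needed when reading."""
        image = filesystem_io.read_as_pillow(filename)
        blobs = {}
        for mode, key in (("L", "grey_png"), ("RGB", "rgb_png")):
            stream = io.BytesIO()
            image.convert(mode).save(stream, format="PNG")
            blobs[key] = Binary(stream.getvalue(), subtype=0)
        return blobs  # e.g. images.insert_one({"name": ..., **blobs})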

Another strategy to benchmark is to remove compression altogether. This trades space for time: the image will take more database space but will hopefully be faster for PIL to load. Try BMP.
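
A rough benchmark sketch comparing the two formats, with a hypothetical input file:

    import io
    import time
    from PIL import Image

    image = Image.open("document.png").convert("L")  # hypothetical input file

    for fmt in ("PNG", "BMP"):
        stream = io.BytesIO()
        image.save(stream, format=fmt)
        blob = stream.getvalue()

        start = time.perf_counter()
        for _ in range(10):
            Image.open(io.BytesIO(blob)).load()  # .load() forces a full decode
        per_decode = (time.perf_counter() - start) / 10

        print(f"{fmt}: {len(blob) / 1024:.0f} KiB, {per_decode * 1000:.1f} ms/decode")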
