I am looking for feedback on the function below, which loads a PNG image, stored as bytes in MongoDB, into NumPy arrays.
```python
import io

from PIL import Image
import numpy as np


def bytes_to_matricies(image_bytes):
    """Convert image bytes into Pillow image objects.

    image_bytes: image bytes read from MongoDB
    """
    raw_image = Image.open(io.BytesIO(image_bytes))
    greyscale_matrix = np.array(raw_image.convert("L"))
    color_matrix = np.array(raw_image.convert("RGB"))
    n, m = greyscale_matrix.shape
    return greyscale_matrix, color_matrix, n, m
```
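One thing worth trying (a sketch, untested on your data; `bytes_to_matrices_np` is a hypothetical variant name): decode to RGB once and derive the greyscale matrix in NumPy with ITU-R 601 luma weights, so Pillow only performs one `convert`. Note the NumPy result can differ from Pillow's `"L"` mode by ±1 per pixel because of rounding:

```python
import io

import numpy as np
from PIL import Image


def bytes_to_matrices_np(image_bytes):
    """Decode PNG bytes once; derive greyscale from the RGB array."""
    raw_image = Image.open(io.BytesIO(image_bytes))
    color_matrix = np.array(raw_image.convert("RGB"))
    # ITU-R 601 luma weights (the same formula behind Pillow's "L" mode)
    grey = color_matrix @ np.array([0.299, 0.587, 0.114])
    greyscale_matrix = grey.astype(np.uint8)
    n, m = greyscale_matrix.shape
    return greyscale_matrix, color_matrix, n, m
```

Whether this is actually faster depends on image size; `convert("L")` runs in C inside Pillow, so measure before committing to it.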
I have profiled my code with cProfile and found this function to be a big bottleneck. Any way to optimise it would be great. Note that I have compiled most of the project with Cython, which is why you'll see .pyx files in the profile; this hasn't affected much.
```
Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       72  331.537    4.605  338.226    4.698 cleaner.pyx:154(clean_image)
        1  139.401  139.401  139.401  139.401 {built-in method builtins.input}
      356   31.144    0.087   31.144    0.087 {method 'recv_into' of '_socket.socket' objects}
    11253   15.421    0.001   15.421    0.001 {method 'encode' of 'ImagingEncoder' objects}
      706   10.561    0.015   10.561    0.015 {method 'decode' of 'ImagingDecoder' objects}
       72    5.044    0.070    5.047    0.070 {built-in method scipy.ndimage._ni_label._label}
     7853    0.881    0.000    0.881    0.000 cleaner.pyx:216(is_period)
       72    0.844    0.012    1.266    0.018 cleaner.pyx:349(get_binarized_matrix)
       72    0.802    0.011    0.802    0.011 {method 'convert' of 'ImagingCore' objects}
       72    0.786    0.011   13.167    0.183 cleaner.pyx:57(bytes_to_matricies)
```
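For anyone reproducing this, a minimal profiling harness that produces the same internal-time ordering might look like the following (`work` is a placeholder for the real entry point):

```python
import cProfile
import pstats


def work():  # placeholder for the real entry point
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Equivalent to the "Ordered by: internal time" listing above
stats = pstats.Stats(profiler).sort_stats("tottime")
stats.print_stats(10)
```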
If you are wondering how the images are encoded before being written to MongoDB, here is that code:
```python
def get_encoded_image(filename: str):
    """Binary encodes image."""
    # filesystem_io.read_as_pillow just reads the file on disk into a Pillow Image object
    image = filesystem_io.read_as_pillow(filename)
    stream = io.BytesIO()
    image.save(stream, format='PNG')
    encoded_string = stream.getvalue()
    return encoded_string  # This will be written to MongoDB
```
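Since PNG is lossless, a quick round-trip check (using an in-memory image in place of `filesystem_io`, which isn't shown here) confirms that the bytes written to MongoDB decode back to identical pixels:

```python
import io

import numpy as np
from PIL import Image


def encode_pillow_image(image: Image.Image) -> bytes:
    """Same encoding as get_encoded_image, but taking a Pillow image directly."""
    stream = io.BytesIO()
    image.save(stream, format="PNG")
    return stream.getvalue()


original = Image.new("RGB", (8, 8), (200, 100, 50))
encoded = encode_pillow_image(original)  # what would be stored in MongoDB
decoded = Image.open(io.BytesIO(encoded))
assert np.array_equal(np.array(original), np.array(decoded.convert("RGB")))
```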
Things I have tried:
- As mentioned above, I tried compiling with Cython.
- I have tried the lycon library, but could not see how to load from bytes.
- I have tried Pillow-SIMD; it made things slower.
- I am able to use multiprocessing, but I want to optimise the function before I parallelise it.
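When it does come to parallelising, a minimal `multiprocessing.Pool` sketch over the decode step could look like this (the tiny in-memory PNGs built by `make_png` stand in for documents fetched from MongoDB):

```python
import io
import multiprocessing

import numpy as np
from PIL import Image


def decode_one(image_bytes):
    """Worker: decode one PNG byte string into (greyscale, colour) arrays."""
    raw = Image.open(io.BytesIO(image_bytes))
    return np.array(raw.convert("L")), np.array(raw.convert("RGB"))


def make_png(color):
    """Build a tiny in-memory PNG, standing in for a MongoDB document."""
    buf = io.BytesIO()
    Image.new("RGB", (4, 4), color).save(buf, format="PNG")
    return buf.getvalue()


if __name__ == "__main__":
    all_image_bytes = [make_png((255, 0, 0)), make_png((0, 0, 255))]
    with multiprocessing.Pool(processes=2) as pool:
        matrices = pool.map(decode_one, all_image_bytes)
    print(len(matrices))  # 2
```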
Thank you!
UPDATE: To answer the questions from Reinderein: the images are photographs of documents, which will eventually be OCR'd. I'm not sure how lossy compression would affect the OCR quality. DPI is 320; size on disk is roughly 800 KB each.
A follow-up comment (truncated at the start) reads:

> […] `raw_image.convert("L")`). But I believe that test is itself misleading: it is as if subsequent calls to `.convert` in Pillow were benefiting from some sort of cache. It's also possible the picture I used does not yield representative results. NumPy might be used for grayscale conversion instead of PIL; see: e2eml.school/convert_rgb_to_grayscale.html
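On the caching suspicion: a fairer micro-benchmark re-opens the image from bytes on every call, so nothing decoded by a previous call can be reused (a sketch using a synthetic image in place of real data):

```python
import io
import timeit

from PIL import Image


def fresh_convert(png_bytes):
    # Re-open from bytes each call so no decoded data survives between runs
    return Image.open(io.BytesIO(png_bytes)).convert("L")


buf = io.BytesIO()
Image.new("RGB", (256, 256), (120, 60, 30)).save(buf, format="PNG")
png_bytes = buf.getvalue()

elapsed = timeit.timeit(lambda: fresh_convert(png_bytes), number=50)
print(f"50 fresh open+convert calls: {elapsed:.3f}s")
```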