\$\begingroup\$

The save file format is fixed and I cannot change it. There are only three types of data:

  • long (NumPy int32 array, independent of platform)
  • double (NumPy double array)
  • string (UTF-8, but ord(c) < 128 for all characters)

The first two are saved in their binary representation. Each string is saved as text, preceded by an int32 holding its length (the length is not fixed within a column).
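To make the layout concrete, here is a minimal sketch of how a single record with one value of each type would be laid out. The values are made up, and the byte order is simply whatever `tobytes()` produces on the machine running it; the description above does not state an endianness, so treat that as an assumption.

```python
import numpy as np

# One hypothetical record: a double, a long (int32) and a string.
x = np.array([1.5], dtype=np.float64)
y = np.array([7], dtype=np.int32)
u = 'abc'

record = (
    x.tobytes()                                       # 8 bytes: raw double
    + y.tobytes()                                     # 4 bytes: raw int32
    + np.array([len(u)], dtype=np.int32).tobytes()    # 4 bytes: string length
    + u.encode('ascii')                               # len(u) bytes: characters
)

print(len(record))  # 8 + 4 + 4 + 3 = 19
```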

I have input data of the form:

bun = {'x': np.zeros(n_col), 'y': np.zeros(n_col, dtype=np.int32), 'z': np.zeros(n_col), 'u': ['abc'] * n_col} 

And I need to save it line-by-line.

Pseudo code:

for i in range(n_col):
    for col in ordered_column_names:
        save(bun[col][i])

That many individual file writes get costly (I tried that first, and I have up to 1M lines), and the files are limited to under 150 MB, so I decided to write everything into a bytes buffer and then write it to disk in one go. But I feel my code could be simplified. Please let me know your thoughts on it.

import io
from collections import OrderedDict
import numpy as np

###############################
### Functions to comment on ###
###############################

def join(*args):
    return b''.join(args[0])

def array2binary_list(bin):
    """
    Converts an numpy array into an list of bytes.
    !!! Note returns generator !!!
    """
    return map(bytes, zip(*[iter(bin.tobytes())] * bin.itemsize))

def write_binary_col(fid, bun, col):
    """
    Write down the data in the variable space format.
    For int32 and double this means just the binary.
    For str this means ascii representation preceded with the str length in
    int32
    """
    raw = []
    for key, val in col.items():
        if (val['type'] == 'double') or (val['type'] == 'long'):
            raw.append(array2binary_list(bun[key]))
        elif val['type'] == 'string':
            str_len = np.array([len(line) for line in bun[key]], dtype=np.int32)
            raw.append(array2binary_list(str_len))
            raw.append(line.encode('ascii') for line in bun[key])
        else:
            raise TypeError('Unknown Type {}'.format(val['type']))

    # Transpose and serialize it first
    fid.write(b''.join(map(join, zip(*raw))))

#########################################################
### Some helper functions to illustrate the problem   ###
#########################################################

def create_data(n_col):
    """
    This generates just some fake data
    """
    bun = {'x': np.zeros(n_col),
           'y': np.zeros(n_col, dtype=np.int32),
           'z': np.zeros(n_col),
           'u': ['abc'] * n_col}
    col = OrderedDict()
    col['x'] = {'type': 'double'}
    col['y'] = {'type': 'long'}
    col['z'] = {'type': 'double'}
    col['u'] = {'type': 'string'}
    return (bun, col)

if __name__ == '__main__':
    fid = io.BytesIO()
    bun, col = create_data(100)
    write_binary_col(fid, bun, col)
    print(fid.getvalue())
\$\endgroup\$

    1 Answer

    \$\begingroup\$

    Your join function is confusing: why use *args if you're only going to take args[0]? As far as I can see, you could just rewrite it like this:

    def join(arg):
        return b''.join(arg)

    The only difference is that you can no longer pass arbitrary unused parameters, but that communicates the function's use better. If you have to use this as a workaround, then you should note why it's necessary in a docstring, since it's not obvious from context and your function just looks like a convenient way to build bytestrings.

    def join(*args):
        """Return a bytestring of the first argument's elements joined.

        Will ignore all arguments after the first.
        This is necessary to deal with the foo bar."""
        return b''.join(args[0])
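As a quick check that the single-argument version is a drop-in replacement, here is a small sketch with two made-up columns of byte chunks. The name `join_star` is only used here to keep both variants side by side; it is not from the question.

```python
def join_star(*args):
    """The question's version: ignores everything after the first argument."""
    return b''.join(args[0])

def join(arg):
    """The simplified version suggested above."""
    return b''.join(arg)

# Two hypothetical columns, already split into per-row byte chunks.
col_a = [b'a1', b'a2', b'a3']
col_b = [b'b1', b'b2', b'b3']

# zip(...) yields one tuple per row; map passes each tuple as ONE argument,
# so the *args form never receives more than a single positional argument.
rows = list(zip(col_a, col_b))
old = b''.join(map(join_star, rows))
new = b''.join(map(join, rows))
assert old == new == b'a1b1a2b2a3b3'
```

Because `map` only ever hands over one tuple per call, nothing in the question's code relies on the extra arguments being swallowed.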

    Semantically, I think it might make sense for you to raise a ValueError for the unknown type error. Backwards as it sounds, the user has passed the correct type (a dictionary) but that dictionary has an invalid value. It doesn't matter that the value indicates a type, the error the user has made is with the values they provided.
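A minimal sketch of what that could look like, assuming the type names are the three strings used in the question. `KNOWN_TYPES` and `check_column_type` are hypothetical names for illustration, not part of the original code.

```python
KNOWN_TYPES = ('double', 'long', 'string')  # the three types named in the question

def check_column_type(type_name):
    """Hypothetical helper: reject type names the file format does not support."""
    if type_name not in KNOWN_TYPES:
        # The argument is a perfectly good str; it is its value that is invalid.
        raise ValueError('Unknown type {!r}'.format(type_name))
    return type_name

message = None
try:
    check_column_type('float16')
except ValueError as exc:
    message = str(exc)

print(message)
```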

    \$\endgroup\$
    • \$\begingroup\$Thank you for your improvements. I somehow tripped over my own toes here, because I can strip the entire function, thereby making the code more readable and performant. I didn't mention it before, but bun is a container class inheriting from dict, which is only allowed to store one of these three types. I can see the argument for ValueError, since it is a value of bun that causes the error, but then again it's more the type of the stored value than the value itself. (docs.python.org/3/library/exceptions.html#ValueError) I have to think about this one a bit more.\$\endgroup\$
      – magu_
      Commented Jan 20, 2016 at 16:13
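For reference, one way the helper could be dropped entirely is to flatten the transposed rows with `itertools.chain.from_iterable`. This is only a sketch on made-up data, not the asker's final code. It also swaps the `zip(*[iter(...)] * itemsize)` idiom for plain slicing, and `array_to_chunks` is a hypothetical name.

```python
from itertools import chain

import numpy as np

def array_to_chunks(arr):
    """Split an array's raw bytes into one bytes object per element."""
    raw = arr.tobytes()
    size = arr.itemsize
    return [raw[i:i + size] for i in range(0, len(raw), size)]

# Hypothetical data: one double column and one string column, two rows.
x = np.array([1.0, 2.0])
u = ['ab', 'cde']

lengths = np.array([len(s) for s in u], dtype=np.int32)
columns = [
    array_to_chunks(x),                 # 8 bytes per row
    array_to_chunks(lengths),           # 4 bytes per row: string length
    [s.encode('ascii') for s in u],     # variable number of bytes per row
]

# zip(*columns) transposes columns to rows; chain flattens them in row order.
payload = b''.join(chain.from_iterable(zip(*columns)))
print(len(payload))  # (8 + 4 + 2) + (8 + 4 + 3) = 29
```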
