The save file format is fixed. There are only three types of data:
- long (NumPy int32 array, independent of platform)
- double (NumPy double array)
- string (UTF-8, but ord(c) < 128 for all characters)
The first two are saved in their binary representation. Strings are saved as text, each preceded by an int32 holding its length (the length is not fixed within a column).
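To make the layout concrete, here is a minimal sketch of how a single row could be encoded with the standard `struct` module. The byte order is not stated above, so little-endian is an assumption here, and `encode_row` is a hypothetical helper name used only for illustration.

```python
import struct


def encode_row(x, y, z, u):
    """Encode one row as double, int32, double, length-prefixed ASCII string."""
    # Raises UnicodeEncodeError if any character has ord(c) >= 128.
    text = u.encode('ascii')
    # '<' = little-endian, no padding (an assumption); d = float64, i = int32.
    numeric = struct.pack('<did', x, y, z)
    length_prefix = struct.pack('<i', len(text))
    return numeric + length_prefix + text


row = encode_row(0.0, 0, 0.0, 'abc')
print(len(row))  # 8 + 4 + 8 + 4 + 3 = 27 bytes
```

Each row therefore has a variable size: 24 bytes for the numeric fields and the length prefix, plus one byte per character of the string.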
I have input data of the form:
    bun = {'x': np.zeros(n_col),
           'y': np.zeros(n_col, dtype=np.int32),
           'z': np.zeros(n_col),
           'u': ['abc'] * n_col}
And I need to save it line-by-line.
Pseudo code:
    for i in range(n_col):
        for col in ordered_column_names:
            save(bun[col][i])
Because that many file I/O operations can get costly (yes, I tried that first, and I have up to 1M lines) and the files are limited to less than 150 MB, I decided to write everything into a bytes buffer and then write it to disk in one go. But I feel my code could be simplified. Please let me know your thoughts on it.
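The buffering idea on its own can be sketched as follows: serialize every row into an in-memory `io.BytesIO` and hand the finished buffer to the file in a single `write` call. This only illustrates the pattern and is not the code under review; `save_rows` and the file name `demo.bin` are hypothetical, and the native byte order produced by NumPy's `tobytes()` is assumed to be acceptable.

```python
import io
import os
import tempfile

import numpy as np


def save_rows(path, bun, ordered_column_names, n_rows):
    """Serialize all rows into memory first, then write to disk once."""
    buf = io.BytesIO()
    for i in range(n_rows):
        for name in ordered_column_names:
            value = bun[name][i]
            if isinstance(value, str):
                text = value.encode('ascii')
                buf.write(np.int32(len(text)).tobytes())  # int32 length prefix
                buf.write(text)
            else:
                buf.write(value.tobytes())  # NumPy scalar as raw bytes
    with open(path, 'wb') as fid:
        fid.write(buf.getvalue())  # the only disk write


n = 3
bun = {'x': np.zeros(n),
       'y': np.zeros(n, dtype=np.int32),
       'z': np.zeros(n),
       'u': ['abc'] * n}
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
save_rows(path, bun, ['x', 'y', 'z', 'u'], n)
print(os.path.getsize(path))  # 3 rows * 27 bytes = 81
```

The nested Python loop is still slow for 1M rows, but the disk is touched only once, which is the point of the buffer.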
    import io
    from collections import OrderedDict

    import numpy as np


    ###############################
    ### Functions to comment on ###
    ###############################

    def join(*args):
        return b''.join(args[0])


    def array2binary_list(bin):
        """
        Converts a NumPy array into a list of bytes.
        !!! Note: returns a generator !!!
        """
        return map(bytes, zip(*[iter(bin.tobytes())] * bin.itemsize))


    def write_binary_col(fid, bun, col):
        """
        Write down the data in the variable space format.
        For int32 and double this means just the binary.
        For str this means the ASCII representation preceded by the
        str length as int32.
        """
        raw = []
        for key, val in col.items():
            if (val['type'] == 'double') or (val['type'] == 'long'):
                raw.append(array2binary_list(bun[key]))
            elif val['type'] == 'string':
                str_len = np.array([len(line) for line in bun[key]], dtype=np.int32)
                raw.append(array2binary_list(str_len))
                raw.append(line.encode('ascii') for line in bun[key])
            else:
                raise TypeError('Unknown Type {}'.format(val['type']))

        # Transpose and serialize it first
        fid.write(b''.join(map(join, zip(*raw))))


    #########################################################
    ### Some helper functions to illustrate the problem   ###
    #########################################################

    def create_data(n_col):
        """
        This generates just some fake data
        """
        bun = {'x': np.zeros(n_col),
               'y': np.zeros(n_col, dtype=np.int32),
               'z': np.zeros(n_col),
               'u': ['abc'] * n_col}

        col = OrderedDict()
        col['x'] = {'type': 'double'}
        col['y'] = {'type': 'long'}
        col['z'] = {'type': 'double'}
        col['u'] = {'type': 'string'}
        return (bun, col)


    if __name__ == '__main__':
        fid = io.BytesIO()
        bun, col = create_data(100)
        write_binary_col(fid, bun, col)
        print(fid.getvalue())