
I am a physics PhD student who co-developed a Python package for converting standard file formats used in Scanning Probe Microscopy (SPM) and for archiving data manipulations for traceability. If you want to know more about the package, an article was published here: https://doi.org/10.1016/j.ultramic.2021.113345

In summary, the goal of the project is to have a well-defined file structure inside an HDF5 file, which contains three root folders (h5py.Group objects): datasets, metadata and process. datasets contains the raw data extracted from an existing file in a standard SPM format; metadata contains the metadata of that existing file; and process contains all the data manipulations that have been applied to the raw data (or to existing processed data), with attributes saving all the information necessary for data-manipulation traceability.
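For illustration, the intended layout of a converted file looks roughly like this (the exact group names come from the converted source file and the functions applied):

    example.hdf5
    ├── datasets/                      raw data, one group per converted file
    │   └── <source stem>/<channel>
    ├── metadata/                      metadata copied from the source file
    │   └── <source stem>
    └── process/                       processed data plus traceability attributes
        └── 001-<function name>/<output name>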

The initial version of the code was built without much thought about best practices or extensibility, so I am currently doing a "big rewrite".

I am almost done with a class called hyFile, a wrapper around h5py.File, which will handle:

  1. The file conversion (from a standard SPM format to HDF5).
  2. The reading and writing of data from the HDF5 file (using the existing package h5py).
  3. The traceability of processed data through the apply function, which takes an arbitrary function and the path to the data that needs processing, and saves the result into the process folder.

It contains all the necessary dunder methods to write/read data in the HDF5 file and to act as a context manager. It also contains an internal class which allows the manipulation of HDF5 attributes.

import h5py
import pathlib
import warnings
from . import ibw_files, gsf_files
import inspect
import numpy as np
from datetime import datetime
import types


class hyFile:
    class Attributes:
        def __init__(self, file):
            self.file = file

        def __contains__(self, path: str) -> bool:
            return path in self.file

        def __getitem__(self, path: str | None = None) -> dict:
            if path is None:
                f = self.file[""]
            if path != "":
                f = self.file[path]
            return {key: f.attrs[key] for key in f.attrs.keys()}

        def __setitem__(self, path: str, attribute: tuple[str, any]) -> None:
            self.file[path].attrs[attribute[0]] = attribute[1]

    def __init__(self, path, mode="r+"):
        if isinstance(path, str):
            self.path = pathlib.Path(path)
        else:
            self.path = path

        if not self.path.is_file():
            self.file = h5py.File(self.path, "a")
            for group in ["datasets", "metadata", "process"]:
                self._require_group(group)
            if mode != "a":
                self.file.close()
                self.file = h5py.File(self.path, mode)
        else:
            self.file = h5py.File(self.path, mode)
            root_struct = set(self.file.keys())
            if root_struct != {"datasets", "metadata", "process"}:
                warnings.warn(
                    f"Root structure of the hdf5 file is not composed of 'datasets', 'metadata' and 'process'. \n It may not have been created with Hystorian."
                )

        self.attrs = self.Attributes(self.file)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback) -> None:
        if self.file:
            self.file.close
        if value is not None:
            warnings.warn(f"File closed with the following error: {type} - {value} \n {traceback}")

    def __getitem__(self, path: str = "") -> h5py.Group | h5py.Dataset:
        if path == "":
            return self.file
        else:
            return self.file[path]

    def __setitem__(self, data: tuple[str, any], overwrite=True) -> None:
        self._create_dataset(data, overwrite)

    def __delitem__(self, path: str) -> None:
        if path not in self.file:
            raise KeyError("Path does not exist in the file.")
        del self.file[path]

    def __contains__(self, path: str) -> bool:
        return path in self.file

    def read(self, path: str = ""):
        if path == "":
            return self.file.keys()
        else:
            if isinstance(self.file[path], h5py.Group):
                return self.file[path].keys()
            else:
                return self.file[path][()]

    def extract_data(self, path: str):
        conversion_fcts = {
            ".gsf": gsf_files.extract_gsf,
            ".ibw": ibw_files.extract_ibw,
        }

        if isinstance(path, str):
            path = pathlib.Path(path)

        if path.suffix in conversion_fcts:
            data, metadata, attributes = conversion_fcts[path.suffix](path)
            self._write_extracted_data(path, data, metadata, attributes)
        else:
            raise TypeError(f"{path.suffix} file don't have a conversion function.")

    def apply(
        self,
        function: callable,
        inputs: list[str] | str,
        output_names: list[str] | str | None = None,
        increment_proc: bool = True,
        **kwargs,
    ):
        def convert_to_list(inputs):
            if isinstance(inputs, list):
                return inputs
            return [inputs]

        inputs = convert_to_list(inputs)
        if output_names is None:
            output_names = inputs[0].rsplit("/", 1)[1]
        output_names = convert_to_list(output_names)

        result = function(*inputs, **kwargs)
        if result is None:
            return None
        if not isinstance(result, tuple):
            result = tuple([result])

        if len(output_names) != len(result):
            raise Exception("Error: Unequal amount of outputs and output names")

        num_proc = len(self.read("process"))
        if increment_proc or f"{str(num_proc).zfill(3)}-{function.__name__}" not in self.read("process"):
            num_proc += 1

        out_folder_location = f"{'process'}/{str(num_proc).zfill(3)}-{function.__name__}"

        for name, data in zip(output_names, result):
            self._create_dataset((f"{out_folder_location}/{name}", data))
            self._write_generic_attributes(f"{out_folder_location}/{name}", inputs, name, function)
            self._write_kwargs_as_attributes(f"{out_folder_location}/{name}", function, kwargs, first_kwarg=len(inputs))

    def _write_generic_attributes(
        self, out_folder_location: str, in_paths: list[str] | str, output_name: str, function: callable
    ) -> None:
        if not isinstance(in_paths, list):
            in_paths = [in_paths]

        operation_name = out_folder_location.split("/")[1]
        new_attrs = {
            "path": out_folder_location + output_name,
            "shape": np.shape(self[out_folder_location]),
            "name": output_name,
        }

        if function.__module__ is None:
            new_attrs["operation name"] = "None." + function.__name__
        else:
            new_attrs["operation name"] = function.__module__ + "." + function.__name__

        if function.__module__ == "__main__":
            new_attrs["function code"] = inspect.getsource(function)

        new_attrs["operation number"] = operation_name.split("-")[0]
        new_attrs["time"] = str(datetime.datetime.now())
        new_attrs["source"] = in_paths

        self.attrs[out_folder_location] = new_attrs

    def _write_kwargs_as_attributes(
        self, path: str, func: callable, all_variables: dict[str, any], first_kwarg: int = 1
    ) -> None:
        attr_dict = {}
        if isinstance(func, types.BuiltinFunctionType):
            attr_dict["BuiltinFunctionType"] = True
        else:
            signature = inspect.signature(func).parameters
            var_names = list(signature.keys())[first_kwarg:]
            for key in var_names:
                if key in all_variables:
                    value = all_variables[key]
                elif isinstance(signature[key].default, np._globals._NoValueType):
                    value = "None"
                else:
                    value = signature[key].default

                if callable(value):
                    value = value.__name__
                elif value is None:
                    value = "None"

                try:
                    attr_dict[f"kwargs_{key}"] = value
                except RuntimeError:
                    RuntimeWarning("Attribute was not able to be saved, probably because the attribute is too large")
                    attr_dict["kwargs_" + key] = "None"

        self.attrs[path] = attr_dict

    def _require_group(self, name: str):
        self.file.require_group(name)

    def _create_dataset(self, data: tuple[str, any], overwrite=True) -> None:
        if data[0] in self.file:
            if overwrite:
                del self.file[data[0]]
            else:
                raise KeyError("Key already exist and overwriste is set to False.")

        self.file.create_dataset(data[0], data=data[1])

    def _write_extracted_data(self, path, data, metadata, attributes) -> None:
        self._require_group(f"datasets/{path.stem}")

        for d_key, d_value in data.items():
            self._create_dataset((f"datasets/{path.stem}/{d_key}", d_value), overwrite=True)
            for attr in attributes[d_key].items():
                self.attrs[f"datasets/{path.stem}/{d_key}"] = attr

        if isinstance(metadata, dict):
            for key in metadata:
                self._create_dataset((f"metadata/{path.stem}/{key}", metadata[key]))
        else:
            self._create_dataset((f"metadata/{path.stem}", metadata))

Here are a few of the questions that I have.

  1. As it is right now, the class is a context manager, which opens the HDF5 file at the creation of the object, and the file stays open as long as the object exists. This simplifies manipulation a lot, since I can just pass the file object along. However, could this create conflicts with the OS trying to access the file?
  2. I have the feeling that the class is doing too much at the same time: creating files, reading/writing, conversion, data manipulation. However, I am not sure how I would split it into smaller chunks.
  3. I am not very happy with the apply function. As it is now, it takes a function, the inputs for it, the names of the outputs to be saved, and **kwargs that are passed to the function. I am not sure how to make it as general as possible without bloating it: in theory any kind of function could be passed to it; in practice most of the ones passed will manipulate some kind of numpy array and also return a numpy array.
  4. I am also not sure about extract_data: it contains a dict of functions that extract data for multiple file formats and always return three dictionaries: data, metadata and attributes. Is this a good way to do it?

    1 Answer

    Thank you for your work to advance how folks take images of tiny things!


    I'd like to comment on something outside of the submission:

    def m_apply(filename, function, ...):
        """
        Take any function and handles the inputs of the function by looking into the hdf5 file
        Also write the output of the function into the hdf5.

    I found the project documentation frustrating, including those two sentences. At first blush it appeared to me that m_apply was central to the project, a core concept that I should understand at once before moving on to other details.

    I eventually formed the opinion that I should probably be mentally pronouncing it "matrix apply", that is, applying some function to a named input matrix. But I am sad that neither the tutorial, "readthedocs" text, nor this docstring ever mentions that.

    Moving on, both the tutorial and the (short!) source code explained to me that l_apply should be read as "list apply", which repeatedly invokes m_apply. So that much is kind of cool. Though again, the """docstring""" could be improved.


    Ok, back to the submitted code.

    You have a bunch of imports up front. Please run isort *.py to clean them up. Why does conforming to PEP 8 matter? Because it makes it easier for the Gentle Reader to see what's going on. Plus, it reduces merge conflicts when two separate feature branches each add their own new deps.

    It would be more convenient to phrase it this way: from pathlib import Path
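    For illustration, this is roughly what isort produces for that import block -- a sketch, since the exact grouping depends on your isort configuration (stdlib first, then third-party, then local imports):

     import inspect
     import pathlib
     import types
     import warnings
     from datetime import datetime

     import h5py
     import numpy as np

     from . import gsf_files, ibw_files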


    class hyFile: 

    No. Please spell it HyFile. Do not create a Public API with a misspelled class name.


     def __init__(self, file): 

    This is astonishing. Perhaps file is the appropriate identifier, I'm not certain about that topic yet.

    But if you're going to call it file, then you absolutely must add an "optional" type annotation to this signature. As written, I expected that it meant something like

     def __init__(self, file: Path): 

    or

     def __init__(self, file: str): 

    or conceivably some open file handle. Explaining it in a """docstring""" would not be enough -- you need to annotate the file parameter's type if you want clear communication with your readers.


     def __contains__(self, path: str) -> bool: 

    I am a little sad we don't see path: Path, but OK, whatever.


     def __getitem__(self, path: str | None = None) -> dict: 

    This is lovely. Maybe consider using the Optional[str] notation. GvR claims "there should be one obvious way to do it!", so whichever of the two forms you pick, use it consistently.


     if path is None:
         f = self.file[""]
     if path != "":
         f = self.file[path]
     return {key: f.attrs[key] for key in f.attrs.keys()}

    Ok, the technical term for that is "pretty bad".

    You seem to contemplate that caller might pass in the empty string. And then, when that happens, you dereference uninitialized local f, giving a NameError.

    I imagine that linting with mypy might have caught that? Haven't tried it yet.
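    One possible repair, assuming the intent is that both None and the empty string mean "attributes of the root" (h5py addresses the root group as "/"):

     def __getitem__(self, path: str | None = None) -> dict:
         # Treat None and "" as the root group; anything else is a sub-path.
         f = self.file[path] if path else self.file["/"]
         return dict(f.attrs)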


     self.file[path].attrs[attribute[0]] = attribute[1] 

    This is nice enough. Certainly it works.

    It would be nicer if you introduced names for those, via tuple unpack:

     k, v = attribute
     self.file[path].attrs[k] = v

    (Or key, val, whatever.)


    The __init__ constructor is doing kind of a lot of stuff there. Maybe banish some of it to a helper? or two?

    Think about adding some method .foo() to this class, either you or another maintainer. And think about authoring a new unit test. How hard do we want to make it to add such a test? We need an H5 file before we can even begin to think about testing some utility function. :-(

    The sticking point seems to be having a valid self.file. Maybe require caller to pass it in? A utility function outside the class could help callers with that. Alternatively, we might have a pair of similar classes, one which accepts a str name, the other accepting an existing file Path.

     if isinstance(path, str):
         self.path = pathlib.Path(path)
     else:
         self.path = path

    This seems tediously long-winded.

    Just unconditionally assign self.path = Path(path) and be done with it. And similarly in .extract_data().

    A redundant call of str("abc") will similarly return those three characters unchanged.

    We have if NOT .is_file(): A else: B. Consider rephrasing it as if .is_file(): B else: A. Humans are better at reasoning about code that lacks negation. (And double negative? Don't even go there.)

    This .is_file() test does not seem robust, in the sense that H5 and FS have different namespaces, and there could accidentally be a name collision between them. So again, allowing the caller to choose his own adventure might be the better route to follow.

    A maintainer would probably be grateful if you mention the value of f"{root_struct}" when you .warn().
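    Putting the last few points together, a sketch of the positive-condition form (same behavior otherwise; _check_root_structure is a hypothetical helper that would hold the warn logic, including the value of root_struct):

     self.path = pathlib.Path(path)
     if self.path.is_file():
         self.file = h5py.File(self.path, mode)
         self._check_root_structure()  # hypothetical helper: warn, showing root_struct
     else:
         self.file = h5py.File(self.path, "a")
         for group in ["datasets", "metadata", "process"]:
             self._require_group(group)
         if mode != "a":
             self.file.close()
             self.file = h5py.File(self.path, mode)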


     def __enter__(self):
         return self

    Oh, nice, we have a context manager!

    The class """docstring""" probably should have spelled that out.

    Certainly this definition is correct. But it's not usual to define such a method, as the contextmanager decorator typically suffices.
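    For example, a hypothetical module-level helper built on that decorator might look like:

     from contextlib import contextmanager

     @contextmanager
     def open_hyfile(path, mode="r+"):
         # Hypothetical helper; guarantees the underlying h5py.File is closed.
         f = hyFile(path, mode)
         try:
             yield f
         finally:
             f.file.close()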


     if self.file:
         self.file.close

    I don't understand what's going on here at all.

    It seems like Author's Intent must have been .close() ?


     def __exit__(self, type, value, traceback) -> None: 

    I don't understand this signature. Or rather, I don't agree with it, we could easily make it more closely match the documentation.

    Please don't shadow the builtin type. Prefer to "be boring"; prefer the exc_type mentioned in the docs.

     if value is not None: 

    This is actually pretty bad. I have participated in many code reviews where hopelessly vague identifiers like this were criticized, though in some contexts a name of value makes perfect sense. Here? Absolutely not! You must use the conventional name of exc_value so it is clear we're dealing with propagation of an EXCeption.

    Ok, let's come back to that signature. It's annotated to return None, that is, we evaluate for side effects. What?!? The documentation's mention of "should return a true value" makes it clear that ...) -> bool: is the appropriate return type. Did mypy really agree with this code? Sigh!

    (Yes, I understand, dropping off the end of this method implicitly returns None which is similar to return False. Typically a decorator glosses over these details.)
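    Folding those points together, a conventional version might read (a sketch that keeps the author's warn-on-error behavior):

     def __exit__(self, exc_type, exc_value, traceback) -> bool:
         if self.file:
             self.file.close()  # actually call close()
         if exc_value is not None:
             warnings.warn(f"File closed with the following error: {exc_type} - {exc_value}")
         return False  # never suppress the exception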


     def __getitem__(self, path: str = "") -> h5py.Group | h5py.Dataset: 

    Kudos, that's a very helpful signature, thank you.

     if path == "": 

    Ok, I'm starting to get the feeling that empty string is special to this class, in the way that None is often a special input parameter to lots of other classes. And that's fine, it may be good design.

    But it has to be explicitly documented. I haven't seen any """docstring""" call out its special meaning.

    I don't know the right answer, but it is possible that you might find it convenient and self-documenting to nominate some other special reserved string to play that role. This is all part of the design of a Public API, and I will be the first to confess that (A) it ain't easy, and (B) likely v2 will get it "more" right than the v1 design.


     raise KeyError("Path does not exist in the file.") 

    Some poor maintainer will read a bug report and then curse you. Prefer:

     raise KeyError("Path {path} does not exist in the file.") 

     return self.file[path][()] 

    I get the sense we will be returning an h5py.Dataset here. But I'd feel better about this code if an if or an assert was explicit about that.

    Also, I guess I'm not really understanding what's going on with the optional type annotations in this class. I see them on some methods. Indeed, we see that this method assures us path shall be a str. But what of the return type? I would like to be able to better reason about path and its de-reference.

    In general, DbC (Design by Contract) tells us that every method adheres to a loose or a tight contract. Here, I am left wondering what the read() contract is.


     conversion_fcts = { 

    I mentally read that as "conversion facts". Had I authored the code I likely would have named it conversion_fns. Consider spelling out conversion_functions.

    It seems pretty clear that extract_data() seeks to impose some post-condition. It is important to explicitly write down that promise.

    nit, typo: "doesn't" for "...file don't have..."
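    Returning to the post-condition point: one way to write that promise down is right in the docstring (a sketch; the wording is mine, not the project's):

     def extract_data(self, path: str | pathlib.Path) -> None:
         """Convert a supported SPM file and store its contents in this HDF5 file.

         Post-condition: datasets/<stem>/... and metadata/<stem>... exist
         and are populated from the source file.

         Raises:
             TypeError: if the extension has no registered conversion function.
         """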


     def apply(
         self,
         function: callable,
         inputs: list[str] | str,
         output_names: list[str] | str | None = None,

    Thank you for the type annotations. They are clear, beautiful, very helpful.

    As far as design of Public API goes, I'm not super fond of the whole "could be a list, or maybe a str" thing. Maybe that's the right design, maybe you have it exactly right, and I should pipe down, that's fine. Or maybe your users would be OK with having a singular form, and a plural form, of that method.

    The concern is this. We want to deliberately design a Public API that is "easy to use", which also means it is "hard to misuse", hard to misinterpret the result if bad things happened along the way. When we admit inputs of diverse types, maybe we are doing the caller a kindness. For example, viewing path: str as equivalent to Path(path) seems harmless enough. The flip side is, insisting on a specific type can reveal bugs as soon as they're authored. For example, a caller of def foo(a, b, c, d, e, f, path=None, g=None): might have 1 too many args, or 1 too few. Insisting that path be of type Path can be helpful to an app engineer who cares about calling it correctly, perhaps by running mypy.

    In any event the type annotations, such as for output_names, look lovely as written, and I thank you.
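    If you did go the singular/plural route, the singular form could be a thin hypothetical wrapper over the existing method:

     def apply_one(self, function: callable, input_path: str,
                   output_name: str | None = None, **kwargs):
         # Hypothetical singular form; delegates to the list-based apply().
         names = None if output_name is None else [output_name]
         return self.apply(function, [input_path], names, **kwargs)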


     if len(output_names) != len(result):
         raise Exception("Error: Unequal amount of outputs and output names")

    The """docstring""" does not anticipate this possibility, and does not warn the caller of it. How do I know? Because there is no docstring! This is an essential aspect of the Public API, one that must be documented.

    When a maintainer chases a bug report on this, the "unequal" remark will be less helpful than seeing the two integer lengths within the diagnostic.

    Do not raise Exception. Either use something more specific like ValueError, or define an app-specific exception class.
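    For example, a sketch folding in the point about showing both lengths:

     if len(output_names) != len(result):
         raise ValueError(
             f"function returned {len(result)} outputs "
             f"but {len(output_names)} output names were given"
         )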


     if increment_proc or f"{str(num_proc).zfill(3)}-{function.__name__}" not in self.read("process"): 

    That's a nice expression.

    It needs to be extracted into a helper. I can't imagine that's the only place it appears in this codebase.

     out_folder_location = f"{'process'}/{str(num_proc).zfill(3)}-{function.__name__}" 

    Oh, look! I was right! Same expression, sure enough.
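    A hypothetical helper (note that f"{num_proc:03d}" is equivalent to str(num_proc).zfill(3) for these non-negative counters):

     def _proc_name(self, num_proc: int, function: callable) -> str:
         # Single home for the "NNN-function_name" naming convention.
         return f"{num_proc:03d}-{function.__name__}"

    with the two call sites becoming self._proc_name(num_proc, function) and f"process/{self._proc_name(num_proc, function)}".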


     for name, data in zip(output_names, result): 

    When we assigned result = function(*inputs, **kwargs) I had no idea what was going on, it was pretty generic.

    But now it appears the identifier should have been results, since the API contract seems to imply some container of more than one result value.

    If (e.g. with annotations) you can encourage function to return a certain kind of result value, then please do that. Being specific is better than being vague. It helps us to reason about the code.

    You define out_folder_location, but then you only reference that plus /name. Consider changing how you assign local vars.


     "path": out_folder_location + output_name, 

    I don't understand that expression.

    My reading, which likely is wrong, is that location does not end with a / slash, and name contains no slash. It seems like we want a slash between those two.

    Sadly this submission contains no unit tests. An example input / output pair would go a long way to documenting the intended behavior of this line of code.
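    If a slash is indeed intended, the fix is a single character in an f-string:

     "path": f"{out_folder_location}/{output_name}",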


    DRY.

    In _write_generic_attributes for the pair of assignments ending with

     new_attrs["operation name"] = function.__module__ + "." + function.__name__ 

    prefer this single assignment:

     new_attrs["operation name"] = (function.__module__ or "None") + "." + function.__name__ 

    Some of this method's behavior seems "quirky", such as an assignment conditional on this:

     if function.__module__ == "__main__": 

    Describing such behavior in a """docstring""" would go a long way toward making this class easy-to-use for newly onboarded engineers just starting to work with this project.

    Embrace sphinx! Publish a "readthedocs" site.


     if isinstance(func, types.BuiltinFunctionType): 

    I fail to see why this is of interest. But I'm just flogging the same old horse that carries a banner saying "there's no docs!". You really need to document this code's behavior.

     value = "None" 

    That str seems a slightly odd choice, compared to assigning None. Ok, whatever. (It's not like a type hint on attr_dict told us to expect str values.)

     RuntimeWarning("Attribute was not able to be saved, probably because the attribute is too large") 

    I don't understand that line at all.

    We were storing to a variable of type dict. I'm not aware of a way to provoke such an error. An illuminating # comment would be helpful.

     attr_dict[f"kwargs_{key}"] = value ... attr_dict["kwargs_" + key] = "None" 

    Gratuitously using two ways to express the same thing is not a kindness to the Gentle Reader.


     def _create_dataset(self, data: tuple[str, any], overwrite=True) -> None:
         if data[0] in self.file:
             ...
         self.file.create_dataset(data[0], data=data[1])

    The identifier data is probably the right name to use, but it is rather vague. And it has structure, as the type annotation explains.

    Please use an x, y = data tuple unpack, so you can name the elements. (I would offer multi-character names, except that I don't know what appropriate names would be.)
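    Something like this, say (key and value are placeholder names, as noted):

     def _create_dataset(self, data: tuple[str, any], overwrite=True) -> None:
         key, value = data  # name the tuple's parts
         if key in self.file:
             if overwrite:
                 del self.file[key]
             else:
                 raise KeyError(f"Key {key} already exists and overwrite is set to False.")
         self.file.create_dataset(key, data=value)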


    In _write_extracted_data, you again might want to assign a path temp var. Here it might be f"datasets/{path.stem}".


    1. context manager, ... could this create conflicts with the OS trying to access the file?

    No, there's no conflict.

    But I agree with you that this class seems to be doing more than one thing. If the submission included a short test suite or other example calls we might be able to tease out two (or more!) distinct use cases that several classes could specialize in.

    2. I have the feeling that the class is doing too much

    Yes, I'm inclined to agree.

    But it's hard to evaluate without seeing the typical call pattern. Unit tests would go a long way toward illuminating that. Possibly a pair of similar classes would make sense, but it's hard for me to tell based on just the OP.

    3. not very happy with the apply function. ... in practice most of the ones passed will manipulate some kind of numpy array and also return a numpy array.

    I completely agree with that sentiment. Again, it comes down to examining how callers actually use the API.

    You proposed a pair of restrictions: that a matrix be passed in, and that a matrix be emitted as the result. I agree. Run your test suite, then impose such restrictions, and verify the tests still run Green.
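    Even just a type alias would document the tightened contract (a sketch; the name is mine, and enforcement would still be up to you or mypy):

     from typing import Callable

     import numpy as np

     # Hypothetical contract: apply() only accepts functions returning an
     # ndarray or a tuple of ndarrays.
     ArrayFunction = Callable[..., np.ndarray | tuple[np.ndarray, ...]]

    and then the signature reads def apply(self, function: ArrayFunction, ...).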

    4. not sure about the extract_data ... Is this a good way to do it?

    Yes, it looked like a good way to parameterize things to me.

    Again, """docstrings""" and example calls in a test suite will go a long way toward advising new callers about how to take advantage of what your class offers. In particular, you should carefully explain what the contract for a user-defined function is -- how the input matrix relates to the output matrix, must it be a pure function, when to complain with a fatal raise, those sort of concerns.


    This code achieves some of its design goals.

    Adding unit tests wouldn't hurt. Adding """docstrings""" is essential.

    I would not be willing to delegate or accept maintenance tasks on this codebase in its current form. But it goes deeper than that. I would be reluctant to work on code which uses the HyFile class as a dependency. Why? Low quality documentation. As a caller, it is unclear to me what contracts my inputs must adhere to, and what output behavior I can rely on.


    EDIT

    It turns out some unit tests are lurking within a feature branch.

    An attempt to import sample (predictably) fails.

    Altering sys.path at import time is not a Best Practice. Instead, chdir to the right spot before running, and use the PYTHONPATH env var to affect sys.path.

    After $ poetry add pytest-cov I got this to work:

    cd hystorian/tests/ && env PYTHONPATH=.. poetry run pytest --cov --cov-report=term-missing test_io.py ====== test session starts ======= platform darwin -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 rootdir: hystorian plugins: typeguard-2.13.3, cov-4.0.0 collected 9 items test_io.py ......... --------- coverage: platform darwin, python 3.10.10-final-0 ---------- Name Stmts Miss Cover Missing ------------------------------------------------------------------- hystorian/hystorian/__init__.py 0 0 100% hystorian/hystorian/io/__init__.py 0 0 100% hystorian/hystorian/io/gsf_files.py 34 4 88% 23-24, 59, 61 hystorian/hystorian/io/hyFile.py 154 72 53% 38, 41-46, 74, 97-98, 109, 113, 135, 139, 141, 152, 158, 168-202, 207, 212-231, 236-261, 272 hystorian/hystorian/io/ibw_files.py 44 39 11% 12-65 test_io.py 83 3 96% 115-116, 133 ------------------------------------------------------------------- TOTAL 315 118 63% ===== 9 passed in 1.45s ===== 

    I would typically record such a command in a Makefile, so $ make test will conveniently run it and so newly onboarded engineers will see it and learn from it. Similarly, $ make lint in any of my projects will tend to run something like isort . && black -S . && flake8, perhaps followed by a mypy invocation.
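    A sketch of such a Makefile, assuming the tests stay where they are (recipe lines must be indented with tabs):

     test:
     	cd tests && env PYTHONPATH=.. poetry run pytest --cov --cov-report=term-missing test_io.py

     lint:
     	isort . && black -S . && flake8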

    Sadly, the usual command of $ pytest tests/*.py does not work, since the tests want to be run from one level down.

    You will want to add tests/test_files/test.hdf5 to .gitignore.

    I briefly looked at a few tests and I like them, as they are short, boring, simple, and direct, just what we want to see in tests.

    You might wish to assign a popular constant, such as "datasets/test_data", to a variable. (Or not. Copy-n-paste in test code is perfectly fine, it's the one place you can get away with that.)

    I have not examined the uncovered lines, e.g. lines 23 and 59, and I would not necessarily shoot for 100% coverage. It's up to you. We spend time writing unit tests in order to save time, both to save an individual contributor time and to save the project time. If you feel covering a line of target code is worthwhile, go for it. If not, that's fine, as well.

    I tend to view coverage through this lens: Do I still need that line of target code? Why is it "hard" to call it, to cover it? Could I alter the Public API so testing is easier?

    \$\endgroup\$
    • Thanks a lot for your very thorough constructive comment! You will be happy to know that I am writing unit tests. You can find them here: github.com/Coilm/hystorian/tree/rework. However, I am pretty sure the current tests I wrote could deserve their own post on this forum, since it is the first time in my life that I write tests. I'll implement all your comments and keep them in mind for the future. I am new to StackExchange: what is the best practice if I want to discuss more about some of your comments?
      – CoilM
      Commented Apr 29, 2023 at 7:15
    • Adding comments here is a fine practice, at least for a couple of back-n-forths. Unlike SO, the CodeReview site tends to frown on edits once an answer has been submitted, so that readers a year from now can see matching review + reviewed code. If you feel it would be helpful, feel free to request review of specific test(s) in a new posting.
      – J_H
      Commented Apr 29, 2023 at 18:29
