
I have a 'processing' function and a 'serializing' function. Currently the processor returns 4 different types of data structures to be serialized in different ways.

Looking for the best practice on what to do here.

def process(input):
    ...
    return a, b, c, d

def serialize(a, b, c):
    ...  # Different serialization patterns for each of a-c.

a, b, c, d = process(input)
serialize(a, b, c)
go_on_to_do_other_things(d)

That feels janky. Should I instead use a class where a,b,c,d are member variables?

class VeryImportantDataProcessor:
    def process(self, input):
        self.a = ...
        self.b = ...
        ...

    def serialize(self):
        s3.write(self.a)
        convoluted_serialize(self.b)
        ...

vipd = VeryImportantDataProcessor()
vipd.process(input)
vipd.serialize()

Keen to hear your thoughts on what is best here!

Note after processing and serializing, the code goes on to use variable d for further unrelated shenanigans. Not sure if that changes anything.

  • Do a, b and c get used outside of process and serialize? Or is the "point" of this code to return d, with serialization of some values as a side effect, and a, b and c migrated to the API by necessity of implementation rather than by design?
    – trent Commented Feb 24, 2021 at 21:32
  • @trentcl thanks for replying. This is a scheduled Spark data-preparation batch job, where a, b, and c are processed products of a raw data stream, serialized for other live APIs to pull down for use in their different tasks. The process function here is essentially the SQL-like data manipulation in Spark. After this stage we're done with Spark processing. d is another related subset of the data, but it goes on to additional steps (ML model training). Commented Feb 24, 2021 at 22:07
  • I don't think the class version makes any fundamental difference; it's pretty much the same as the original code, just expressed in a slightly different way. It doesn't give you much (beyond the arguably nicer look). If the bodies of your process and serialize functions were already complicated/inelegant, they are going to remain so in the class version, so it seems to me that all the class would do is sweep a minor perceived issue (of the API feeling janky) under the rug, and leave the larger issue unaddressed. – Filip Milovanović Commented Feb 25, 2021 at 17:24
  • @FilipMilovanović Thanks man. I see that now: the class version is dressing up the same thing, just making it over-engineered. I'll take a look at the constituent components and break down any inefficiencies. I appreciate the insight. Commented Feb 26, 2021 at 22:54

1 Answer


I'm not exactly clear on what you are trying to do, but like you, I start wondering whether there is a better way whenever a method has many parameters or returns a tuple of more than two elements.

Classes in Python are great for encapsulating complexity (I never use them for polymorphism).

I think it aids understanding if you have classes that wrap the data. In addition, such an object can grow or change without your having to rewrite method signatures and the tuple unpacking at every call site.

Something like:

class DataStreams:
    def __init__(self, x, y, z):
        ...

class ProcessOutput:
    def __init__(self, a, b, c, d):
        ...

Then a method:

def process(data_streams: DataStreams) -> ProcessOutput:
    ...
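
If the wrapper is pure data, the standard library's dataclasses module can generate that constructor boilerplate for you. A minimal sketch, assuming the a-d names from your question (the types and the streams variable are placeholders, not anything prescribed):

from dataclasses import dataclass

# Field names a-d are taken from the question; 'object' stands in
# for whatever the real Spark types are.
@dataclass
class ProcessOutput:
    a: object
    b: object
    c: object
    d: object

def process(data_streams) -> ProcessOutput:
    ...  # the SQL-like Spark manipulation goes here
    return ProcessOutput(a=..., b=..., c=..., d=...)

# Call sites read by name instead of by tuple position
# (serialize and go_on_to_do_other_things as in the question):
out = process(streams)
serialize(out.a, out.b, out.c)
go_on_to_do_other_things(out.d)

Accessing fields by name also means a call site that only cares about d no longer breaks if process later gains a fifth output.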

As far as the serialization is concerned, I would model it after the standard json module.

I'd also make sure that there were multiple serializers for distinct types if necessary, i.e. I wouldn't expect a serializer to return a tuple; see the sketch below.
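
In the json module's spirit, each serializer is a dump(obj, fp)-style function that writes to its destination and returns nothing. A sketch under some assumptions: convoluted_serialize is from the question (here assumed to return a string), and the dump_* names are my own:

import json

# One dump-style function per distinct type, mirroring json.dump(obj, fp):
# each writes to its destination and returns None, never a tuple.
def dump_a(a, fp):
    json.dump(a, fp)  # assuming a is plain JSON-serializable

def dump_b(b, fp):
    fp.write(convoluted_serialize(b))  # b needs its custom format

# Usage, with 'out' from the process() sketch above:
with open("a.json", "w") as fp:
    dump_a(out.a, fp)

Keeping the signatures uniform makes it easy to swap destinations (local file, S3 object) without touching the serialization logic itself.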
