I want to understand what is considered best practice for aligning with OOP when handling relational databases. I cannot find any online examples where classes and a more maintainable/reusable design are implemented for relational databases.
Say we have a few different relational database (CSV table) files:
- Environment: primary_id (PK), event_id (FK), datetime (SK), weather, temperature, temperature_unit
- Incident: primary_id (PK), event_id (FK), datetime (SK), incident_type, incident_category, severity, reported_via
- Reporter: primary_id (PK), event_id (FK), datetime (SK), name, sex, occupation
where PK: primary keys, FK: foreign keys, SK: secondary keys.
A snapshot of the CSV table files, showing a few records per table as an example:
Environment:
- 1001678125, 10016781, 202401251123, sunny, 19, celsius
- 1024692335, 10246923, 202406120958, cloudy, 15, celsius
Incident:
- 1001678125, 10016781, 202401251123, main_entrance_breach, B, serious, police
- 1024692335, 10246923, 202406120958, side_entrance_breach, C, mild, reception
Reporter:
- 1001678125, 10016781, 202401251123, Joe, M, police officer
- 1024692335, 10246923, 202406120958, Dave, M, reception staff
The goal
The application would read each table, load it into an appropriate dataframe (e.g. `pandas` in Python), and have a variety of methods that operate on a selection of the data records and carry out calculations on them. For example, one calculation method could count the number of serious incidents that occurred after 11pm where the police were involved.
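As a rough sketch of that example in plain pandas (assuming the CSV files carry a header row and that the `datetime` values use the `YYYYMMDDHHMM` format shown above):

```python
import pandas as pd

# Count serious incidents reported after 11pm via the police.
incidents = pd.read_csv("incident.csv")
incidents["datetime"] = pd.to_datetime(
    incidents["datetime"].astype(str), format="%Y%m%d%H%M"
)

mask = (
    (incidents["severity"] == "serious")
    & (incidents["datetime"].dt.hour >= 23)   # "after 11pm"
    & (incidents["reported_via"] == "police")
)
print(mask.sum())
```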
My idea of having each element (column header) as its own class was to enable the application to present more detailed information and carry out extra calculations appropriate for each element. For example, the `temperature` (in combination with the `temperature_unit`) could have class methods to calculate the average temperature, methods to categorise values (below 0 with a celsius unit to be labelled as "freezing"), etc.
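For illustration, a minimal sketch of the kind of per-element behaviour I mean (the class shape, unit names, and method names here are assumptions for the example, not fixed parts of the design):

```python
class Temperature:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit  # e.g. "celsius" or "fahrenheit"

    def in_celsius(self):
        # Normalise to celsius so values recorded in different units compare.
        if self.unit == "fahrenheit":
            return (self.value - 32) * 5 / 9
        return self.value

    def category(self):
        # Below 0 with a celsius unit is labelled "freezing", per the example above.
        return "freezing" if self.in_celsius() < 0 else "normal"
```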
But also, using OOP should enable the application to extend its functionality (more specific methods/calculations) or adapt to cases where the database file might have an extra column added to it.
This is why I would like to use a proper object-oriented design and define classes for my key objects.
--
The OOP design I thought of is the following:
An `Element` parent (base) class to capture the high-level details of what information each element (column header) would hold: `name`, `abbreviated_name`, `value`, `description`, `data_file_used_in`, `column_pos_in_data_file`, `notes`.
Create more low-level, specific child (derived) classes, one for each specific element (column header), that inherit from the `Element` class. For example, a `primary_id` class, an `event_id` class, a `datetime` class, etc., each providing specific values for the `Element` class fields.
Where appropriate, create classes for specific data types. For example, the `incident_category` element can only take one of the values `A`, `B`, or `C`.
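One idiomatic way to express such a closed set of values in Python is an `Enum`, which rejects unknown codes on construction (an alternative sketch I am considering, not part of the pseudocode further down):

```python
from enum import Enum

class IncidentCategoryCode(Enum):
    A = "A"
    B = "B"
    C = "C"

IncidentCategoryCode("B")  # OK
IncidentCategoryCode("D")  # raises ValueError
```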
Then, create a `DatabaseFile` parent (base) class to capture the details of each database file: `filename`, `size`, and `element(s)`, where `element(s)` is all of the element column headers found in a database file.
Finally, create child (derived) classes for each of the three specific database files: `Environment`, `Incident`, and `Reporter`. Each will hold specific values for that file as well as list all elements (column headers) found in it.
Initially, the data from each CSV file will be loaded into an appropriate data type, for example using the `pandas` package in Python (`df = pandas.read_csv('environment.csv')`), and an instance of the `Environment` class will be initialised. The intention is that each child class will hold specific methods to manipulate its data.
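A minimal sketch of that loading step (holding the dataframe on the file object and the `from_csv` constructor are assumptions of mine, not requirements of the design):

```python
import os
import pandas as pd

class DatabaseFile:
    def __init__(self, filename, size, df):
        self.filename = filename
        self.size = size
        self.df = df  # the loaded table

class Environment(DatabaseFile):
    @classmethod
    def from_csv(cls, filename):
        # Load the CSV once and wrap it in the class that knows this file's elements.
        df = pd.read_csv(filename)
        return cls(filename, os.path.getsize(filename), df)

env = Environment.from_csv('environment.csv')
```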
--
Using Python pseudocode, the above would look something like:

```python
class Element:
    def __init__(self, name, abbreviated_name, value, description,
                 data_file_used_in, column_pos_in_data_file, notes):
        self.name = name
        self.abbreviated_name = abbreviated_name
        self.value = value
        self.description = description
        self.data_file_used_in = data_file_used_in
        self.column_pos_in_data_file = column_pos_in_data_file
        self.notes = notes


class Temperature(Element):
    def __init__(self, name, abbreviated_name, value, description,
                 data_file_used_in, column_pos_in_data_file, notes):
        super().__init__(name, abbreviated_name, value, description,
                         data_file_used_in, column_pos_in_data_file, notes)


class IncidentCategoryCode:
    VALID_CODES = {"A", "B", "C"}

    def __init__(self, code):
        if code not in self.VALID_CODES:
            raise ValueError(f"invalid incident category code: {code!r}")
        self.code = code


class IncidentCategory(Element):
    def __init__(self, name, abbreviated_name, value, description,
                 data_file_used_in, column_pos_in_data_file, notes, category_code):
        super().__init__(name, abbreviated_name, value, description,
                         data_file_used_in, column_pos_in_data_file, notes)
        self.category_code = IncidentCategoryCode(category_code)

# (...)

class DatabaseFile:
    def __init__(self, filename, size):
        self.filename = filename
        self.size = size


class Environment(DatabaseFile):
    def __init__(self, filename, size, *element_args):
        super().__init__(filename, size)
        # Unpack the six Environment elements in column order.
        (self.primary_id, self.event_id, self.datetime,
         self.weather, self.temperature, self.temperature_unit) = element_args

# (...)
```
--
Summary
In short, below is a mapping of the classes to the databases/elements referenced in the Python pseudocode example:
- `Element` is a parent class not mapped to a database file.
- `Temperature` is a child class of the `Element` class that maps to the `temperature` element (column header) of the Environment database.
- `IncidentCategoryCode` is a data type for the `IncidentCategory` class.
- `IncidentCategory` is a child class of the `Element` class that maps to the `incident_category` element (column header) of the Incident database.
- `DatabaseFile` is a parent class not mapped to a database file.
- `Environment` is a child class of the `DatabaseFile` class that maps to the Environment database.
--
Questions
Is the above design/implementation sensible, or is there a better (best-practice) way to design it? Also, should the primary/secondary/foreign key details be captured in the design?
For example, is it better to read all the tables and combine all the data into a single dataframe (joining on the keys), or should one maintain the separation and create three dataframes, each holding the data found in its corresponding CSV table file?
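For the single-dataframe option, I imagine something like this (a sketch, assuming each CSV carries a header row and the key columns are shared across the three files):

```python
import pandas as pd

environment = pd.read_csv("environment.csv")
incident = pd.read_csv("incident.csv")
reporter = pd.read_csv("reporter.csv")

# Join on the columns the three tables share.
keys = ["primary_id", "event_id", "datetime"]
combined = environment.merge(incident, on=keys).merge(reporter, on=keys)
```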