2
\$\begingroup\$

I have completed an ETL project to collect, parse and load files. I decided to do it in a clean OOP way using interfaces and abstract classes, but I have some questions below.

    Sub Main()
        Dim collectionOfParsers As New List(Of EtlParser)

        Dim xmlparser1 As New XmlParser
        Dim xmlparser2 As New XmlParser
        Dim xmlparser3 As New XmlParser
        Dim txtparser1 As New TxtParser
        Dim txtparser2 As New TxtParser

        collectionOfParsers.Add(xmlparser1)
        collectionOfParsers.Add(xmlparser2)
        collectionOfParsers.Add(xmlparser3)
        collectionOfParsers.Add(txtparser1)
        collectionOfParsers.Add(txtparser2)

        For Each parser As EtlParser In collectionOfParsers
            parser.SaySomething()

            Dim canOpenFiles = TryCast(parser, ICanOpenFiles)
            If (canOpenFiles IsNot Nothing) Then
                canOpenFiles.OpenFiles()
            End If

            Dim canReadFiles = TryCast(parser, ICanReadFiles)
            If (canReadFiles IsNot Nothing) Then
                canReadFiles.Readfiles()
            End If

            Dim canTransFiles = TryCast(parser, ICanTransformFiles)
            If (canTransFiles IsNot Nothing) Then
                canTransFiles.TransformFile()
            End If

            Dim canSaveFiles = TryCast(parser, ICanSaveFiles)
            If (canSaveFiles IsNot Nothing) Then
                canSaveFiles.Savefiles()
            End If
        Next
    End Sub

    Public MustInherit Class Etl
    End Class

    Public MustInherit Class EtlParser : Inherits Etl

        Protected Sub CanParse()
            Console.WriteLine("Yes")
        End Sub

        Protected Overridable Sub SaySomething()
            Console.WriteLine("EtlParser say something")
        End Sub

        Protected MustOverride Sub CanParseFormat()
    End Class

    Public Interface ICanOpenFiles
        Sub OpenFiles()
    End Interface

    Public Interface ICanReadFiles
        Sub Readfiles()
    End Interface

    Public Interface ICanSaveFiles
        Sub Savefiles()
    End Interface

    Public Interface ICanTransformFiles
        Sub TransformFile()
    End Interface

    Public Class XmlParser : Inherits EtlParser
        Implements ICanOpenFiles, ICanReadFiles, ICanTransformFiles, ICanSaveFiles

        Public Sub OpenFiles() Implements ICanOpenFiles.OpenFiles
            Throw New NotImplementedException()
        End Sub

        Public Sub Readfiles() Implements ICanReadFiles.Readfiles
            Throw New NotImplementedException()
        End Sub

        Public Sub TransformFile() Implements ICanTransformFiles.TransformFile
            Throw New NotImplementedException()
        End Sub

        Public Sub Savefiles() Implements ICanSaveFiles.Savefiles
            Throw New NotImplementedException()
        End Sub

        Protected Overrides Sub CanParseFormat()
            Throw New NotImplementedException()
        End Sub

        Protected Overrides Sub SaySomething()
            'MyBase.SaySomething()
            Console.WriteLine("XmlParser say something")
        End Sub
    End Class

    Public Class CsvParser : Inherits EtlParser
        Implements ICanOpenFiles, ICanReadFiles, ICanTransformFiles, ICanSaveFiles

        Public Sub OpenFiles() Implements ICanOpenFiles.OpenFiles
            Throw New NotImplementedException()
        End Sub

        Public Sub Readfiles() Implements ICanReadFiles.Readfiles
            Throw New NotImplementedException()
        End Sub

        Public Sub TransformFile() Implements ICanTransformFiles.TransformFile
            Throw New NotImplementedException()
        End Sub

        Public Sub Savefiles() Implements ICanSaveFiles.Savefiles
            Throw New NotImplementedException()
        End Sub

        Protected Overrides Sub CanParseFormat()
            Throw New NotImplementedException()
        End Sub

        Protected Overrides Sub SaySomething()
            'MyBase.SaySomething()
            Console.WriteLine("CsvParser say something")
        End Sub
    End Class

Q1: Once I collect the files from the network drive (this will be done by a Collector later on), should the XmlParser class handle many files or just one? With the second option, as you can see, I have already created several XmlParser instances (one instance per file). However, I am not sure about this; maybe I should have one XmlParser prepared for all files and call it just once?

Q2: In the For Each loop I typed the loop variable as the common type EtlParser so that I can pass different specific parsers (is that OK, by the way?). Can you explain how it is possible that a specific parser is still seen as its own type inside the loop? For instance, I passed an XmlParser and inside the loop I still see it as an XmlParser. I thought that when a specific parser such as XmlParser is passed through a variable of its parent type EtlParser, it becomes an EtlParser and I have to cast it back to XmlParser inside the loop. I would like to understand that.

Q3: As far as I know, the purpose of interfaces is to "provide common functionality to unrelated classes". What is the real benefit in my example code, given that all of my specific parsers use the same interfaces in the end? They can all open, read, transform and save...

Q4: As you can see, I have 3 specific parser classes: CsvParser, XmlParser and TxtParser, inheriting from their base class EtlParser. Wouldn't it be better to make one parser class and instead create interfaces IXml, ITxt and ICsv to be implemented? At the moment I think what I have is proper.

Q5: Why can't I call parser.SaySomething() in the Main method? When I inspect the parser item, it shows exactly the correct type.

Q6: Any other ideas or advice about my current code?

\$\endgroup\$

    1 Answer

    3
    \$\begingroup\$

    Q1: It takes nanoseconds to create an object and milliseconds to access a file; i.e. roughly one million times longer! Don't try to optimize things that will have absolutely no noticeable effect at the expense of clarity!

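    Q2: The object never changes its type; only the declared type of the variable is EtlParser, which is why you still see an XmlParser inside the loop. Since XmlParser has no methods specific to XmlParser (i.e. existing only in XmlParser), there is no advantage in casting the object to it. But since the base class EtlParser does not implement the interfaces, you must cast the object to these interfaces (which is what you are doing).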
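    To make the Q2 point concrete, here is a minimal sketch. All names in it (ParserBase, XmlLikeParser, PlainParser, ICanOpen, Q2Demo) are made up for illustration and are not part of the question's code. It shows that an object accessed through a base-typed variable keeps its run-time type, and that TryCast returns Nothing when the object does not implement the interface.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical interface, standing in for ICanOpenFiles etc.
Public Interface ICanOpen
    Function Open() As String
End Interface

' Hypothetical base class, standing in for EtlParser.
Public MustInherit Class ParserBase
    Public Overridable Function Describe() As String
        Return "ParserBase"
    End Function
End Class

Public Class XmlLikeParser
    Inherits ParserBase
    Implements ICanOpen

    Public Overrides Function Describe() As String
        Return "XmlLikeParser"
    End Function

    Public Function Open() As String Implements ICanOpen.Open
        Return "opened"
    End Function
End Class

' A parser that does not implement the interface.
Public Class PlainParser
    Inherits ParserBase
End Class

Public Module Q2Demo
    ' The parameter is typed as the base class, but the call is dispatched
    ' to the override of the object's actual (run-time) type.
    Public Function DescribeViaBase(parser As ParserBase) As String
        Return parser.Describe()
    End Function

    ' TryCast yields Nothing instead of throwing when the cast is not possible.
    Public Function TryOpen(parser As ParserBase) As String
        Dim canOpen = TryCast(parser, ICanOpen)
        If canOpen Is Nothing Then
            Return "cannot open"
        End If
        Return canOpen.Open()
    End Function

    Public Sub Main()
        Dim parsers As New List(Of ParserBase) From {New XmlLikeParser(), New PlainParser()}
        For Each p As ParserBase In parsers
            Console.WriteLine($"{p.GetType().Name}: {DescribeViaBase(p)}, {TryOpen(p)}")
        Next
    End Sub
End Module
```

    The loop variable is declared as ParserBase in both iterations, yet GetType() and Describe() report the concrete class, because a reference of the base type merely restricts which members the compiler lets you call.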

    Q3, Q4, Q6: This is one possible approach. I will suggest another one.

    Q5: SaySomething() is Protected, which means that it is only visible within the class defining it and its descendants. Make it Public.


    Critique: Your interfaces make operations like opening files public. The caller then must know whether these operations are available and call them. But this is a technical implementation detail which should be kept private. A public interface should concentrate on the desired high-level logic, i.e. read data, transform data and maybe write data.

    Suggestion: I would choose a more flexible approach allowing you to compose parsers from single components (like Lego bricks). Define this set of interfaces:

        Public Interface IDataSource(Of T)
            Function Read() As IEnumerable(Of T)
        End Interface

        Public Interface ITransformer(Of TSource, TResult)
            Function Transform(ByVal source As IEnumerable(Of TSource)) As IEnumerable(Of TResult)
        End Interface

        Public Interface IDataSink(Of T)
            Sub Write(ByVal data As IEnumerable(Of T))
        End Interface

    The idea is to implement these interfaces by different classes. You would have one class for an XML-data-source, one for a file-data-source, one for a transformation, etc.

    A data source can be a text file, an XML file, a database, or a dummy data source for test purposes. It is the data source's responsibility to open, read and close files etc. You don't need separate interfaces for all these operations.

    Note that file names and connection strings can be passed as constructor parameters and don't need to be specified in the interfaces.

    Define classes that serve as transport vehicles for single data records, like RawData, PreProcessedData and RefinedData, and use them as generic type arguments for the interfaces. You will probably choose names for these classes that are better suited to your specific problem.

    You can even chain several transformations like this:

    read >>(RawData)>> transform 1 >>(PreProcessedData)>> transform 2 >>(RefinedData)>> write 
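    A chain like this could be sketched as follows. The record classes and the two transformers (ParseTransformer, LabelTransformer) are hypothetical and only serve to show how the output type of one step becomes the input type of the next; the ITransformer interface is repeated so that the snippet compiles on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Interface ITransformer(Of TSource, TResult)
    Function Transform(source As IEnumerable(Of TSource)) As IEnumerable(Of TResult)
End Interface

' Hypothetical record types for the three stages of the chain.
Public Class RawData
    Public Property Text As String
End Class

Public Class PreProcessedData
    Public Property Value As Integer
End Class

Public Class RefinedData
    Public Property Label As String
End Class

' Transform 1: parses the raw text and silently skips records that are not integers.
Public Class ParseTransformer
    Implements ITransformer(Of RawData, PreProcessedData)

    Public Iterator Function Transform(source As IEnumerable(Of RawData)) As IEnumerable(Of PreProcessedData) _
        Implements ITransformer(Of RawData, PreProcessedData).Transform
        For Each record As RawData In source
            Dim parsed As Integer
            If Integer.TryParse(record.Text, parsed) Then
                Yield New PreProcessedData With {.Value = parsed}
            End If
        Next
    End Function
End Class

' Transform 2: turns each number into an "even"/"odd" label.
Public Class LabelTransformer
    Implements ITransformer(Of PreProcessedData, RefinedData)

    Public Iterator Function Transform(source As IEnumerable(Of PreProcessedData)) As IEnumerable(Of RefinedData) _
        Implements ITransformer(Of PreProcessedData, RefinedData).Transform
        For Each record As PreProcessedData In source
            Yield New RefinedData With {.Label = If(record.Value Mod 2 = 0, "even", "odd")}
        Next
    End Function
End Class

Public Module ChainDemo
    ' The result of the first transformer is fed directly into the second one.
    Public Function Run(lines As IEnumerable(Of String)) As List(Of String)
        Dim raw = lines.Select(Function(s) New RawData With {.Text = s})
        Dim step1 = New ParseTransformer().Transform(raw)
        Dim step2 = New LabelTransformer().Transform(step1)
        Return step2.Select(Function(r) r.Label).ToList()
    End Function

    Public Sub Main()
        For Each label In Run({"1", "x", "4"})
            Console.WriteLine(label)
        Next
    End Sub
End Module
```

    Because every step only depends on IEnumerable(Of T), either transformer could be replaced or reused in another chain without touching the other one.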

    One advantage of this approach is that you can apply the same transformations to different types of data sources (having the same TSource) and store the result into different types of destinations (having the same TResult).

    Note: Iterators (Visual Basic) will help you to implement these interfaces.


    Let's make a very simple example. We have a CSV-File with a name column and two number columns. We want to transform this file into another one containing the name column and one number column containing the sum of the two numbers.

    Input file:

    Joe,3,4
    Mike,6,2
    Sue,10,3

    Expected output file:

    Joe,7
    Mike,8
    Sue,13

    We need two data classes

        Public Class InputData
            Public Property Name As String
            Public Property X As Integer
            Public Property Y As Integer
        End Class

        Public Class OutputData
            Public Property Name As String
            Public Property Sum As Integer
        End Class

    A reader

        Imports System.IO

        Public Class ExampleCsvReader
            Implements IDataSource(Of InputData)

            Private m_filename As String

            ' The file name is passed in through the constructor.
            Public Sub New(ByVal filename As String)
                m_filename = filename
            End Sub

            Public Iterator Function Read() As IEnumerable(Of InputData) _
                Implements IDataSource(Of InputData).Read

                For Each line As String In File.ReadLines(m_filename)
                    Dim parts = line.Split(","c)
                    ' Lines that do not have exactly three columns are skipped.
                    If parts.Length = 3 Then
                        Yield New InputData With {.Name = parts(0), _
                            .X = CInt(parts(1)), .Y = CInt(parts(2))}
                    End If
                Next
            End Function
        End Class

    A transformer

        Public Class ExampleTransformer
            Implements ITransformer(Of InputData, OutputData)

            Public Iterator Function Transform(source As IEnumerable(Of InputData)) _
                As IEnumerable(Of OutputData) _
                Implements ITransformer(Of InputData, OutputData).Transform

                For Each record As InputData In source
                    Yield New OutputData With {.Name = record.Name, .Sum = record.X + record.Y}
                Next
            End Function
        End Class

    A writer

        Imports System.IO

        Public Class ExampleCsvWriter
            Implements IDataSink(Of OutputData)

            Private m_filename As String

            Public Sub New(ByVal filename As String)
                m_filename = filename
            End Sub

            Public Sub Write(data As IEnumerable(Of OutputData)) _
                Implements IDataSink(Of OutputData).Write

                ' The Using block closes the file even if an exception occurs.
                Using sw As StreamWriter = File.CreateText(m_filename)
                    For Each record As OutputData In data
                        sw.WriteLine($"{record.Name},{record.Sum}")
                    Next
                End Using
            End Sub
        End Class

    And finally we can stitch the parts together

        Dim reader = New ExampleCsvReader(inputFile)
        Dim transformer = New ExampleTransformer()
        Dim writer = New ExampleCsvWriter(outputFile)

        Dim inputData = reader.Read()
        Dim outputData = transformer.Transform(inputData)
        writer.Write(outputData)

    Generic solution: This approach also lets you realize a more generic solution. You are free to create generic readers that, for instance, return data in a dictionary. The data type could be a Dictionary(Of String, Object) storing property name/value pairs, so a reader could implement an IDataSource(Of Dictionary(Of String, Object)).
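    As a rough sketch of such a generic reader: the class name DictionaryCsvReader and the idea of passing the column names to the constructor are assumptions made for this illustration, not something prescribed above. All values are kept as strings; converting them is left to a later transformation step. The IDataSource interface is repeated so that the snippet compiles on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq

Public Interface IDataSource(Of T)
    Function Read() As IEnumerable(Of T)
End Interface

' Hypothetical generic reader: one class for any comma-separated layout.
Public Class DictionaryCsvReader
    Implements IDataSource(Of Dictionary(Of String, Object))

    Private ReadOnly m_filename As String
    Private ReadOnly m_columns As String()

    ' Assumption: the caller supplies the column names, in file order.
    Public Sub New(filename As String, ParamArray columns As String())
        m_filename = filename
        m_columns = columns
    End Sub

    Public Iterator Function Read() As IEnumerable(Of Dictionary(Of String, Object)) _
        Implements IDataSource(Of Dictionary(Of String, Object)).Read
        For Each line As String In File.ReadLines(m_filename)
            Dim parts = line.Split(","c)
            ' Lines with an unexpected number of columns are skipped.
            If parts.Length = m_columns.Length Then
                Dim record As New Dictionary(Of String, Object)
                For i As Integer = 0 To parts.Length - 1
                    record(m_columns(i)) = parts(i)
                Next
                Yield record
            End If
        Next
    End Function
End Class

Public Module GenericReaderDemo
    Public Sub Main()
        Dim tempFile As String = Path.GetTempFileName()
        Try
            Dim lines As String() = {"Joe,3,4", "Mike,6,2", "Sue,10,3"}
            File.WriteAllLines(tempFile, lines)

            Dim reader = New DictionaryCsvReader(tempFile, "Name", "X", "Y")
            For Each record In reader.Read()
                Console.WriteLine(CStr(record("Name")) & ": " & CStr(record("X")) & " + " & CStr(record("Y")))
            Next
        Finally
            File.Delete(tempFile)
        End Try
    End Sub
End Module
```

    The price of this flexibility is that typos in column names and wrong value types are only detected at run time, whereas dedicated classes like InputData are checked by the compiler.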

    VB specific: The Yield statement is like a Return statement that returns a value, but unlike the latter, it does not exit the function and continues its execution to return the next value of the enumeration, and so on, until the end of the function is reached.
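    The following small sketch (the names YieldDemo, CountTo and Log are invented for the illustration) shows this behaviour: calling the iterator function runs none of its code, and each value is produced only when the consuming loop asks for it.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Module YieldDemo
    ' Records the order in which values are produced and consumed.
    Public ReadOnly Log As New List(Of String)

    Public Iterator Function CountTo(limit As Integer) As IEnumerable(Of Integer)
        For i As Integer = 1 To limit
            Log.Add("producing " & i.ToString())
            Yield i   ' hands i to the caller and resumes here on the next request
        Next
    End Function

    Public Sub Main()
        Dim numbers = CountTo(3)
        ' The iterator body has not started yet, so the log is still empty.
        Console.WriteLine("Entries before enumeration: " & Log.Count.ToString())

        For Each n In numbers
            Log.Add("consuming " & n.ToString())
        Next

        For Each entry In Log
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

    The log alternates between "producing" and "consuming" entries, which is why a reader implemented this way never has to hold the whole file in memory.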

    Besides iterators, I also used Object Initializers, String Interpolation (Point 12.) and the Using Statement.

    \$\endgroup\$
    9
    • \$\begingroup\$First of all, thank you very much Olivier for taking the time to help me out. To be honest with you, I read your post several times and can't get the full picture of the proposed solution, which seems to be very good. What is not clear to me: do you propose to have one class inheriting from EtlParser, for instance MainParser, for all the different sources, which implements all your proposed interfaces? As for now I have XmlParser/CsvParser/TxtParser; do you mean to make one class for all of them and implement your interfaces?\$\endgroup\$
      – Arie
      Commented Sep 16, 2017 at 20:38
    • \$\begingroup\$Would it be possible to extend your answer to show the whole change to my solution, so that I could understand it fully? I also do not get the iterator approach that could help me in the solution, and would appreciate seeing it in an example. It would also be best to see where (you mentioned constructors) or how a specific file path should be passed and, more interestingly, where specific parser functionality, like that for parsing CSVs, TXTs and databases, should be placed within the solution. If you could put it all in one piece I would really appreciate it. Thanks so much, dude!\$\endgroup\$
      – Arie
      Commented Sep 16, 2017 at 20:41
    • \$\begingroup\$Looks nice. However, you said before that "You would have one class for an XML-data-source, one for a file-data-source..". From what I see, if I have a different CSV file than the one shown in your example, I'd have to implement a new class, e.g. ExampleCsvReaderX. That would mean each different file structure requires a new ExampleCsvReaderX class plus InputDataX and OutputDataX classes for it. Am I right about this? Also, for a different transformation a new ExampleTransformerX has to be created. If I am not correct, can you show, based on your example, how a new, different CSV file would be implemented?\$\endgroup\$
      – Arie
      Commented Sep 18, 2017 at 9:58
    • \$\begingroup\$P.S. I can make a new topic if you like. Let me know. I have already marked this as the answer. Thanks so much Olivier, your help is invaluable to me.\$\endgroup\$
      – Arie
      Commented Sep 18, 2017 at 9:59
    • \$\begingroup\$Well, you are free to create generic readers that for instance return data in a dictionary. The data type could be a Dictionary(Of String, Object) for instance, storing property name/value pairs.\$\endgroup\$
      Commented Sep 18, 2017 at 10:10
