4
\$\begingroup\$

I'm a very new Haskell programmer, and I've written the following code for loading TSV files. I think it can be improved, but I'm not sure how.

  1. I feel like the giant "do" block in loadData is inelegant, and that there's likely a better way to approach this.
  2. I think I'm trying to avoid the IO monad because I'm not sure how to work with it, but that there's probably a better way to handle mapping parseTSV over the contents of the files.

One note: I'm not super concerned about performance - this will run once at the beginning of my program. I want to load in all the files entirely to build a composite data structure from their contents.

module LanguageMachine (loadData) where import System.Directory (listDirectory) import Data.List ((\\), elemIndex) import Data.Maybe (mapMaybe) parseTsv :: String -> [(String, Int)] parseTsv contents = mapMaybe parseLine (lines contents) parseLine :: String -> Maybe (String, Int) parseLine line = case elemIndex '\t' line of Just i -> let (word, count) = splitAt i line in Just (word, read count :: Int) Nothing -> Nothing loadData :: FilePath -> [FilePath] -> IO [(String, [(String, Int)])] loadData path exclude = do files <- listDirectory path let filtered = files \\ exclude let prefixed = map ((path ++ "/") ++) filtered contents <- traverse readFile prefixed let results = map parseTsv contents return $ zip filtered results 

The files look like this, which are two-value TSV lines of a word and then the number of occurrences of that word:

ARIA 4308 ORE 4208 

Thank you!

\$\endgroup\$

    1 Answer 1

    4
    \$\begingroup\$

    The loadData function looks just fine to me. I don't see a way it could be any more elegant given what it's doing. I do actually like the way you have separated reading the file and parsing it. That's the way to go.

    Your concern about "avoiding the IO monad" shouldn't be a concern. On the contrary, that's actually a "best practice", known as "pushing I/O to the boundary". It means that the majority of your program should be pure, and only a thin shell should deal with external world. This way you can easily test and debug the majority of your program.

    The only thing I might do is make the last three lines one expression:

    traverse readFile prefixed <&> map parseTsv <&> zip filtered 

    I find this a bit more readable, but that's just a matter of personal taste.


    But the parseLine function could sure use some improvement.

    First, the control flow. Look what it's doing: compute a Maybe value, then if it's Just, compute another Maybe value from its contents. That's exactly what "bind" does! This means it can be nicely represented as a do block.

    Second,read is a partial function. This means it can fail at runtime if there is a non-number in the file. I understand this probably Should Never Happen™, but you'd be surprised how many things that "should never happen" do, in fact, happen all the time. :-)

    But since you're already in the Maybe context, it would be easy to use readMaybe instead, and skip the malformed numbers.

    Applying both points:

    parseLine line = do i <- elemIndex '\t' line let (word, strCount) = splitAt i line count <- readMaybe strCount return (word, count) 
    \$\endgroup\$
    2
    • \$\begingroup\$Thank you! The <&> operator is very handy. I'm still getting comfortable thinking in monads, so your feedback is very helpful.\$\endgroup\$CommentedMar 29, 2020 at 20:44
    • \$\begingroup\$Glad I could help!\$\endgroup\$CommentedMar 30, 2020 at 7:42

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.