Loading Files from a Directory in Haskell

Question

I'm a very new Haskell programmer, and I've written the following code for loading TSV files. I think it can be improved, but I'm not sure how.

I feel like the giant "do" block in loadData is inelegant, and that there's likely a better way to approach this.
I think I'm trying to avoid the IO monad because I'm not sure how to work with it, but that there's probably a better way to handle mapping parseTSV over the contents of the files.

One note: I'm not super concerned about performance - this will run once at the beginning of my program. I want to load in all the files entirely to build a composite data structure from their contents.

module LanguageMachine (loadData) where import System.Directory (listDirectory) import Data.List ((\\), elemIndex) import Data.Maybe (mapMaybe) parseTsv :: String -> [(String, Int)] parseTsv contents = mapMaybe parseLine (lines contents) parseLine :: String -> Maybe (String, Int) parseLine line = case elemIndex '\t' line of Just i -> let (word, count) = splitAt i line in Just (word, read count :: Int) Nothing -> Nothing loadData :: FilePath -> [FilePath] -> IO [(String, [(String, Int)])] loadData path exclude = do files <- listDirectory path let filtered = files \\ exclude let prefixed = map ((path ++ "/") ++) filtered contents <- traverse readFile prefixed let results = map parseTsv contents return $ zip filtered results

The files look like this, which are two-value TSV lines of a word and then the number of occurrences of that word:

ARIA 4308 ORE 4208

Thank you!

Fyodor Soikin · Accepted Answer · 2020-03-29 15:32:46Z

The loadData function looks just fine to me. I don't see a way it could be any more elegant given what it's doing. I do actually like the way you have separated reading the file and parsing it. That's the way to go.

Your concern about "avoiding the IO monad" shouldn't be a concern. On the contrary, that's actually a "best practice", known as "pushing I/O to the boundary". It means that the majority of your program should be pure, and only a thin shell should deal with external world. This way you can easily test and debug the majority of your program.

The only thing I might do is make the last three lines one expression:

traverse readFile prefixed <&> map parseTsv <&> zip filtered

I find this a bit more readable, but that's just a matter of personal taste.

But the parseLine function could sure use some improvement.

First, the control flow. Look what it's doing: compute a Maybe value, then if it's Just, compute another Maybe value from its contents. That's exactly what "bind" does! This means it can be nicely represented as a do block.

Second,read is a partial function. This means it can fail at runtime if there is a non-number in the file. I understand this probably Should Never Happen™, but you'd be surprised how many things that "should never happen" do, in fact, happen all the time. :-)

But since you're already in the Maybe context, it would be easy to use readMaybe instead, and skip the malformed numbers.

Applying both points:

parseLine line = do i <- elemIndex '\t' line let (word, strCount) = splitAt i line count <- readMaybe strCount return (word, count)

Thank you! The <&> operator is very handy. I'm still getting comfortable thinking in monads, so your feedback is very helpful. — Erty Seidohl, CommentedMar 29, 2020 at 20:44

Stack Exchange Network

Loading Files from a Directory in Haskell

1 Answer 1

Hot Network Questions

Loading Files from a Directory in Haskell

1 Answer 1

Related

Hot Network Questions