Parsing expressions with combinators

Question

I'm very new to Haskell's ReadP library, and the entire concept of parser combinators, so I was wondering whether there are better ways to do some things in this program:

import Text.ParserCombinators.ReadP import Control.Applicative import Data.List data Operator = Add | Subtract instance Show Operator where show Add = "+" show Subtract = "-" instance Read Operator where readsPrec _ "+" = [(Add, "")] readsPrec _ "-" = [(Subtract, "")] readsPrec _ _ = [] data Expression = Number Int | Infix { left :: Expression, op :: Operator, right :: Expression } instance Show Expression where show (Number x) = show x show (Infix left op right) = "(" ++ (show left) ++ " " ++ (show op) ++ " " ++ (show right) ++ ")" digit :: ReadP Char digit = satisfy $ \char -> char >= '0' && char <= '9' number :: ReadP Expression number = fmap (Number . read) (many1 digit) operator :: ReadP Operator operator = fmap read (string "+" <|> string "-") expression :: ReadP Expression expression = do skipSpaces left <- number skipSpaces op <- Control.Applicative.optional operator case op of Nothing -> return left Just op -> do skipSpaces right <- expression return (Infix left op right) parseExpression :: String -> Maybe Expression parseExpression input = case readP_to_S expression input of [] -> Nothing xs -> (Just . fst . last) xs

The main area where I'm looking for improvements is the expression function, and specifically the things which seem improvable are the repeated calls to skipSpaces and the case expression to check whether an operator was parsed, but of course if you notice anything else that would be helpful too!

K. A. Buhr · Accepted Answer · 2019-04-26 01:06:07Z

First, a standard approach for dealing with the skipSpaces issue is to define a higher-order parser combinator, traditionally called lexeme:

lexeme :: ReadP a -> ReadP a lexeme p = p <* skipSpaces

Here, lexeme takes a space-naive parser p, and converts it into a new parser that parses whatever p was planning to parse, and then reads and discards any trailing spaces. You use lexeme in the definitions of any of your parsers that would reasonably be assumed to read a complete "lexeme" and ignore any trailing space. For example, number should be a lexeme parser:

number :: ReadP Expression number = lexeme $ fmap (Number . read) (many1 digit)

So should operator, though obviously not digit! expression won't need to use lexeme, because we'll arrange to end it with a lexeme parser.

It's also helpful to define a symbol parser which is essentially string that ignores trailing spaces:

symbol :: String -> ReadP String symbol = lexeme . string

Consistently used, lexeme (and symbol) will deal with all unwanted spaces, other than any leading spaces at the very start of your parse. If you have a non-recursive "top level" grammar production, like say a parser for program :: ReadP Program, then you would probably deal with them there. In your example, you don't have such a production (e.g., expression is recursive), so you'd stick an extra skipSpaces in parseExpression. This is also a good place to put eof to make sure there isn't any trailing material that you aren't parsing:

parseExpression :: String -> Maybe Expression parseExpression input = case readP_to_S (skipSpaces *> expression <* eof) input of [] -> Nothing xs -> (Just . fst . last) xs

Second, your use of a Read instance for parsing your operators is very unusual. It would be more standard to make it a helper in the operator parser, writing something like:

operator :: ReadP Operator operator = readSymbol <$> (symbol "+" <|> symbol "-") where readSymbol "+" = Add readSymbol "-" = Subtract

(though an even more standard version is given below).

Third, in expression, you can avoid the case construct by using alternation (<|>) like so:

expression' :: ReadP Expression expression' = do left <- number (do op <- operator right <- expression return (Infix left op right) <|> return left)

This would be the standard approach for non-parallel parser libraries (e.g., Parsec or Megaparsec). For ReadP, it's better to replace the (<|>) operator with the ReadP-specific (<++) operator to avoid also following the unwanted second parse in parallel. Beware that (<++) has higher precedence than (<|>), so some extra parentheses might be needed if it's being used in combination with other operators, as in the examples below.

Fourth, you've probably noticed my use of the applicative operators <* and *> and the alias <$> for fmap in the code above. It is very common to use these -- plus the additional applicative operator <*> and sometimes the operators <**> or <$ -- in parsers. Once you get used to them, they tend to lead to less cluttered code.

For example, a more standard way of writing expression would be:

expression' :: ReadP Expression expression' = Infix <$> number <*> operator <*> expression <|> number

or the slightly more efficient solution:

expression :: ReadP Expression expression = do left <- number (Infix left <$> operator <*> expression) <++ return left

Note that, in the context of parsers, an expression like f <$> p <*> q means "try to run the parser p, and then the parser q; assuming they both succeed, pass their return values to f". In other words, that Infix expression is essentially:

Infix left op right

where op is the return value from the parser operator and right is the return value from the parser expression.

Similarly, the standard way of writing operator is actually:

operator :: ReadP Operator operator = Add <$ symbol "+" <|> Subtract <$ symbol "-"

This one requires an additional word of explanation. The operator <$ is kind of an odd duck. It's type signature is:

(<$) :: a -> f b -> f a

but in the context of parsers specifically, the meaning of x <$ p is "try to run the parser p; if it succeeds, ignore its return value and return x". Basically, it's used to replace the return value of a parser that's used only for its success or failure and not its return value.

Note that these versions of expression, like your original version, treat the operators as right associative. This may be a problem if you're trying to parse "1-2-3" as equivalent to "(1-2)-3" instead of "1-(2-3)".

A few additional minor points:

isDigit is a more readable name for \c -> c >= '0' && c <= '9'
munch1 is more efficient than many1 (satisfy xxx), so I'd redefine number to use it
for testing, it's probably a good idea to have a parseExpressions function that looks at all the parses
for production, it's probably a good idea to check for ambiguous parses and do something about it, rather than (fairly arbitrarily) selecting the last parse in the list

With all of these suggestions implemented, the final version would look something like:

{-# OPTIONS_GHC -Wall #-} import Data.Char (isDigit) import Control.Applicative import Text.ParserCombinators.ReadP (eof, munch1, ReadP, readP_to_S, skipSpaces, string, (<++)) data Operator = Add | Subtract data Expression = Number Int | Infix { left :: Expression, op :: Operator, right :: Expression } lexeme :: ReadP a -> ReadP a lexeme p = p <* skipSpaces symbol :: String -> ReadP String symbol = lexeme . string number :: ReadP Expression number = lexeme $ Number . read <$> munch1 isDigit operator :: ReadP Operator operator = Add <$ symbol "+" <|> Subtract <$ symbol "-" expression :: ReadP Expression expression = do x <- number (Infix x <$> operator <*> expression) <++ return x top :: ReadP Expression top = skipSpaces *> expression <* eof parseExpressions :: String -> [(Expression, String)] parseExpressions = readP_to_S top parseExpression :: String -> Maybe Expression parseExpression input = case parseExpressions input of [] -> Nothing [(x,"")] -> Just x _ -> error "ambiguous parse"

Stack Exchange Network

Parsing expressions with combinators

1 Answer 1

Hot Network Questions

Parsing expressions with combinators

1 Answer 1

Related

Hot Network Questions