2
\$\begingroup\$

The following code is a module that evaluates boolean expressions, which are entered through an unauthorized web API. The exposed function of the library being used is evaluate(). Its positional argument string is a string of a pythonic boolean expression that is entered by the user and thusly untrustworthy (so it's potentially any string). I implemented some basic checks to ensure that arbitrary code execution is mitigated. Most of the security is outsourced to the callback function used to evaluate the boolness of user-defined tokens. I'd appreciate a review with focus on security, but also appreciate general feedback.

"""Boolean evaluation.""" from typing import Callable, Iterator __all__ = ['SecurityError', 'evaluate'] PARENTHESES = frozenset({'(', ')'}) KEYWORDS = frozenset({'and', 'or', 'not'}) class SecurityError(Exception): """Indicates a possible security breach in parsing boolean statements. """ def bool_val(token: str, callback: Callable[[str], bool]) -> str: """Evaluates the given statement into a boolean value.""" callback_result = callback(token) if isinstance(callback_result, bool): return str(callback_result) raise SecurityError('Callback method did not return a boolean value.') def tokenize(word: str) -> Iterator[str]: """Yields tokens of a string.""" for index, char in enumerate(word): if char in PARENTHESES: yield word[:index] yield char yield from tokenize(word[index+1:]) break else: yield word def boolexpr(string: str, callback: Callable[[str], bool]) -> Iterator[str]: """Yields boolean expression elements for python.""" for word in string.strip().split(): for token in filter(None, tokenize(word)): if token in KEYWORDS or token in PARENTHESES: yield token else: yield bool_val(token, callback) def evaluate( string: str, *, callback: Callable[[str], bool] = lambda s: s.casefold() == 'true' ) -> bool: """Safely evaluates a boolean string.""" return bool(eval(' '.join(boolexpr(string, callback)))) 

Update

This library is used in one place only, namely a real estates filtering API, where web applications can retrieve real estates from a database. By using boolean expressions parsed by this library code, the web applications can filter the retrieved real estates. The actual matching is done in the callback function passed to evaluate, which matches common filtering properties on the respective real estates and then uses evaluate() to assert whether the requested combination of properties is met. E.g. id in [1, 2, 3] and (sales_type == "rent" or price < 250000) which, depending on the real estate, evaluates the matching to True and ( False or True ) which is then passed to eval().

\$\endgroup\$
6
  • \$\begingroup\$Boolean-expr-as-a-service does not seem needed. Can you give any description to the contrary?\$\endgroup\$CommentedJan 3, 2022 at 19:07
  • \$\begingroup\$In what part of the architecture is (for example) your ID membership check converted into a boolean? The client, the web server, this application server?\$\endgroup\$CommentedJan 4, 2022 at 14:10
  • \$\begingroup\$All evaluations are done on the server side.\$\endgroup\$CommentedJan 4, 2022 at 14:13
  • \$\begingroup\$So a client writes an arbitrary expression, sends it to the server, the server evaluates boolean tokens to truth values, fills them in in the expression, the expression itself is evaluated, and then what? Used for filtering in memory, in SQL?\$\endgroup\$CommentedJan 4, 2022 at 14:19
  • \$\begingroup\$In-memory filtering of XML DOM models. Filtering on database-level would be too complex and not significantly improve performance (yes, I tested that).\$\endgroup\$CommentedJan 4, 2022 at 15:19

1 Answer 1

4
\$\begingroup\$

I mean to say this in the gentlest possible way, but from top to bottom this is a bad idea. It would take a more detailed explanation of why this creature exists to convince me otherwise.

Your default callback produces some pretty crazy results: ^ and 1 both evaluate to False, for instance.

A critical step in security is to limit DOS potential by constraining the possible expression input length.

I would replace your PARENTHESES and KEYWORDS with an AST-walk that is very strict on node types.

Your eval uses all-defaults, which means:

  • A code object will be immediately compiled, without being able to examine the AST first
  • Any future-features available to the calling scope are implicitly passed into the compiled scope
  • Just... read this and tell me that it's a good idea:

If both [global and local] dictionaries are omitted, the expression is executed with the globals and locals in the environment where eval() is called.

I would also rethink your callback. If the idea is to have a small collection of names that evaluate to booleans, just pass these in as locals and call it a day. Anything more complicated expands the potential for misuse.

If this needs to exist at all, I would suggest something like

"""Boolean evaluation.""" import ast from types import CodeType from typing import Union EXPR_LIMIT = 100 class SecurityError(Exception): """Indicates a possible security breach in parsing boolean statements. """ def compile_isolated( source: Union[str, ast.Expression], flags: int, ) -> Union[ast.Expression, CodeType]: return compile( source=source, filename='<untrusted_expr>', mode='eval', flags=flags, dont_inherit=1, # no future features, isolated flags optimize=2, # cut asserts and docstrings ) def evaluate(string: str, **bool_locals: bool) -> bool: if len(string) > EXPR_LIMIT: raise SecurityError('Expression too long') expr: ast.Expression = compile_isolated( source=string, flags=ast.PyCF_ONLY_AST, # no await, no pep484, defer compilation ) if not isinstance(expr, ast.Expression): raise SecurityError('Invalid parent node') top_op, = ast.iter_child_nodes(expr) for node in ast.walk(top_op): if isinstance(node, ( ast.UnaryOp, ast.Not, ast.BoolOp, ast.boolop, ast.Name, ast.Load, )): continue if isinstance(node, ast.Constant): literal = ast.literal_eval(node) if not isinstance(literal, bool): raise SecurityError('Invalid literal') else: raise SecurityError('Invalid node') code_object: CodeType = compile_isolated(source=expr, flags=0) isolated_globals = {'__builtins__': {}} for v in bool_locals.values(): if not isinstance(v, bool): raise ValueError('Invalid value for bool_locals') result = eval(code_object, isolated_globals, bool_locals) if not isinstance(result, bool): raise SecurityError('Unexpected non-boolean output') return result def test() -> None: assert evaluate('not (a and False) or True', a=True) is True assert evaluate('not False') is True assert evaluate('a and (b or c)', a=True, b=True, c=False) is True for bad in ( '__loader__ or True', 'exit', 'quit', 'eval and False', ): try: evaluate(bad) raise AssertionError('Should have failed') except NameError: pass for bad in ( '1 or False', 'int(False)', 'False * True', 'exit()', 'dos'*40, ): try: evaluate(bad) raise AssertionError('Should have failed') except SecurityError: pass if __name__ == '__main__': test() 
\$\endgroup\$

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.