2
\$\begingroup\$

Here's a Python script to optimize JSON documents:

#!/usr/bin/python3 """jopo - Optimize JSON documents.""" from collections import defaultdict def optimize(data): """Optimize a JSON-like data structure.""" if isinstance(data, list): if len(data) == 1: return optimize(data[0]) digest = defaultdict(lambda: defaultdict(set)) for element in data: if not isinstance(element, dict) or len(element) != 2: break for key, value in element.items(): digest[key][type(value)].add(repr(value)) else: if len(digest) == 2: pool = {k for k in digest if len(digest[k][str]) == len(data)} if pool: if len(pool) > 1: meta = input("Which key (%s)? " % "|".join(pool)) else: meta = next(iter(pool)) infra = (digest.keys() - {meta}).pop() return {d[meta]: optimize(d[infra]) for d in data} return [optimize(x) for x in data] if isinstance(data, dict): if len(data) == 1: return optimize(next(iter(data.values()))) else: return {k: optimize(v) for k, v in data.items()} return data if __name__ == "__main__": import json import sys for arg in sys.argv[1:]: with open(arg) as f: doc = json.load(f) print(json.dumps(optimize(doc), indent=2)) 

... which, given a JSON document recipes.json like this:

{ "recipes": [ { "name": "pizza", "ingredients": [ { "quantity": "100g", "ingredient": "cheese" }, { "quantity": "200g", "ingredient": "tomato" } ] }, { "name": "pizza 2", "ingredients": [ { "quantity": "300g", "ingredient": "ham" }, { "quantity": "300g", "ingredient": "pineapple" } ] } ] } 

... is used like this:

$ ./jopo.py recipes.json Which key (ingredient|quantity)? ingredient { "pizza 2": { "ham": "300g", "pineapple": "300g" }, "pizza": { "tomato": "200g", "cheese": "100g" } } 

It does two things:

  1. Collapse 1-item objects and arrays.

  2. Reduce arrays of congruent 2-item objects to a single object, if possible.

I think the optimize() function is both too long and too hard to follow.

Obviously I could separate out the if isinstance(data, <type>): # ... blocks into separate functions, but even then, an optimize_list() function along those lines is still going to be quite long and, IMO, somewhat impenetrable.

How can I improve the legibility of this code?

\$\endgroup\$

    1 Answer 1

    2
    \$\begingroup\$
     digest = defaultdict(lambda: defaultdict(set)) 

    It's pretty rare that we have a good reason for such a nested data structure in python.

     for element in data: if not isinstance(element, dict) or len(element) != 2: break for key, value in element.items(): digest[key][type(value)].add(repr(value)) 

    Why are we adding the repr of the value? I assume you do it to avoid taking the hash of unhashable objects. But what if somebody puts some strange object in your list that has a repr which always returns the same thing. Besides you are only interested in strings.

     else: if len(digest) == 2: 

    This is an unintuive way to check for your requirement. Your requirement is actually that all elements have the same keys. Check for that instead.

     pool = {k for k in digest if len(digest[k][str]) == len(data)} 

    Your actually checking to see if that key was always a string. Perhaps you should check that more directly.

     if pool: if len(pool) > 1: meta = input("Which key (%s)? " % "|".join(pool)) 

    Its deeply suspicious if you need to ask the user for instructions.

     else: meta = next(iter(pool)) infra = (digest.keys() - {meta}).pop() return {d[meta]: optimize(d[infra]) for d in data} return [optimize(x) for x in data] 

    Here is my reworking of it.

    def optimize(data): """ Dispatcher, figure out which algorithm to use to compress the json """ if isinstance(data, list): if len(data) == 1: return optimize(data[0]) else: keys = determine_key(data) if keys is None: # unable to determine tkeys return [optimize(element) for element in data] else: return optimize_list(data, keys) elif isinstance(data, dict): if len(data) == 1: return optimize(next(iter(data.values()))) else: return optimize_dict(data) else: return data def optimize_dict(data): """ Optimize a dict, just optimizes each value """ return {key : optimize(value) for key, value in data.items()} def can_be_key(data, key): """ return true if key could be a key """ if not all( isinstance(element[key], str) for element in data): return False return len(set(element[key] for element in data)) == len(data) def determine_key(data): """ Determine the key, value to compress the list of dicts or None if no such key exists """ for element in data: if not isinstance(element, dict): return None if len(element) != 2: return None if element.keys() != data[0].keys(): return None key1, key2 = data[0].keys() key1_possible = can_be_key(data, key1) key2_possible = can_be_key(data, key2) if key1_possible and key2_possible: meta = input("Which key (%s|%s)? " % (key1, key2)) if meta == key1: return key1, key2 else: return key2, key1 elif key1_possible: return key1, key2 elif key2_possible: return key2, key1 else: return None def optimize_list(data, keys): key_key, key_value = keys return {element[key_key]: optimize(element[key_value]) for element in data} 

    But... I must question the wisdom of the entire approach. The function is actually rather useless. Suppose that you use it on web service that returns a json object. It gives you

    {"results" : [ {"name" : "foo", "age": 42} {"name" : "bar", "age": 36} } 

    Which optimize to

    {"foo": 42, "age": 36}

    So you write:

    for name, age in optimize(the_json).items(): print(name, age) 

    But then in another request, the server returns

    {"results" : [ {"name" : "foo", "age": 42} } 

    which optimizes to:

    42

    So when you run:

    for name, age in optimize(the_json).items(): print(name, age) 

    You get an error.

    Or the server adds extra data to the json:

    { "salinity" : "high", "results" : [ {"name" : "foo", "age": 42, "status" : "sad"} {"name" : "bar", "age": 36, "status" : "happy"} } 

    Which should be backwards compatible using json, but will break your code or anything doing the same job.

    This is not to say that can't or shouldn't simplify the json. But don't try to do it by analyzing the data like this. You should do it by some external record of the schema.

    class Fetch(object): """ Fetch one element from a dictionary """ def __init__(self, name, inner): self._name = name self._inner = inner def __call__(self, data): return self._inner(data[self._name]) class KeyValue(object): """ Given a list of dicts, convert to a dictionary with each key being taken from the `key` on each element and each value being taken from the `value` """ def __init__(self, key, value, inner): self._key = key self._value = value self._inner = inner def __call__(self, data): return {element[self._key] : self._inner(element[self._value]) for element in data} class Literal(object): def __call__(self, data): return data schema = Fetch("recipes", KeyValue("name", "ingredients", KeyValue("ingredient", "quantity", Literal()) ) ) 

    That's actually simpler then what you were doing, and its much more robust. Its also very flexible as you can write code to perform just about any transformation on the data.

    \$\endgroup\$
    6
    • \$\begingroup\$Thanks for the detailed critique ... I'll take some time to give it the attention it deserves, but for the moment, re: "I must question the wisdom of the entire approach" - it's not intended to work with live data (as you say, that would be horribly fragile), but as an aid in analysing the design of a JSON document. The idea came from this SO question, and the realisation that I could get a machine to replicate the rules I'd applied in answering it.\$\endgroup\$
      – user19131
      CommentedAug 9, 2013 at 11:30
    • \$\begingroup\$On your criticisms: I agree the nested defaultdict is less readable than it could be. I'm using it to check complex-ish criteria in as few passes as possible; perhaps I shouldn't. Yes, the repr is for hashability. I'm unaware of any two distinct JSON fragments with the same repr; do you know of any? My requirement is that all objects in the array have the same two keys, so that the array can be converted to an object. I'm checking that the value is always a unique string, so that it can be a key. User input is necessary if both potential keys all have unique string values.\$\endgroup\$
      – user19131
      CommentedAug 13, 2013 at 15:51
    • \$\begingroup\$Given the purpose of the code - suggest improvements to an implicit schema from example data - I can't really use a nonexistent schema to optimize against. My apologies for not being clearer about that in my question.\$\endgroup\$
      – user19131
      CommentedAug 13, 2013 at 15:52
    • \$\begingroup\$And ... it turns out I can't put linebreaks in comments. Further apologies for the mess that's made of my response. If code review is a dialogue, it's kinda tricky in the SO format, isn't it?\$\endgroup\$
      – user19131
      CommentedAug 13, 2013 at 15:53
    • \$\begingroup\$No two JSON fragments will never have the same repr. But somebody could pass something that wasn't json into your code. Its probably not a serious thing to be concerned about. But its a rather odd use of repr.\$\endgroup\$CommentedAug 13, 2013 at 16:04