
    digest = defaultdict(lambda: defaultdict(set))

It's pretty rare that we have a good reason for such a nested data structure in Python.

    for element in data:
        if not isinstance(element, dict) or len(element) != 2:
            break
        for key, value in element.items():
            digest[key][type(value)].add(repr(value))

Why are we adding the repr of the value? I assume you do it to avoid hashing unhashable objects. But what if somebody puts some strange object in your list whose repr always returns the same thing? Besides, you are only interested in strings.
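To make the repr concern concrete, here is a minimal sketch. The `Opaque` class is hypothetical and exists only to show two different values collapsing into a single set entry once they are stored by repr.

```python
class Opaque:
    """Hypothetical object whose repr ignores its contents."""

    def __init__(self, payload):
        self.payload = payload

    def __repr__(self):
        return "<Opaque>"


values = [Opaque(1), Opaque(2)]

# Storing reprs makes two different objects indistinguishable.
seen_by_repr = {repr(value) for value in values}
print(len(seen_by_repr))

# Strings are already hashable, so if only strings matter
# they can go into a set directly, with no repr involved.
names = ["foo", "bar"]
seen_names = set(names)
print(len(seen_names) == len(names))
```

If only string values can ever be keys, filtering with `isinstance(value, str)` first sidesteps both the hashing problem and the repr problem.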

    else:
        if len(digest) == 2:

This is an unintuitive way to check for your requirement. Your requirement is actually that all elements have the same keys. Check for that instead.
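One way to state that requirement directly is to compare every element's keys against the first element's. This is only a sketch; the helper name `same_keys` is mine and does not come from your code.

```python
def same_keys(data):
    """Return True if every element is a dict with the same keys as the first."""
    if not data or not all(isinstance(element, dict) for element in data):
        return False
    expected = data[0].keys()
    # dict key views compare like sets, so ordering does not matter.
    return all(element.keys() == expected for element in data)


print(same_keys([{"name": "foo", "age": 42}, {"age": 36, "name": "bar"}]))
print(same_keys([{"name": "foo", "age": 42}, {"name": "bar", "size": 1}]))
```

The first call reports matching keys even though the second dict lists them in a different order; the second call reports a mismatch.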

            pool = {k for k in digest if len(digest[k][str]) == len(data)}

You're actually checking whether the value for that key was always a string. Perhaps you should check that more directly.

            if pool:
                if len(pool) > 1:
                    meta = input("Which key (%s)? " % "|".join(pool))

It's deeply suspicious if you need to ask the user for instructions.

                else:
                    meta = next(iter(pool))
                infra = (digest.keys() - {meta}).pop()
                return {d[meta]: optimize(d[infra]) for d in data}
    return [optimize(x) for x in data]

Here is my reworking of it.

    def optimize(data):
        """
        Dispatcher, figure out which algorithm to use to compress the json
        """
        if isinstance(data, list):
            if len(data) == 1:
                return optimize(data[0])
            else:
                keys = determine_key(data)
                if keys is None:  # unable to determine the keys
                    return [optimize(element) for element in data]
                else:
                    return optimize_list(data, keys)
        elif isinstance(data, dict):
            if len(data) == 1:
                return optimize(next(iter(data.values())))
            else:
                return optimize_dict(data)
        else:
            return data


    def optimize_dict(data):
        """
        Optimize a dict, just optimizes each value
        """
        return {key: optimize(value) for key, value in data.items()}


    def can_be_key(data, key):
        """
        return true if key could be a key
        """
        if not all(isinstance(element[key], str) for element in data):
            return False
        return len(set(element[key] for element in data)) == len(data)


    def determine_key(data):
        """
        Determine the key, value to compress the list of dicts
        or None if no such key exists
        """
        if not data:  # an empty list has no first element to inspect
            return None

        for element in data:
            if not isinstance(element, dict):
                return None
            if len(element) != 2:
                return None
            if element.keys() != data[0].keys():
                return None

        key1, key2 = data[0].keys()
        key1_possible = can_be_key(data, key1)
        key2_possible = can_be_key(data, key2)

        if key1_possible and key2_possible:
            meta = input("Which key (%s|%s)? " % (key1, key2))
            if meta == key1:
                return key1, key2
            else:
                return key2, key1
        elif key1_possible:
            return key1, key2
        elif key2_possible:
            return key2, key1
        else:
            return None


    def optimize_list(data, keys):
        key_key, key_value = keys
        return {element[key_key]: optimize(element[key_value]) for element in data}

But... I must question the wisdom of the entire approach. The function is actually rather useless. Suppose that you use it on a web service that returns a JSON object. It gives you:

    {"results": [
        {"name": "foo", "age": 42},
        {"name": "bar", "age": 36}
    ]}

Which optimizes to:

    {"foo": 42, "bar": 36}

So you write:

    for name, age in optimize(the_json).items():
        print(name, age)

But then in another request, the server returns:

    {"results": [
        {"name": "foo", "age": 42}
    ]}

which optimizes to:
42
So when you run:

    for name, age in optimize(the_json).items():
        print(name, age)

You get an error.
Or the server adds extra data to the JSON:

    {
        "salinity": "high",
        "results": [
            {"name": "foo", "age": 42, "status": "sad"},
            {"name": "bar", "age": 36, "status": "happy"}
        ]
    }

That change should be backwards compatible for anything consuming the JSON, but it will break your code, or anything else doing the same job.
This is not to say that you can't or shouldn't simplify the JSON. But don't try to do it by analyzing the data like this. You should do it from some external record of the schema.

    class Fetch(object):
        """
        Fetch one element from a dictionary
        """
        def __init__(self, name, inner):
            self._name = name
            self._inner = inner

        def __call__(self, data):
            return self._inner(data[self._name])


    class KeyValue(object):
        """
        Given a list of dicts, convert to a dictionary with each key
        being taken from the `key` on each element and each value
        being taken from the `value`
        """
        def __init__(self, key, value, inner):
            self._key = key
            self._value = value
            self._inner = inner

        def __call__(self, data):
            return {element[self._key]: self._inner(element[self._value])
                    for element in data}


    class Literal(object):
        def __call__(self, data):
            return data


    schema = Fetch("recipes",
        KeyValue("name", "ingredients",
            KeyValue("ingredient", "quantity", Literal())
        )
    )

That's actually simpler than what you were doing, and it's much more robust. It's also very flexible, as you can write code to perform just about any transformation on the data.
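As a rough illustration of that robustness, here is a sketch that applies the same idea to the name/age responses from earlier. The three classes are repeated in compact form only so the snippet runs on its own, and the schema name `people` is something I made up for this example.

```python
class Fetch:
    """Fetch one element from a dictionary."""
    def __init__(self, name, inner):
        self._name = name
        self._inner = inner

    def __call__(self, data):
        return self._inner(data[self._name])


class KeyValue:
    """Turn a list of dicts into one dict keyed by `key`."""
    def __init__(self, key, value, inner):
        self._key = key
        self._value = value
        self._inner = inner

    def __call__(self, data):
        return {element[self._key]: self._inner(element[self._value])
                for element in data}


class Literal:
    """Return the data unchanged."""
    def __call__(self, data):
        return data


# Hypothetical schema for the name/age responses.
people = Fetch("results", KeyValue("name", "age", Literal()))

single = {"results": [{"name": "foo", "age": 42}]}
extended = {
    "salinity": "high",
    "results": [
        {"name": "foo", "age": 42, "status": "sad"},
        {"name": "bar", "age": 36, "status": "happy"},
    ],
}

# Both responses come back as a dict of name -> age.
for response in (single, extended):
    for name, age in people(response).items():
        print(name, age)
```

Because the shape of the result is fixed by the schema rather than guessed from the data, a single-element list and a response with extra fields both produce a plain name-to-age dict.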