portable hashable string representation of data object

Question

I have a class of (flat) objects that are going to be passed around between three different parties, each possibly running a different software stack.

my_object:
  item: "string_a"
  money:
    amount: 99.9
    currency: "EUR"
  metadata: 5
  origin: "https://www.example.com/path"

These objects will need to be signed and the signatures will need to be verifiable.
In order for the signed hash to be stable from one party to another, there has to be an exact canonical string representation.

EDIT:
For example, both the following are valid string representations of the above object, but they'll have different hashes.

<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <money>
    <amount>99.9</amount>
    <currency>EUR</currency>
  </money>
  <metadata>5</metadata>
  <origin>https://www.example.com/path</origin>
</my_object>

<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <money><amount>99.9</amount><currency>EUR</currency></money>
  <metadata>5.0</metadata>
  <origin>https://www.example.com/path</origin>
</my_object>
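
To make the mismatch concrete, here is a minimal Python sketch using only the standard library. The helper names (`digest`, `values`) are invented for this illustration, and reading both numeric fields as floats is an assumption about how a receiving party would interpret them. It hashes the two serializations and compares the parsed values.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = "{https://www.TBD.com}"  # namespace used in the example documents

doc_a = """<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <money>
    <amount>99.9</amount>
    <currency>EUR</currency>
  </money>
  <metadata>5</metadata>
  <origin>https://www.example.com/path</origin>
</my_object>"""

doc_b = """<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <money><amount>99.9</amount><currency>EUR</currency></money>
  <metadata>5.0</metadata>
  <origin>https://www.example.com/path</origin>
</my_object>"""


def digest(text: str) -> str:
    """SHA-512 of the exact UTF-8 bytes of the serialization."""
    return hashlib.sha512(text.encode("utf-8")).hexdigest()


def values(text: str) -> tuple:
    """Parse out the five values, treating amount and metadata as numbers."""
    root = ET.fromstring(text)
    money = root.find(NS + "money")
    return (
        root.findtext(NS + "item"),
        float(money.findtext(NS + "amount")),
        money.findtext(NS + "currency"),
        float(root.findtext(NS + "metadata")),
        root.findtext(NS + "origin"),
    )


same_values = values(doc_a) == values(doc_b)   # parsed data agrees
same_digest = digest(doc_a) == digest(doc_b)   # serialized bytes do not
print(same_values, same_digest)
```

The parsed values compare equal while the digests do not, which is exactly the false negative a verifier would hit.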

All parties will be parsing the objects, so it's the values that need to be verifiable in the signature process. In the above example, the data that needs to be signed is
{"string_a", 99.9, "EUR", 5, "https://www.example.com/path"}.
Of course there must be no ambiguity about which value goes with which key; for example it must be clear that 5 is the metadata and not the amount.
That said, it seems reasonable to treat the original serialization of the data as a data value in itself, and to parse the contained data out of it as needed. This would sidestep the above problem of needing a byte-perfect canonical version.
END EDIT

I think it's a good idea to define the "outer" signed object in XML.

<signed_object xmlns="https://www.TBD.com">
  <data>string representation of the inner data object</data>
  <signature>RSA signature of the SHA512 hash of the value in `data`</signature>
</signed_object>
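
As a rough sketch of how such an envelope could behave, the following Python (standard library only) wraps an inner string, unwraps it again, and checks that the hash of the recovered string is unchanged. Several things here are assumptions made for illustration: the function names `wrap`, `unwrap`, and `sha512_hex` are hypothetical, the namespace is omitted for brevity, and the `signature` field holds a bare SHA-512 hex digest as a placeholder where a real system would put an RSA signature of that hash.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Tuple


def sha512_hex(text: str) -> str:
    """SHA-512 of the exact UTF-8 bytes of the inner string."""
    return hashlib.sha512(text.encode("utf-8")).hexdigest()


def wrap(inner: str, signature: str) -> str:
    """Build the outer envelope; ElementTree escapes the inner text."""
    root = ET.Element("signed_object")
    ET.SubElement(root, "data").text = inner
    ET.SubElement(root, "signature").text = signature
    return ET.tostring(root, encoding="unicode")


def unwrap(envelope: str) -> Tuple[str, str]:
    """Return (inner string, signature) exactly as stored in the envelope."""
    root = ET.fromstring(envelope)
    return root.findtext("data"), root.findtext("signature")


inner = "<my_object><item>string_a</item><metadata>5</metadata></my_object>"

# Placeholder only: a real system would store an RSA signature of this hash.
placeholder_signature = sha512_hex(inner)

envelope = wrap(inner, placeholder_signature)
recovered, signature = unwrap(envelope)

# The verifier hashes the recovered string as-is and never re-serializes it.
still_valid = sha512_hex(recovered) == signature
print(still_valid)
```

One caveat with this design: XML parsers normalize line endings, so an inner string containing carriage returns may not come back byte-for-byte. Restricting the inner string to `\n` line endings (or to a single line) avoids that.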

Clearly I have a variety of options for what should go in that inner <data> tag.

An escaped XML string would work.
- We know all the parties have XML parsing set up.
- We could help with verification by providing an XSD.
- The relatively large size of an XML string isn't a big deal because this will all get compressed.
- On the other hand, it would be a pain for a human to look at an XML-escaped XML string and understand what was going on.
- Also, a naive implementer might try to rebuild the XML from the contained data, get a different hash, and reject a valid signature.
JSON and YAML might be slightly better than XML for human readability, but would have the same problem that "equivalent" objects could have different hashes.
A delimited string (with commas, pipes, etc.) would be more human-readable, and would clearly be string data in itself.
Taking that idea even further, we could provide a canonical regex with capture groups for unambiguously validating and parsing these strings.
And finally, we could decide not to worry about having a canonical string version of each such object, and instead have a really well defined process for serializing the data for hashing.
- This would look the best.
- This would be the easiest to screw up.

EDIT:
Expanding on that last option (because it's the one I've actually seen done in the wild):

Simply concatenating the values in an explicit order probably won't work: what if item ends with numeric characters?
Concatenating the keys and values in an explicit order would probably work; the only problems I can think of are ambiguities in how to represent numbers (5 vs 5.0), and possibly conversions between UTF-8/UTF-32/ASCII.
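
Both points can be sketched in a few lines of Python. Everything specific below is an assumption made for illustration rather than a proposed standard: the field order, the `key=value` line layout, the choice of `Decimal` to normalize numbers, and the names `naive`, `canonical_number`, and `canonical_bytes`.

```python
from decimal import Decimal

# Assumed fixed field order for the flat object.
FIELD_ORDER = ("item", "amount", "currency", "metadata", "origin")
NUMERIC_FIELDS = {"amount", "metadata"}


def naive(record: dict) -> str:
    """Concatenate values only, in order (ambiguous, shown for contrast)."""
    return "".join(str(record[key]) for key in FIELD_ORDER)


def canonical_number(value) -> str:
    """One lexical form per number: no trailing zeros, no exponent."""
    # Going through str() avoids binary-float noise; normalize() drops
    # trailing zeros; the "f" format avoids forms like 1E+2.
    return format(Decimal(str(value)).normalize(), "f")


def canonical_bytes(record: dict) -> bytes:
    """key=value lines in a fixed order, encoded as UTF-8."""
    lines = []
    for key in FIELD_ORDER:
        value = record[key]
        text = canonical_number(value) if key in NUMERIC_FIELDS else str(value)
        if "\n" in text:
            raise ValueError(f"newline not allowed in field {key!r}")
        lines.append(f"{key}={text}")
    return "\n".join(lines).encode("utf-8")


a = {"item": "string_a", "amount": 99.9, "currency": "EUR",
     "metadata": 5, "origin": "https://www.example.com/path"}
b = dict(a, item="string_a9", amount=9.9)   # different data
c = dict(a, metadata=5.0)                   # same data, different number form

print(naive(a) == naive(b))                      # values-only collides
print(canonical_bytes(a) == canonical_bytes(b))  # keys keep them apart
print(canonical_bytes(a) == canonical_bytes(c))  # 5 and 5.0 agree
```

The values-only concatenation makes `a` and `b` indistinguishable, while the keyed form separates them and maps 5 and 5.0 to the same bytes. Every party would still have to implement the number rule identically, which is where this option is easiest to get wrong.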

I like the idea of defining the string format using a regex. Is that a bad idea for any reason?
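
For a sense of what that could look like, here is a hedged Python sketch of a pipe-delimited format validated by a single regex with named capture groups. The delimiter, the field patterns, and the names `NUMBER`, `CANONICAL`, and `parse` are all assumptions chosen for this example. The number pattern deliberately admits only one spelling per value (no leading zeros, no trailing zeros after the decimal point), so `5.0` is rejected rather than silently accepted.

```python
import re
from typing import Optional

# Exactly one lexical form per non-negative decimal number.
NUMBER = r"(?:0|[1-9][0-9]*)(?:\.[0-9]*[1-9])?"

CANONICAL = re.compile(
    r"(?P<item>[^|\n]+)\|"
    r"(?P<amount>" + NUMBER + r")\|"
    r"(?P<currency>[A-Z]{3})\|"
    r"(?P<metadata>" + NUMBER + r")\|"
    r"(?P<origin>[^|\n]+)"
)


def parse(line: str) -> Optional[dict]:
    """Return the captured fields, or None if the string is not canonical."""
    # fullmatch (rather than match with $) also rejects a trailing newline.
    match = CANONICAL.fullmatch(line)
    return match.groupdict() if match else None


good = parse("string_a|99.9|EUR|5|https://www.example.com/path")
bad = parse("string_a|99.9|EUR|5.0|https://www.example.com/path")
print(good)
print(bad)
```

One limitation of this sketch: values that contain the delimiter cannot be represented without an additional escaping rule, and regex dialects differ slightly between platforms, so the pattern would need to stick to widely supported syntax.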

EDIT: I'll be asking other people to implement parts of this system, on a variety of platforms. What system will be the easiest and/or most reliable to implement, in general?

What exactly must be signed? Do you need to sign the serialized state, e.g., the XML string? Or do you need to sign the object data, e.g., the values "string_a", 99.9, "EUR", etc.? Both are valid use cases: The first case allows you to verify an object by comparing its serialization string, e.g., the XML, against its hash, which can be done without actually parsing the XML. The second case allows you to verify the object only after deserialization, which means that this is completely independent from the serialization format and you can use (and even mix) XML, JSON, etc.
– pschill
Commented Jun 19, 2019 at 10:11
It's the data that needs to be authenticated by the signature; I've been assuming that authenticating the serialized state will be easier. I've edited above for clarity.
– ShapeOfMatter
Commented Jun 19, 2019 at 16:17

Michael Kay · Accepted Answer · 2019-06-19 08:19:27Z

-1

XML, JSON, and YAML should all be quite capable of this. XML has more tooling available, JSON is simpler, YAML has a bit of a cult following but is far less widely used.

The worst of all possible options is to invent your own proprietary format, which is what you seem to be suggesting in your last sentence.


XML, JSON, and YAML are all fine serialization formats; the problem is that within each scheme a given abstract object can have several different string representations, which could cause false negatives when people try to validate the signatures. Supposedly this is a solved problem for XML, but I haven't been able to find those solutions. I remember reading that they're complicated to implement, and I'd rather not ask participants to take on an extra dependency, even assuming portable cross-language libraries exist.
– ShapeOfMatter
Commented Jun 19, 2019 at 16:21
Whatever format you use, you're going to have to define canonical representations if you want signatures to match. For example, with XML, element order is significant when computing a signature, so you need to define the element order (which can be done by constraining it using an XSD schema).
– Michael Kay
Commented Jun 19, 2019 at 16:43
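
As one illustration of what a canonical representation buys and what it does not, Python 3.8+ ships `xml.etree.ElementTree.canonicalize`, an implementation of Canonical XML (C14N 2.0). The sketch below is only a demonstration on a cut-down version of the example object. The helper name `canonical` is made up here, and whether other platforms in the exchange offer an equivalent option to `strip_text` is an open assumption.

```python
import xml.etree.ElementTree as ET

spaced = """<my_object xmlns="https://www.TBD.com">
  <item>string_a</item>
  <metadata>5</metadata>
</my_object>"""

compact = (
    '<my_object xmlns="https://www.TBD.com">'
    "<item>string_a</item><metadata>5</metadata>"
    "</my_object>"
)

# Same document as `compact`, but with the number written as 5.0.
decimal_form = compact.replace("<metadata>5<", "<metadata>5.0<")


def canonical(xml_text: str) -> str:
    """C14N 2.0 form with leading/trailing whitespace in text stripped."""
    return ET.canonicalize(xml_text, strip_text=True)


whitespace_resolved = canonical(spaced) == canonical(compact)
lexical_resolved = canonical(compact) == canonical(decimal_form)
print(whitespace_resolved, lexical_resolved)
```

Canonicalization removes the whitespace difference but leaves `5` and `5.0` distinct, and it does not reorder elements, so the lexical form of values and the element order still have to be fixed by convention or by a schema.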
