
Context

I want to build a transposition table where the game states are binary square matrices whose size MATRIX_SIZE is known at compile time. Rather than storing these matrices, I want to store their "hash". More precisely, I want to build a function that, given such a matrix, returns a unique identifier. My requirements are:

  • Performance: computing a unique identifier must be as fast as possible, since the function will be called a lot.
  • Perfect collision-resistance: the unique identifier must be, well, unique. Collisions are not accepted.
  • Parallelism: at some point, several threads will want to compute uids for different matrices concurrently.

The first idea that comes to mind for tackling such a problem is using well-known hash functions. However:

  • The basic hash functions in Rust make a trade-off between performance and collision resistance.
  • I expect to see more than 2^32 matrices, which means that, because of the birthday paradox, I would have to use tags that can't be stored in 64-bit integers. This would in turn prevent me from using these tags directly as keys for a structure such as a HashMap or an IndexMap.
  • Even the most performant hash functions, such as SHA3, make several computations while reading the output.

My idea is that, since I know my inputs will always have the same size, I can exploit this property. Fundamentally, what I want to do is store the matrices I have seen. When asked for the unique identifier of a matrix, I check whether this matrix has already been given an identifier; if not, I increment a global counter and assign its new value to this matrix.
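To pin down the behaviour I'm after, here is a minimal sketch of that idea using a plain HashMap (the Interner name and this whole snippet are just for illustration; it relies on exactly the kind of hashing I'd like to avoid):

    use std::collections::HashMap;

    /// Illustration only: maps each distinct flattened matrix to a counter value.
    struct Interner {
        seen: HashMap<Vec<u8>, u64>,
        next_id: u64,
    }

    impl Interner {
        fn new() -> Self {
            Self { seen: HashMap::new(), next_id: 0 }
        }

        /// Returns the uid already assigned to `matrix`, or assigns the next
        /// counter value to it.
        fn get_uid(&mut self, matrix: &[u8]) -> u64 {
            let next_id = &mut self.next_id;
            *self.seen.entry(matrix.to_vec()).or_insert_with(|| {
                let uid = *next_id;
                *next_id += 1;
                uid
            })
        }
    }

The question is how to get this behaviour without hashing the whole matrix (and without storing a copy of it in every key).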

For instance, an idea that could work is:

  • Start with an empty Vec.
  • Compute a (very) large integer whose binary decomposition is given by the (flattened) matrix.
  • If this integer is already in the Vec, return its index.
  • Otherwise, insert it into the Vec while keeping the Vec sorted.

Inserting and searching are efficient since the Vec is sorted (for instance, one may use the binary_search method of such objects). In fact, this could be a potential solution to my problem, except that the Vec keeps growing: from my understanding, every time the length reaches the next power of two, a new allocation has to take place and the whole Vec is copied. Moreover, inserting an element is clearly not thread-safe.
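To make that variant concrete, here is a rough sketch of it, assuming the flattened matrix fits into a u128 (it does here, since MATRIX_SIZE = 25); the names pack and SortedVecAssigner are mine, and I store (key, uid) pairs instead of using the position in the Vec as the uid, since later insertions would shift the positions of earlier entries:

    /// Packs a flattened binary matrix into an integer (assumes len() <= 128).
    fn pack(matrix: &[u8]) -> u128 {
        matrix.iter().fold(0u128, |acc, &bit| (acc << 1) | bit as u128)
    }

    struct SortedVecAssigner {
        /// Kept sorted by the packed key so that binary_search_by_key works.
        entries: Vec<(u128, u64)>,
        next_id: u64,
    }

    impl SortedVecAssigner {
        fn new() -> Self {
            Self { entries: Vec::new(), next_id: 0 }
        }

        fn get_uid(&mut self, matrix: &[u8]) -> u64 {
            let key = pack(matrix);
            match self.entries.binary_search_by_key(&key, |&(k, _)| k) {
                Ok(pos) => self.entries[pos].1,
                Err(pos) => {
                    let uid = self.next_id;
                    self.next_id += 1;
                    // O(n) shift on every insertion, plus a full reallocation
                    // and copy whenever the capacity is exceeded.
                    self.entries.insert(pos, (key, uid));
                    uid
                }
            }
        }
    }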

The solution I went with (for now?) is to implement a tree that will store the matrices it has seen in its leaves, along with their uid. For instance, let us assume MATRIX_SIZE=4. We start from a tree that is just a root:

    Node
    |---+---|
    None    None

Suppose that we now want to compute the unique identifier of [0, 0, 0, 1] (note that we consider the flattened matrices, since we just see them as iterators). The tree will then look like this:

    Node
    |-------------------------------+-------------------------------|
    Node                                                            None
    |---------------+---------------|
    Node                            None
    |-------+-------|
    Node            None
    |---+---|
    None    Leaf(0)

If we now store [0, 1, 0, 1], the tree will look like this:

    Node
    |-------------------------------+-------------------------------|
    Node                                                            None
    |---------------+---------------|
    Node                            Node
    |-------+-------|               |-------+-------|
    Node            None            Node            None
    |---+---|                       |---+---|
    None    Leaf(0)                 None    Leaf(1)

The idea is that:

  • We only have to read the data and walk down the tree along with it; there are no additional computations.
  • As soon as we find a None, we know that this is a new stream, so we can directly return the new unique identifier and let another thread expand the tree in the background (this is not yet implemented, but I wrote the code with this idea in mind, which is why expand is a separate function).
  • It does not seem to suffer from the same problems as the Vec.
  • It seems a little more thread-friendly (though I'm not sure about this; a rough baseline of what I mean is sketched right after this list).
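For the last point, the only thread-safe version I can sketch right now is a plain Mutex around the whole assigner (the SharedAssigner name and the with helper are mine); it serialises every lookup behind one lock, so it's a baseline rather than the background-expansion idea above:

    use std::sync::{Arc, Mutex};

    /// Illustration only: any single-threaded assigner can be shared this way,
    /// at the cost of serialising every call behind the lock.
    struct SharedAssigner<A> {
        inner: Arc<Mutex<A>>,
    }

    impl<A> Clone for SharedAssigner<A> {
        fn clone(&self) -> Self {
            Self { inner: Arc::clone(&self.inner) }
        }
    }

    impl<A> SharedAssigner<A> {
        fn new(assigner: A) -> Self {
            Self { inner: Arc::new(Mutex::new(assigner)) }
        }

        /// Runs `f` with exclusive access to the wrapped assigner.
        fn with<R>(&self, f: impl FnOnce(&mut A) -> R) -> R {
            let mut guard = self.inner.lock().unwrap();
            f(&mut *guard)
        }
    }

    // e.g. let shared = SharedAssigner::new(UidAssigner::new());
    //      let uid = shared.with(|a| a.get_uid(matrix.iter()));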

However:

  • Just like the Vec solution, it may require a lot of memory, contrary to a stateless hash function (a quick per-node size check is sketched right after this list).
  • It requires a lot of "jumping around" in memory (pointer chasing), and I'm not sure how much this affects performance.
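To put a rough number on the memory point, this standalone snippet (the enum is copied only so it compiles on its own) prints the per-node size; I would expect it to print 24 on a typical 64-bit target, i.e. an 8-byte discriminant (after padding) plus the two child pointers:

    use std::mem::size_of;

    const TREE_BRANCHES: usize = 2;

    // Standalone copy of the enum below, only here so this snippet compiles
    // on its own.
    #[allow(dead_code)]
    enum UidAssignerTree {
        Leaf(u64),
        Node([Option<Box<UidAssignerTree>>; TREE_BRANCHES]),
    }

    fn main() {
        // Every previously unseen stream costs up to MATRIX_SIZE boxed nodes
        // of this size, on top of the allocator's own bookkeeping.
        println!("{} bytes per node", size_of::<UidAssignerTree>());
    }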

The code

Here's my attempt at implementing such a tree in Rust. Note that I haven't taken care of the parallelism aspect for now. Since I'm a beginner in Rust, any feedback is much appreciated, be it on performance, on good practices, or on the algorithm itself.

Here's the main.rs file:

    // main.rs
    mod constants;
    mod uid_assigner;

    fn main() {
        println!("Hello, world!");
    }

Here's the constants.rs file:

    // constants.rs
    const N_QUBITS: usize = 5;
    pub const MATRIX_SIZE: usize = N_QUBITS * N_QUBITS;

And finally, here's the uid_assigner.rs file:

    // uid_assigner.rs
    use crate::constants::MATRIX_SIZE;
    use crate::uid_assigner::UidAssignerTree::{Leaf, Node};

    // Only used to initialize an array full of None
    const NONE_INIT: Option<Box<UidAssignerTree>> = None;

    /// Each Node in the UidAssignerTree has the same number of children. This
    /// value must be equal to the largest integer the assigner will ever meet in
    /// the stream.
    ///
    /// Since we are dealing with binary streams, its value is currently set to 2.
    const TREE_BRANCHES: usize = 2;

    /// The enum that represents the tree used for assigning unique identifiers. The
    /// data is stored in the leaves only, the nodes are only meant to lead to the
    /// leaves.
    ///
    /// The rationale is that it is possible to both access and create an entry in
    /// $n$ steps, where $n$ is the length of the stream. Note that this assigner
    /// assumes that every stream to give an identifier to has the same length $n$.
    /// Otherwise, the `get_uid` function will panic.
    enum UidAssignerTree {
        // FIXME: This makes the leaves "heavy", since they must be able to be
        // casted to an array of pointers. That means that storing a u64 is as
        // heavy as storing TREE_BRANCHES usizes
        Leaf(u64),
        Node([Option<Box<UidAssignerTree>>; TREE_BRANCHES]),
    }

    /// The UidAssigner is the struct the user will be exposed to in order to assign
    /// unique identifiers to streams.
    pub(crate) struct UidAssigner {
        /// The `uid_assigner` is the structure that contains all previous
        /// identifiers and is in charge of creating new ones.
        uid_assigner: UidAssignerTree,
        /// The `next_id` simply represents the identifier that will be affected to
        /// the next stream to which no identifier is associated yet.
        next_id: u64,
    }

    impl UidAssignerTree {
        /// Creates the root of a new UidAssignerTree.
        fn new() -> Self {
            Node([NONE_INIT; TREE_BRANCHES])
        }

        /// Expands the tree until a new leaf.
        ///
        /// This function is only called whenever a None Node is found when looking
        /// for a new uid in the `get_uid` function. It uses the knowledge that all
        /// upcoming Nodes are going to be None to expand the tree in a branchless
        /// way.
        fn expand<'a>(&mut self, mut data: impl Iterator<Item = &'a u8>, next_uid: &mut u64) {
            match data.next() {
                // If the iterator isn't empty yet, we have to create the children
                // of the current node.
                Some(index) => {
                    if let Node(children) = self {
                        let child_reference = &mut children[*index as usize];
                        *child_reference = Some(Box::new(UidAssignerTree::new()));
                        child_reference.as_mut().unwrap().expand(data, next_uid);
                    } else {
                        // The nodes we encounter are those we previously created,
                        // so we know for certain that they're not leaves
                        unreachable!()
                    }
                }
                // Otherwise, the current node should be a Leaf to which we can
                // assign the new uid.
                None => {
                    *self = Leaf(*next_uid);
                    *next_uid += 1;
                }
            }
        }

        /// Computes the uid of a stream of data. It does so by either assigning a
        /// new uid to the stream if it is the first time the function is called
        /// with this particular stream, or by giving back the previously associated
        /// uid.
        fn get_uid<'a>(&mut self, mut data: impl Iterator<Item = &'a u8>, next_uid: &mut u64) -> u64 {
            match self {
                Node(children) => {
                    let child_reference = &mut children[*data.next().unwrap() as usize];
                    if !child_reference.is_none() {
                        child_reference.as_mut().unwrap().get_uid(data, next_uid)
                    } else {
                        *child_reference = Some(Box::new(UidAssignerTree::new()));
                        let uid = *next_uid;
                        child_reference.as_mut().unwrap().expand(data, next_uid);
                        uid
                    }
                }
                Leaf(uid) => *uid,
            }
        }
    }

    impl UidAssigner {
        pub(crate) fn new() -> Self {
            Self {
                uid_assigner: UidAssignerTree::new(),
                next_id: 0,
            }
        }

        /// Computes the unique identifier of a stream of data.
        ///
        /// The goal of this function is to set a unique identifier to every stream
        /// of data it is given. It does so by storing the stream in a tree-like
        /// structure, which allows efficient search and creation. It is guaranteed
        /// that the same stream will always be associated to the same identifier
        /// and that two different streams will never be associated to the same
        /// identifier, as long as the number of different streams doesn't exceed
        /// $2^64$.
        ///
        /// # Arguments
        ///
        /// * `data` - The stream of data to assign an identifier to. Note that it
        ///   must satisfy two strict conditions:
        ///   - Each stream of data must be of the same length.
        ///   - Each stream of data mustn't contain integers larger than
        ///     `TREE_BRANCHES`.
        ///   If one of these conditions isn't satisfied, this function will panic.
        pub(crate) fn get_uid<'a>(&mut self, data: impl ExactSizeIterator<Item = &'a u8>) -> u64 {
            if data.len() != MATRIX_SIZE {
                panic!(
                    "Expected iterator of size {}, {} given",
                    MATRIX_SIZE,
                    data.len()
                );
            }
            self.uid_assigner.get_uid(data, &mut self.next_id)
        }
    }

    #[cfg(test)]
    mod tests {
        use crate::constants::MATRIX_SIZE;
        use crate::uid_assigner::UidAssigner;

        /// This test ensures that a stream is always associated to the same uid.
        #[test]
        fn consistency_test() {
            let mut uid_assigner = UidAssigner::new();
            let array0: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];

            // Check that asking twice for a uid gives the same uid.
            assert_eq!(
                uid_assigner.get_uid(array0.iter()),
                uid_assigner.get_uid(array0.iter())
            );

            let array1: [u8; MATRIX_SIZE] =
                core::array::from_fn(|i| if i == MATRIX_SIZE - 1 { 0 } else { 1 });
            uid_assigner.get_uid(array1.iter());

            // Check that the previous property remains true even after having
            // inserted another record in the tree.
            assert_eq!(
                uid_assigner.get_uid(array0.iter()),
                uid_assigner.get_uid(array0.iter())
            );
        }

        /// This test ensures that the uid are assigned in a consecutive order
        #[test]
        fn consecutive_id_test() {
            let mut uid_assigner = UidAssigner::new();
            let array0: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];
            let array1: [u8; MATRIX_SIZE] =
                core::array::from_fn(|i| if i == MATRIX_SIZE - 1 { 0 } else { 1 });
            let array2: [u8; MATRIX_SIZE] =
                core::array::from_fn(|i| if i == MATRIX_SIZE - 2 { 0 } else { 1 });
            let array3: [u8; MATRIX_SIZE] =
                core::array::from_fn(|i| if i >= MATRIX_SIZE - 2 { 0 } else { 1 });

            // Check that the uid of three different streams are the first natural
            // numbers
            assert_eq!(uid_assigner.get_uid(array0.iter()), 0);
            assert_eq!(uid_assigner.get_uid(array1.iter()), 1);
            assert_eq!(uid_assigner.get_uid(array2.iter()), 2);

            // Check that the previous property remains true even after asking for
            // an already existent uid.
            assert_eq!(uid_assigner.get_uid(array0.iter()), 0);
            assert_eq!(uid_assigner.get_uid(array3.iter()), 3);
        }

        /// This test ensures that the `get_uid` function panics upon being given an
        /// iterator with a length shorter than `MATRIX_SIZE`.
        #[test]
        #[should_panic(expected = "Expected iterator of size ")]
        fn panic_on_too_short_iterator_test() {
            let mut uid_assigner = UidAssigner::new();
            let array0: [u8; MATRIX_SIZE - 1] = [0; MATRIX_SIZE - 1];
            uid_assigner.get_uid(array0.iter());
        }

        /// This test ensures that the `get_uid` function panics upon being given an
        /// iterator with a length larger than `MATRIX_SIZE`.
        #[test]
        #[should_panic(expected = "Expected iterator of size ")]
        fn panic_on_too_long_iterator_test() {
            let mut uid_assigner = UidAssigner::new();
            let array0: [u8; MATRIX_SIZE + 1] = [0; MATRIX_SIZE + 1];
            uid_assigner.get_uid(array0.iter());
        }
    }
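For completeness, this is roughly how I picture main.rs using the assigner once the search code exists (hypothetical usage, not part of the crate as posted):

    // main.rs (hypothetical usage, replacing the "Hello, world!" stub above)
    mod constants;
    mod uid_assigner;

    use constants::MATRIX_SIZE;
    use uid_assigner::UidAssigner;

    fn main() {
        let mut uid_assigner = UidAssigner::new();

        // A flattened 5x5 binary matrix (all zeros here).
        let state: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];

        // The first query assigns 0; querying the same flattened matrix again
        // returns the same identifier.
        let uid = uid_assigner.get_uid(state.iter());
        assert_eq!(uid, uid_assigner.get_uid(state.iter()));
        println!("uid = {}", uid);
    }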
Comments

  • "If some parts of your matrices are common, I'd use a trie. Are they? ... What do you mean by: several computations while reading the output?" – Commented Dec 15, 2023 at 17:39
  • "I would try docs.rs/fasthash/latest/fasthash/farm/struct.Hash128.html" – Commented Dec 15, 2023 at 17:57
  • "@PeterBlackson I don't expect them to, unfortunately. I'll take a look at FastHash, thanks, I didn't know about it! I was considering xxhash as an alternative for now." – Commented Dec 18, 2023 at 15:36
