Adding a few tests
Before performing any chance, it can be a good idea to add a few tests. This is especially relevant if we plan to perform optimisations as we'll be able to measure times. In that, case, it is also great when we are able to generate arbitrarily large inputs to infer the complexity of the solution (or at least be able to compare 2 solutions).
Here is what I wrote:
import time start = time.time() for i in range(2): # Examples provided assert Solution().kSimilarity("ab", "ba") == 1 assert Solution().kSimilarity("abc", "bca") == 2 assert Solution().kSimilarity("abac", "baca") == 2 assert Solution().kSimilarity("aabc", "abca") == 2 # Arbitrary examples assert Solution().kSimilarity("abcdefg", "gabdcef") == 5 assert Solution().kSimilarity("abcbcdefg", "gcbabdcef") == 6 # Big examples inspired from smaller ones n = 10 assert Solution().kSimilarity("ab" * n, "ba" * n) == 1 * n assert Solution().kSimilarity("abc" * n, "bca" * n) == 2 * n assert Solution().kSimilarity("abac" * n, "baca" * n) == 2 * n assert Solution().kSimilarity("aabc" * n, "abca" * n) == 2 * n assert Solution().kSimilarity("abcdefg" * n, "gabdcef" * n) == 5 * n print(time.time() - start)
Playing with the different values of n
below, we can see that the Solution becomes very slow very quickly (I stopped at n = 4
).
The timing suggest an exponential complexity.
Maybe there's something we can do about it with a different algorithm. We'll see this at the end but in the meantime, let's try to improve the already existing code.
Code review
The code looks good and uses great ideas that I've seen in other places.
A few details can be improved.
I'd call find_first_mismatch
directly from find_candidates
.
Actually, you could also merge the 2 functions to have a single loop and avoid messing with indices.
With a simple rewriting we get something equivalent which could be considered less elegant but is slightly faster:
def find_candidates(self, A, B): mismatch = None for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: yield Solution.swap(A, mismatch, i)
Another optimisation could be to avoid converting A to list many times by inlining the code for swap
. This is not great as far as code organisation goes but...
def find_candidates(self, A, B): mismatch = None chars = list(A) for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: # Swap, yield and revert swap chars[i], chars[mismatch] = chars[mismatch], chars[i] yield "".join(chars) chars[i], chars[mismatch] = chars[mismatch], chars[i]
Renaming and reusing known values, we'd get:
def find_candidates(self, A, B): mismatch = None A_lst = list(A) for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: # Swap, yield and revert swap c = A_lst[mismatch] A_lst[i], A_lst[mismatch] = c, a yield "".join(A_lst) A_lst[i], A_lst[mismatch] = a, c
Instead of having q
via different ways (len, pop, appendleft), we could just use list and iterate over it while filling another list that will be used at next iteration. We'd get something like:
q = [A] distance = 0 seen = set([A]) while q: new_q = [] for current in q: if current == B: # print("Return:", distance) return distance for candidate in self.find_candidates(current, B): if candidate not in seen: seen.add(candidate) new_q.append(candidate) q = new_q distance += 1
Instead of maintaining distance at each iteration, you could use itertools.count
.
At this stage, the code looks like:
def find_candidates(self, A, B): mismatch = None A_lst = list(A) for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: # Swap, yield and revert swap c = A_lst[mismatch] A_lst[i], A_lst[mismatch] = c, a yield "".join(A_lst) A_lst[i], A_lst[mismatch] = a, c def kSimilarity(self, A, B): """ :type A: str :type B: str :rtype: int """ # print("kSimilarity:", A, B) q = [A] seen = set([A]) for distance in itertools.count(): new_q = [] for current in q: if current == B: return distance for candidate in self.find_candidates(current, B): if candidate not in seen: seen.add(candidate) new_q.append(candidate) q = new_q
Different algorithm
Generating so many intermediate strings is expensive and not so efficient.
An observation is that whenever we have elements so that: x1 should become x2, x2 should become x3, ..., xn should become x1, there is a trivial manipulation involving n - 1
swaps which put n
elements in their correct position. It is easy to see that things can't get more efficient that this.
Now, the initial problem can be rewritten in a different problem where we want to find such cycles in our original strings.
This can be done by reorganising data from our strings into a graph and writing the proper loop to visit nodes to find cycles.
class Solution: def kSimilarity(self, A: str, B: str) -> int: # Building a graph where c1 -> c2 means we want to change c1 for c2 # The problem becomes a graph problem: finding cycles changes = dict() for i, (c1, c2) in enumerate(zip(A, B)): if c1 != c2: changes.setdefault(c1, []).append(c2) ret = 0 while changes: # Find cycle visited = [] node = next(iter(changes)) while node not in visited: visited.append(node) node = changes[node][0] # Cycle is found - starting from node beg = visited.index(node) loop = visited[beg:] # print("Loop:", loop) ret += len(loop) - 1 # Remove corresponding elements for c1, c2 in zip(loop, loop[1:] + [loop[0]]): l = changes[c1] l.remove(c2) if not l: del changes[c1] # print("Return:", ret) return ret
It seems like this behaves in a linear time or something that look similar enough. I was able to run the tests above with n = 3000
in very small time.
Note: things are not as good as expected. Trying to run the code on the leetcode tool, a test case is not handled properly
assert Solution().kSimilarity("aabbccddee", "cdacbeebad") == 6 # Returns 7
I guess I based the algorithm on a wrong assumption. It seems like the cycles can't be chosen randomly as I assumed.
For instance, in that case, we should choose:
- [bc] + [de] + [bad] + [cea] (with cost 6)
over combinations like
At the end of the day, we want to maximise the number of cycles: the final cost will be number of characters in the wrong space - number of cycles
.
I have no clue how to handle this...