(Leetcode) K-similar strings in Python

Question

Strings A and B are K-similar (for some non-negative integer K) if we can swap the positions of two letters in A exactly K times so that the resulting string equals B.
Given two anagrams A and B, return the smallest K for which A and B are K-similar.
Note -
1 <= A.length == B.length <= 20
A and B contain only lowercase letters from the set {'a', 'b', 'c', 'd', 'e', 'f'}

Here is my solution to this challenge -

import collections class Solution: def find_first_mismatch(self, A, B): for i in range(len(A)): if A[i] != B[i]: return i def find_candidates(self, A, B, mismatch): for i in range(mismatch + 1, len(A)): if A[i] != B[i] and A[i] == B[mismatch]: yield Solution.swap(A, mismatch, i) def swap(A, i, j): chars = list(A) chars[i], chars[j] = chars[j], chars[i] return "".join(chars) def k_similarity(self, A, B): """ :type A: str :type B: str :rtype: int """ q = collections.deque([A]) distance = 0 seen = set([A]) while q: for _ in range(len(q)): current = q.pop() if current == B: return distance mismatch = self.find_first_mismatch(current, B) for candidate in self.find_candidates(current, B, mismatch): if candidate not in seen: seen.add(candidate) q.appendleft(candidate) distance += 1

Here is my idea of how to solve the challenge -

The problem can be modeled as a graph problem of finding the shortest path between A and B where intermediate nodes are anagrams that can be formed by swapping 2 characters in a node having an edge to it. In other words, there is an edge between node X to node Y if Y can be formed by swapping 2 characters in X and vice-versa.

Here are some example outputs -

Example 1 -

Input: A = "ab", B = "ba" Output: 1

Example 2 -

Input: A = "abc", B = "bca" Output: 2

Example 3 -

Input: A = "abac", B = "baca" Output: 2

Example 4 -

Input: A = "aabc", B = "abca" Output: 2

Here is my Leetcode result (54 test cases) -

So, I would like to know whether I could make this program shorter and more efficient.

Any help would be highly appreciated.

SylvainD · Accepted Answer · 2019-05-31 23:52:45Z

Adding a few tests

Before performing any chance, it can be a good idea to add a few tests. This is especially relevant if we plan to perform optimisations as we'll be able to measure times. In that, case, it is also great when we are able to generate arbitrarily large inputs to infer the complexity of the solution (or at least be able to compare 2 solutions).

Here is what I wrote:

import time start = time.time() for i in range(2): # Examples provided assert Solution().kSimilarity("ab", "ba") == 1 assert Solution().kSimilarity("abc", "bca") == 2 assert Solution().kSimilarity("abac", "baca") == 2 assert Solution().kSimilarity("aabc", "abca") == 2 # Arbitrary examples assert Solution().kSimilarity("abcdefg", "gabdcef") == 5 assert Solution().kSimilarity("abcbcdefg", "gcbabdcef") == 6 # Big examples inspired from smaller ones n = 10 assert Solution().kSimilarity("ab" * n, "ba" * n) == 1 * n assert Solution().kSimilarity("abc" * n, "bca" * n) == 2 * n assert Solution().kSimilarity("abac" * n, "baca" * n) == 2 * n assert Solution().kSimilarity("aabc" * n, "abca" * n) == 2 * n assert Solution().kSimilarity("abcdefg" * n, "gabdcef" * n) == 5 * n print(time.time() - start)

Playing with the different values of n below, we can see that the Solution becomes very slow very quickly (I stopped at n = 4).

The timing suggest an exponential complexity.

Maybe there's something we can do about it with a different algorithm. We'll see this at the end but in the meantime, let's try to improve the already existing code.

Code review

The code looks good and uses great ideas that I've seen in other places.

A few details can be improved.

I'd call find_first_mismatch directly from find_candidates.

Actually, you could also merge the 2 functions to have a single loop and avoid messing with indices.

With a simple rewriting we get something equivalent which could be considered less elegant but is slightly faster:

def find_candidates(self, A, B): mismatch = None for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: yield Solution.swap(A, mismatch, i)

Another optimisation could be to avoid converting A to list many times by inlining the code for swap. This is not great as far as code organisation goes but...

def find_candidates(self, A, B): mismatch = None chars = list(A) for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: # Swap, yield and revert swap chars[i], chars[mismatch] = chars[mismatch], chars[i] yield "".join(chars) chars[i], chars[mismatch] = chars[mismatch], chars[i]

Renaming and reusing known values, we'd get:

def find_candidates(self, A, B): mismatch = None A_lst = list(A) for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: # Swap, yield and revert swap c = A_lst[mismatch] A_lst[i], A_lst[mismatch] = c, a yield "".join(A_lst) A_lst[i], A_lst[mismatch] = a, c

Instead of having q via different ways (len, pop, appendleft), we could just use list and iterate over it while filling another list that will be used at next iteration. We'd get something like:

 q = [A] distance = 0 seen = set([A]) while q: new_q = [] for current in q: if current == B: # print("Return:", distance) return distance for candidate in self.find_candidates(current, B): if candidate not in seen: seen.add(candidate) new_q.append(candidate) q = new_q distance += 1

Instead of maintaining distance at each iteration, you could use itertools.count.

At this stage, the code looks like:

 def find_candidates(self, A, B): mismatch = None A_lst = list(A) for i, (a, b) in enumerate(zip(A, B)): if a != b: if mismatch is None: mismatch = i elif a == B[mismatch]: # Swap, yield and revert swap c = A_lst[mismatch] A_lst[i], A_lst[mismatch] = c, a yield "".join(A_lst) A_lst[i], A_lst[mismatch] = a, c def kSimilarity(self, A, B): """ :type A: str :type B: str :rtype: int """ # print("kSimilarity:", A, B) q = [A] seen = set([A]) for distance in itertools.count(): new_q = [] for current in q: if current == B: return distance for candidate in self.find_candidates(current, B): if candidate not in seen: seen.add(candidate) new_q.append(candidate) q = new_q

Different algorithm

Generating so many intermediate strings is expensive and not so efficient.

An observation is that whenever we have elements so that: x1 should become x2, x2 should become x3, ..., xn should become x1, there is a trivial manipulation involving n - 1 swaps which put n elements in their correct position. It is easy to see that things can't get more efficient that this.

Now, the initial problem can be rewritten in a different problem where we want to find such cycles in our original strings.

This can be done by reorganising data from our strings into a graph and writing the proper loop to visit nodes to find cycles.

class Solution: def kSimilarity(self, A: str, B: str) -> int: # Building a graph where c1 -> c2 means we want to change c1 for c2 # The problem becomes a graph problem: finding cycles changes = dict() for i, (c1, c2) in enumerate(zip(A, B)): if c1 != c2: changes.setdefault(c1, []).append(c2) ret = 0 while changes: # Find cycle visited = [] node = next(iter(changes)) while node not in visited: visited.append(node) node = changes[node][0] # Cycle is found - starting from node beg = visited.index(node) loop = visited[beg:] # print("Loop:", loop) ret += len(loop) - 1 # Remove corresponding elements for c1, c2 in zip(loop, loop[1:] + [loop[0]]): l = changes[c1] l.remove(c2) if not l: del changes[c1] # print("Return:", ret) return ret

It seems like this behaves in a linear time or something that look similar enough. I was able to run the tests above with n = 3000 in very small time.

Note: things are not as good as expected. Trying to run the code on the leetcode tool, a test case is not handled properly

assert Solution().kSimilarity("aabbccddee", "cdacbeebad") == 6 # Returns 7

I guess I based the algorithm on a wrong assumption. It seems like the cycles can't be chosen randomly as I assumed.

For instance, in that case, we should choose:

[bc] + [de] + [bad] + [cea] (with cost 6)

over combinations like

[acb] + [ade] + [cebd] (with cost 7) or
[acedb] + [ade] + [bc] (with cost 7)

At the end of the day, we want to maximise the number of cycles: the final cost will be number of characters in the wrong space - number of cycles.

I have no clue how to handle this...

Stack Exchange Network

(Leetcode) K-similar strings in Python

1 Answer 1

Linked

Hot Network Questions

(Leetcode) K-similar strings in Python

1 Answer 1

Linked

Related

Hot Network Questions