My attempt at Dijkstra's Algorithm in Python 3

Question

I am practicing my coding at HackerRank. The exact problem statement can be found here, but to summarize, it is finding the distances of the shortest path from a starting node to every other node in a non-directional graph where every edge is weighted 6. If a node is unreachable, the distance is -1. The nodes are named 1 to n. If you want an example, please see the provided one in the link.

To solve this, I googled an explanation of Dijkstra's Algorithm and tried my best to implement it (I am new to graph problems). My program passes 6/7 of the test cases, but it times out on the 7th. I improved some small inefficiencies, but it wasn't enough. Please advise me on how I may make the following faster:

def shortest (start, nodes, adj): unique = [] #this part is pretty standard for this Algorithm I think distance = [float('inf')]*nodes for i in range(1, nodes+1): if (len(adj[i-1])>0): unique.append(i) distance [start-1] = 0 while (len(unique)>0): minDist = float('inf') for i in unique: if (distance[i-1] <= minDist): minDist = distance[i-1] distIndex = i current = distIndex #sets the current node as the closest unvisited node unique.remove(current) for i in adj[current-1]: #updates the distance of each neighbour of the current node neigh = i temp = 6 + distance[current-1] #edges are 6 if temp < distance[neigh-1]: distance[neigh-1] = temp return distance q = int(input()) #each test case contains multiple problems, so just consider one iteration for i in range(q): a = [] #for input purposes b = [] #for output purposes node = 0 edge = 0 #start of input node, edge = list(map(int, input().strip().split())) #the input begins with the number of nodes and edges in the graph for j in range(edge): a.append(list(map(int, input().strip().split()))) #reading the edges, which are all inputted next starting = int(input()) #reading the starting node #end of input adj = [[] for x in range(node)] #the adjacency matrix for j in a: #making the adjacency matrix from the inputted edges adj[j[0]-1].append(j[1]) adj[j[1]-1].append(j[0]) for j in adj: j = list(set(j)) #someone warned that the problematic test case gives the same edge many times, so this is to remove duplicates b = shortest(starting, node, adj) #formatting the output to what the question required b.remove(0) for i in range(len(b)): if b[i] == float('inf'): b[i] = -1 print (" ".join(str(e) for e in b))

If it helps, I 'purchased' the test case which is timing out.

Make sure you copied the code correctly. I get NameError: name 'node' is not defined. — Veedrac, CommentedAug 17, 2016 at 21:10
1. Sorry, I did it in the provided terminal at HackerRank and it compiled there. It seems that the node, edge line is not valid syntax elsewhere. When I change it to read to a list first and then assign node and edge, the next error is for mixing spaces and tabs (it didn't matter in the HackerRank terminal again, sorry). I have to go right now, but I will try to clean it as soon as I get back. 2. The first link to HackerRank has a Sample Input, Sample Output, and Explanation on the page. The second link is just the input of the problematic test case. — Yifan, CommentedAug 17, 2016 at 21:31
Never mind, I'll just be a little late. I needed to declare node and edge first (not what I originally thought), and I used a text editor to change the indentation to be consistent with spaces. Please try again. — Yifan, CommentedAug 17, 2016 at 21:40
Now it breaks because adj is length 0 on the first iteration. — Veedrac, CommentedAug 17, 2016 at 22:04
You probably just needed to swap the adj = and the node, edge = lines. — Veedrac, CommentedAug 17, 2016 at 22:06

Veedrac · Accepted Answer · 2016-08-18 18:05:33Z

Those comments are very difficult to read.

j = list(set(j)) #someone warned that the problematic test case gives the same edge many times, so this is to remove duplicates

Format this properly:

# The problematic test case gives the same edge many times, # so this is to remove duplicates j = list(set(j))

Then remove fluff:

# Remove duplicates. Helps some test cases. j = list(set(j))

In this case just remove the first comment, because list(set(...)) is already known to mean deduplicate. And remove the second since it's implied by the fact you've written the code.

j = list(set(j))

Another pair is

a = [] # for input purposes b = [] # for output purposes

Don't do this. Just call them

descriptive_name_1 = [] descriptive_name_2 = []

so you don't need the comment in the first place.

Then there's q = .... So I wrack my brain thinking about what word contains q. But it turns out you just meant to write num_test_cases. What does that have to do with the letter q?

Worse than that is when you outright lie. What would you expect the variable node to contain? A node, right? Nope - it contains the total number of nodes.

And you write

node = 0

so we expect the total number of nodes to be 0 at that point, right? Nope, you immediately change your mind and write

node, edge = list(map(int, input().strip().split()))

But OK, fine. At least we know what nodes contains, then: multiple nodes, so a list (or other collection) of lengths of nodes.

Nope, it actually means exactly the same thing as node.

A similar comment goes for edge, which should be num_edges.

Then there's adj, which you label "adjacency matrix". Why not just call it adjacency_matrix then? But it's not even a dense matrix like that would imply, it's actually a list of per-node adjacencies, so call it adjacencies.

Then there's starting and start, which are the same thing named differently. Call them start_node or at least be consistent.

Going back to b, you have

b = []

but again this is a lie, because actually

b = shortest(start_node, num_nodes, adjacencies)

So just write that.

distances = shortest(start_node, num_nodes, adjacencies)

For a, initialize it as close to usage as possible and call it edges.

If you write

for j in range(num_edges):

you're misleading me into thinking you're going to use the index. When you're not going to, write

for _ in range(num_edges):

And what's up with that, followed by

for j in edges:

followed by

for j in adjacencies:

Is j just your general "this is a thing" variable? Most people use x, elem and item for that. Instead, write

for _ in range(num_edges): for edge in edges: for adjacent in adjacencies:

The middle can be

for start, end in edges:

The comment

#edges are 6

Isn't coherent. Try

# Edges are length 6

Instead, though, don't put this information in shortest - use 1 and multiply out when using the results.

Don't call variables temp. It's a poor name.

You've double-indented the part where you update the neighbours.

Note that

num_nodes, num_edges = list(map(int, input().strip().split()))

is just

num_nodes, num_edges = map(int, input().strip().split())

Why do you do this?

for i in adjacencies[current-1]: neigh = i

Just write

for neighbour in adjacencies[current-1]:

And the content can be

for neighbour in adjacencies[current-1]: distance[neighbour-1] = min(distance[neighbour-1], distance[current-1] + 1)

Do you not think this whole -1 thing is getting a bit silly? When reading, just decrement the edge values and starting edge correctly:

edges = [] for _ in range(num_edges): start, end = map(int, input().strip().split()) edges.append((start - 1, end - 1)) start_node = int(input()) - 1

while loops and if statements don't need their condition parenthesized, and operators should be spaced properly:

while (len(unique)>0): # before while len(unique) > 0: # after

Then this is better as just while unique.

Don't put spaces before the parentheses in function calls or indexing.

minDist should be min_dist. You can also start with it set to num_nodes, and the same for distance's initialization.

unique is a horrible name, and has nothing to do with the variable's purpose. Try unvisited instead. It can be initialized as

unvisited = [i for i, adj in enumerate(adjacencies) if adj]

distance, unsurprisingly, does not mean distance but instead distances. Fix that.

This code:

min_dist = num_nodes for i in unvisited: if distances[i] <= min_dist: min_dist = distances[i] distIndex = i current = distIndex

is just

min_dist = num_nodes for i in unvisited: if distances[i] <= min_dist: min_dist = distances[i] current = i

and current should be called closest. It is simplifiable to

closest = min(unvisited, key=lambda x: distances[x])

This is faster if you write

closest = min(unvisited, key=distances.__getitem__)

but the difference isn't vital.

One last thing with shortest - rename it to shortest_distances.

Put the rest of the code in a main function.

If you write

def read_pair(decrement): x, _, y = input().partition(" ") return int(x) - decrement, int(y) - decrement

then you can initialize num_nodes, num_edges and edges with

num_nodes, num_edges = read_pair(0) edges = [read_pair(1) for i in range(num_edges)]

Note my use of partition instead of split as IMO it's a better description of the operation here.

Then

for adjacent in adjacencies: adjacent = list(set(adjacent))

doesn't actually work! adjacent = only affects the local scope! Instead, you want

adjacencies = [list(set(adj)) for adj in adjacencies]

which is even better as

adjacencies = [set() for _ in range(num_nodes)] for start, end in edges: adjacencies[start].add(end) adjacencies[end].add(start)

as you never have the large intermediates then, and there's no real point converting back from a set.

This stuff:

distances.remove(0) for i in range(len(distances)): if distances[i] == num_nodes: distances[i] = -1 else: distances[i] *= 6 print(" ".join(str(e) for e in distances))

is just

print(*(-1 if i == num_nodes else i * 6 for i in distances if i))

Nice, that seems to have made it fast enough, but more important the code is more readable.

We can improve speed further by avoiding input, using sys.stdin directly to get buffering.

def main(lines): def read_pair(decrement): x, _, y = next(lines).partition(" ") return int(x) - decrement, int(y) - decrement ... main(sys.stdin)

Note that this change is sufficient on its own to beat the time limit: you can apply it to your original code to pass the test. Note also that you shouldn't mix input with next; at least on Python 2 this throws an error saying

ValueError: Mixing iteration and read methods would lose data

Though this error is gone in Python 3, that doesn't leave me feeling good about it.

I'd finish off with

from __future__ import print_function

to make it Python 2 compatible.

from __future__ import print_function import sys def shortest_distances(start_node, num_nodes, adjacencies): distances = [num_nodes] * num_nodes unvisited = [i for i, adj in enumerate(adjacencies) if adj] distances[start_node] = 0 while unvisited: closest = min(unvisited, key=distances.__getitem__) unvisited.remove(closest) # Update the distances of each neighbour for neighbour in adjacencies[closest]: distances[neighbour] = min(distances[neighbour], distances[closest] + 1) return distances def main(lines): def read_pair(decrement): x, _, y = next(lines).partition(" ") return int(x) - decrement, int(y) - decrement num_test_cases = int(next(lines)) for i in range(num_test_cases): num_nodes, num_edges = read_pair(0) edges = [read_pair(1) for i in range(num_edges)] start_node = int(next(lines)) - 1 adjacencies = [set() for _ in range(num_nodes)] for start, end in edges: adjacencies[start].add(end) adjacencies[end].add(start) distances = shortest_distances(start_node, num_nodes, adjacencies) print(*(-1 if i == num_nodes else i * 6 for i in distances if i)) main(sys.stdin)

Wow, thank you very much. This is very thorough and helpful. — Yifan, CommentedAug 18, 2016 at 13:51

Community · Accepted Answer · 2017-04-13 12:40:41Z

Veedrac's answer has a lot of good points about the quality of the code, so in this answer I'm just going to discuss the performance.

Let's figure out where the time goes by profiling the program using Python's built-in profiler:

$ python -m cProfile -s tottime cr138989.py < testcase.txt ... 1492053 function calls in 3.915 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 1.767 1.767 3.915 3.915 cr138989.py:1(<module>) 246223 1.346 0.000 1.348 0.000 {built-in method builtins.input} 6 0.343 0.057 0.364 0.061 cr138989.py:1(shortest) 246216 0.221 0.000 0.221 0.000 {method 'split' of 'str' objects} 741011 0.106 0.000 0.106 0.000 {method 'append' of 'list' objects} 246216 0.062 0.000 0.062 0.000 {method 'strip' of 'str' objects} 6 0.043 0.007 0.043 0.007 {built-in method builtins.print} 2387 0.020 0.000 0.020 0.000 {method 'remove' of 'list' objects} 3403 0.002 0.000 0.002 0.000 cr138989.py:53(<genexpr>) 6 0.001 0.000 0.001 0.000 cr138989.py:39(<listcomp>) 258 0.001 0.000 0.001 0.000 {built-in method _codecs.ascii_decode} 6 0.001 0.000 0.002 0.000 {method 'join' of 'str' objects} 5796 0.001 0.000 0.001 0.000 {built-in method builtins.len} 258 0.001 0.000 0.002 0.000 ascii.py:25(decode) 258 0.000 0.000 0.000 0.000 codecs.py:280(getstate) 1 0.000 0.000 3.915 3.915 {built-in method builtins.exec} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

It takes some practice to interpret the results, but basically this is telling you that the total runtime was 3.9 seconds, of which 1.7 seconds was spent in the code at the top level of the module, 1.3 seconds was spent in input, 0.3 in shortest, and a small amount in other functions.

So the obvious performance problem here is input. If you read the documentation, you'll see that this function has several features that you don't use (prompting, interactivity, newline stripping, EOF handling), and all this can easily be avoided by replacing input with sys.stdin.readline:

import sys input = sys.stdin.readline

Now the profile results look like this:

 ncalls tottime percall cumtime percall filename:lineno(function) 1 1.500 1.500 2.387 2.387 cr138989.py:1(<module>) 6 0.357 0.059 0.378 0.063 cr138989.py:1(shortest) 246216 0.151 0.000 0.151 0.000 {method 'split' of 'str' objects} 246223 0.140 0.000 0.142 0.000 {method 'readline' of '_io.TextIOWrapper' objects} 741011 0.103 0.000 0.103 0.000 {method 'append' of 'list' objects} 246216 0.058 0.000 0.058 0.000 {method 'strip' of 'str' objects} 6 0.052 0.009 0.052 0.009 {built-in method builtins.print} 2387 0.020 0.000 0.020 0.000 {method 'remove' of 'list' objects}

This is already fast enough to pass the HackerRank time limit. But you can save another 0.6 seconds or so by making these changes:

Drop the calls to strip: there's no need for them, because split with no arguments has the same effect.
There's no need to construct the list a of edges: it's simpler just to construct the adjacency matrix directly from the input.
There's no need to deduplicate the edges (duplicate edges don't affect the correctness of Dikjstra's algorithm), but if you are going to do it, it's simpler to make adj into a list of sets, and call the add method instead of append so that the deduplication happens as you go along.

Stack Exchange Network

My attempt at Dijkstra's Algorithm in Python 3

2 Answers 2

Hot Network Questions

My attempt at Dijkstra's Algorithm in Python 3

2 Answers 2

Related

Hot Network Questions