
Context

I'm trying to build a social-consensus simulation involving two intelligent agents. The simulation involves a graph/network of nodes. Nearly all of these nodes (> 90%) will be green agents. The remaining nodes will consist of one red agent, one blue agent, and some number of grey agents. The red agent and blue agent will be the only intelligent agents in the simulation, and their relationship is adversarial.

I'm representing the node data structure with the following Node class:

class Node:
    def __init__(self, id, team):
        self.id = id
        self.team = team
        self.agent = None
        self.edges = []
        if self.team == "green":
            self.agent = agents.GreenAgent()
        elif self.team == "blue":
            self.agent = agents.BlueAgent()
        elif self.team == "red":
            self.agent = agents.RedAgent()

The network is instantiated with green agents only. At instantiation, edges are added between nodes based on some probability; these edges represent the connections between green agents (i.e. nodes containing green agents) and allow them to interact. The edges are undirected, and there are no loops (nodes do not have edges to themselves).

I'm representing the graph/network data structure with the following Network class:

import random
from itertools import combinations

class Network:
    def __init__(self, number_of_green_nodes, probability_of_an_edge):
        self.number_of_green_nodes = number_of_green_nodes
        self.number_of_nodes_overall = number_of_green_nodes
        self.probability_of_an_edge = probability_of_an_edge
        self.green_network = [Node(id=i, team="green") for i in range(1, number_of_green_nodes + 1)]
        self.total_number_of_grey_nodes_inserted = 0
        # Add an undirected edge between each pair of green nodes with the given probability.
        all_pairs_of_nodes = combinations(self.green_network, 2)
        for node1, node2 in all_pairs_of_nodes:
            if random.random() < self.probability_of_an_edge:
                node1.create_edge(node2)

The green agents represent ordinary people. They have two attributes: voting, which can be true or false, and uncertainty, a number representing how uncertain the individual is about whether they are voting (say, -1 to +1, or 0 to 1). The green agents are not intelligent; in each round of the simulation, when it is their turn, they simply interact generically with each other. On the green agents' turn, every pair of green agents connected by an edge interacts, so there is no selectivity. They do not, however, initiate interaction with the red, blue, or grey agents. When green agents interact with each other, their opinion on whether or not to vote, and their uncertainty, change based on some calculation.

My green agent class is represented partially as follows:

import random
import scipy.stats

class GreenAgent:
    def __init__(self):
        self.voting = random.random() < 0.5
        self.uncertainty = scipy.stats.truncnorm.rvs(0, 1)

    def interact_with_green_agent(self, other_agent_opinion, other_agent_uncertainty):
        if other_agent_opinion != self.voting:
            # The other agent disagrees and is more certain: adopt its opinion,
            # but become more uncertain about the newly adopted opinion.
            if other_agent_uncertainty < self.uncertainty:
                self.voting = not self.voting
                self.uncertainty = self.uncertainty + (1 - other_agent_uncertainty / self.uncertainty) * (1 - self.uncertainty)
        else:
            # The other agent agrees and is more certain: become more certain.
            if other_agent_uncertainty < self.uncertainty:
                self.uncertainty = self.uncertainty - (1 - other_agent_uncertainty / self.uncertainty) * (self.uncertainty - other_agent_uncertainty)

The red agent and blue agent are added to the network after it has been initialised with the green agents and their edges. The red agent and blue agent do not have an edge to each other (they do not interact with each other), but each has an edge to every single green node; unlike the edges between green nodes, these are not added based on a probability. So the red and blue agents are initially able to interact with all green agents.
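A rough sketch of how this attachment can be done, reusing the Node class and create_edge from above (the method name add_red_and_blue_nodes is illustrative rather than my exact code):

def add_red_and_blue_nodes(self):
    # Illustrative sketch on the Network class: attach the red and blue nodes after the
    # green network has been built, with an edge to every green node and no edge between
    # the red and blue nodes themselves.
    self.red_node = Node(id=self.number_of_nodes_overall + 1, team="red")
    self.blue_node = Node(id=self.number_of_nodes_overall + 2, team="blue")
    self.number_of_nodes_overall += 2
    for green_node in self.green_network:
        self.red_node.create_edge(green_node)
        self.blue_node.create_edge(green_node)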

The red agent and blue agent are adversaries. The red agent's goal is to convince a majority of the green agents not to vote, and to have these green agents be more certain of not voting than uncertain (based on the green agents' uncertainty attribute). The blue agent's goal, on the other hand, is to convince a majority of the green agents to vote, and to have these green agents be more certain of voting than uncertain (again, based on the uncertainty attribute). The red and blue agents cannot have their "opinion" on voting changed, so the red agent's voting attribute is always false, and the blue agent's is always true.

When it is the red agent's turn in the simulation, the red agent can, initially, interact with all of the green agents. Just as with green-green interaction, the red agent interacts with every green agent it has an edge to, so there is no selectivity. When it interacts with the green agents, it can select from 5 levels of propaganda, with each level becoming increasingly radical/extreme. The trade-off is that the more radical/extreme the propaganda, the greater the red agent's uncertainty level when disseminating it, and the more likely it is to permanently alienate green agents (which I represent as the removal of an edge).

My red agent class is represented as follows:

class RedAgent:
    def __init__(self):
        self.voting = False
        self.uncertainty = None

    def messaging(self, potency):
        if potency == 1:
            self.uncertainty = random.uniform(0.0, 0.2)
        elif potency == 2:
            self.uncertainty = random.uniform(0.2, 0.4)
        elif potency == 3:
            self.uncertainty = random.uniform(0.4, 0.6)
        elif potency == 4:
            self.uncertainty = random.uniform(0.6, 0.8)
        elif potency == 5:
            self.uncertainty = random.uniform(0.8, 1.0)

And the part of my green agent class that deals with the red agent is as follows:

def interact_with_red_agent(self, other_agent_uncertainty, this_green_agent_node, red_agent_node):
    # In the original version, the five potency bands (0-0.2, 0.2-0.4, ..., 0.8-1.0) each
    # repeated exactly the same logic, so they are merged into a single body here; this
    # also covers the exact band boundaries (0.2, 0.4, ...), which previously fell
    # through every branch.
    if self.voting == False and other_agent_uncertainty < self.uncertainty:
        # Already not voting and the red agent is more certain: reinforce, i.e. become more certain.
        self.uncertainty = self.uncertainty - (1 - other_agent_uncertainty / self.uncertainty) * (self.uncertainty - other_agent_uncertainty)
    elif self.voting == True and other_agent_uncertainty < self.uncertainty:
        # Currently voting but the red agent is more certain: flip to not voting,
        # and become more uncertain about the new opinion.
        self.voting = False
        self.uncertainty = self.uncertainty + (1 - other_agent_uncertainty / self.uncertainty) * (self.uncertainty - other_agent_uncertainty)
    # The more potent (and therefore more uncertain) the red message, the more likely
    # the green agent is to cut its connection to the red agent.
    if random.random() < other_agent_uncertainty / 2:
        this_green_agent_node.remove_edge(red_agent_node)

When it is the blue agent's turn in the simulation, the blue agent can interact with all of the green agents/nodes. And, again, the blue agent interacts with all green agents/nodes that it has an edge/connection to – so, again, there's no selectivity. Similar to the red agent, the blue agent can select from 5 levels of potency of messaging. The trade-off, however, is that the blue agent has an "energy level", and the more potent the message the blue agent chooses during its turn, the more likely that it loses energy. And, if the blue agent loses all of its energy, then it loses the simulation/game.

Furthermore, another option that the blue agent has during its turn is to insert a grey agent into the network. The grey agent, based on some probability, can either be an ally of the blue agent, helping to convince the green agents to vote, or can actually work against the blue agent, working for the red agent in further radicalising the green agents and convincing them not to vote. Depending on whether it is on the side of the blue agent or the red agent, the grey agent's interaction with the green agents can either mimic the red agent’s interaction abilities or the blue agent’s interaction abilities. However, the difference is that the grey agent does not suffer any of the consequences of taking action that the red/blue agent does during interaction: if the grey agent is an ally of the blue agent, then it can do the same move as the blue agent, without the blue agent losing any energy, but if the grey agent ends up working for the red agent, it can do the same move as the red agent, without the red agent having the chance of alienating any green agents. The choice by the blue agent to insert a grey agent into the network, instead of disseminating messaging, will take up the turn of the blue agent, regardless of whether the grey agent proves to work for the blue agent or against it.

My blue agent class is as follows:

class BlueAgent:
    def __init__(self):
        self.voting = True
        self.uncertainty = None
        self.energy_level = 10

    def messaging(self, potency):
        # Higher potency corresponds to lower uncertainty for the blue agent.
        if potency == 1:
            self.uncertainty = random.uniform(0.8, 1.0)
        elif potency == 2:
            self.uncertainty = random.uniform(0.6, 0.8)
        elif potency == 3:
            self.uncertainty = random.uniform(0.4, 0.6)
        elif potency == 4:
            self.uncertainty = random.uniform(0.2, 0.4)
        elif potency == 5:
            self.uncertainty = random.uniform(0.0, 0.2)
        # This check was identical inside every branch above, so it is applied once here:
        # the more potent the message (the lower the uncertainty), the more likely the
        # blue agent is to lose a unit of energy.
        if random.random() > self.uncertainty:
            self.energy_level -= 1

And the part of my green agent class that deals with the blue agent is as follows:

def interact_with_blue_agent(self, other_agent_uncertainty):
    if other_agent_uncertainty < self.uncertainty:
        if self.voting != True:
            self.voting = True
        self.uncertainty = self.uncertainty - (1 - other_agent_uncertainty / self.uncertainty) * (self.uncertainty - other_agent_uncertainty)

My grey agent class is as follows:

class GreyAgent:
    def __init__(self):
        # With probability 0.5 the grey agent is a "spy" working for the red agent.
        self.spy = random.random() < 0.5
        self.uncertainty = None

    def lifeline(self, potency):
        # Acting as a blue ally: same potency-to-uncertainty mapping as the blue agent.
        if potency == 1:
            self.uncertainty = random.uniform(0.8, 1.0)
        elif potency == 2:
            self.uncertainty = random.uniform(0.6, 0.8)
        elif potency == 3:
            self.uncertainty = random.uniform(0.4, 0.6)
        elif potency == 4:
            self.uncertainty = random.uniform(0.2, 0.4)
        elif potency == 5:
            self.uncertainty = random.uniform(0.0, 0.2)

    def misinformation(self, potency):
        # Acting as a red spy: same potency-to-uncertainty mapping as the red agent.
        if potency == 1:
            self.uncertainty = random.uniform(0.0, 0.2)
        elif potency == 2:
            self.uncertainty = random.uniform(0.2, 0.4)
        elif potency == 3:
            self.uncertainty = random.uniform(0.4, 0.6)
        elif potency == 4:
            self.uncertainty = random.uniform(0.6, 0.8)
        elif potency == 5:
            self.uncertainty = random.uniform(0.8, 1.0)

The red and blue agents have access to the number of green agents that are voting / not voting, but they do not have access to the green agents' uncertainty levels. Furthermore, the red and blue agents have access to what action the other (red/blue) agent took during their turn. And both agents also have access to the action the grey agent took.


Question

I am now trying to make the red agent and blue agent intelligent. My thought was to use reinforcement learning, whereby each agent learns from previous rounds of the simulation. In particular, I'm currently looking at applying Q-Learning. The problem is that I'm not familiar with reinforcement learning, and I'm trying to learn it by reading tutorials and going through the textbook Reinforcement Learning: An Introduction, second edition, by Sutton and Barto.

So my idea is to apply Q-Learning to train the red agent and blue agent individually. My understanding is that, in order to do this, I need to define the "environment," which consists of (1) the "state space," (2) the "actions," and (3) the "rewards." The problem is that, despite studying a ton of tutorials, I still can't figure out what my "state space" is supposed to be here (for the red agent and blue agent individually), nor can I figure out what its Python implementation would be. What is the "state space" here, and what would its Python implementation be?


1 Answer


I'll try to answer the general, conceptual question. For simplicity, I assume that you optimize just one agent at a time.

Reinforcement Learning

Sutton and Barto mostly talk about small, fully observable, discrete state spaces (plus all the basic RL concepts).

Most practical RL problems are not fully observable (in the Markov sense), meaning there is some decision-relevant hidden state that the agents can only guess from multiple past observations.

To stay within the RL framework, you want to design your "state" so that it summarizes everything the agent needs to know about past interactions, in a way that is "good enough" to make optimal decisions. Then the agent itself doesn't need any memory, the Markov property holds, and all the RL theory is applicable.

I think what this means in your case is that you need to summarize the past observable behaviour of the other agents somehow, and add that into the agent's state.
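As a concrete illustration (a minimal sketch, not a drop-in implementation): for the blue agent, a discrete state could combine only the quantities the question says are observable — the current number of green agents voting, the opponent's last action, and the blue agent's own energy level — coarsened so that a tabular Q-function stays manageable. All names here (encode_blue_state, BLUE_ACTIONS, and so on) are hypothetical, not part of the original code:

import random
from collections import defaultdict

# Hypothetical discrete state for the blue agent: everything it can observe,
# coarsened so that a tabular Q-function stays small.
#   vote_bucket: fraction of green agents voting, discretized into 10 bins
#   last_red_action: 0 = none observed yet, 1..5 = red's last potency level
#   energy: blue's remaining energy, 0..10
def encode_blue_state(num_voting, num_green, last_red_action, energy):
    vote_bucket = min(int(10 * num_voting / num_green), 9)
    return (vote_bucket, last_red_action, energy)

# Actions for the blue agent: potency 1..5, or 0 = insert a grey agent.
BLUE_ACTIONS = [0, 1, 2, 3, 4, 5]

# Tabular Q-function: maps (state, action) -> value, defaulting to 0.
Q = defaultdict(float)

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy selection over the tabular Q-values.
    if random.random() < epsilon:
        return random.choice(BLUE_ACTIONS)
    return max(BLUE_ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Standard one-step Q-learning update.
    best_next = max(Q[(next_state, a)] for a in BLUE_ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

The red agent's state would be analogous, with the blue agent's last observed action and, for instance, the number of green nodes the red agent is still connected to, in place of the energy level.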

Black-Box Optimization

Alternatively, you could try framing this as a black-box optimization problem. (In RL terms: you optimize the cumulative return of each episode, rather than using the per-step rewards.)

The way to approach this is to ask yourself, "can I create a parametrizable model of a solution?", and then throw the parameters at one of the well-known optimization algorithms, like CMA-ES (for continuous parameters), or the Cross-Entropy Method (mostly for discrete parameters), or (if you are willing to do lots of parameter tuning) even a genetic algorithm.

If you can come up with a model that has only very few (say ~50) parameters, IMO just throw it at CMA-ES.
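As a minimal sketch of that loop with the cma package (pip install cma), assuming you can write a function — here called run_episode, which is hypothetical — that runs one full simulation with the agent's policy determined by a fixed parameter vector and returns the agent's cumulative score:

import cma
import numpy as np

def run_episode(params):
    # Hypothetical: run one full simulation where the agent's policy is some fixed
    # function of `params` (e.g. the weights of a small linear policy), and return
    # the agent's total score for that episode.
    raise NotImplementedError

def objective(params):
    # CMA-ES minimizes, so return the negative of the (averaged) episode return.
    # Averaging over a few episodes reduces the noise from the random network.
    returns = [run_episode(params) for _ in range(5)]
    return -float(np.mean(returns))

# 20 policy parameters, initial guess 0, initial step size 0.5 (all assumptions).
es = cma.CMAEvolutionStrategy(20 * [0.0], 0.5)
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [objective(x) for x in candidates])
best_params = es.result.xbest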

The question here is again how to represent the agent's inputs. Now you have the advantage that the agent can have any state/memory it wants, for example for tracking the other agent's history. (In the MDP setup, only the environment can have state/memory.) You could use a neural network, an RNN, or even add a few to-be-optimized parameters that decide what goes into the memory and what is taken out.

You may also be able to use a policy gradient (reinforcement learning) method, especially if you use a neural network, and thus avoid black-box optimization and instead do gradient descent (usually faster, where applicable).
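A rough sketch of that direction, assuming PyTorch and the same observable inputs as above (the 3-dimensional observation and the six blue actions are assumptions, not part of the original code):

import torch
import torch.nn as nn

# Tiny policy network over the 6 blue actions (potency 1-5 or "insert grey agent").
policy = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 6))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def select_action(observation):
    # Sample an action from the softmax over the policy's logits.
    logits = policy(torch.as_tensor(observation, dtype=torch.float32))
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def reinforce_update(log_probs, episode_return):
    # Plain REINFORCE: increase the log-probability of the actions taken,
    # weighted by the whole-episode return.
    loss = -episode_return * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In practice you would also subtract a baseline from the episode return to reduce the variance of the gradient estimate.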
