# README

Important information (we hope).

## AI for TSP Competition

https://www.tspcompetition.com/

In a Travelling Salesman Problem (TSP), the goal is to find the tour with the smallest cost visiting all locations ( customers) in a network exactly once.

However, in practical applications, one rarely knows all the travel costs between locations precisely. Moreover, there could be specific time windows at which customers need to be served, and specific customers can be more valuable than others. Lastly, the salesman is often constrained by a maximum capacity or travel time, representing a limiting factor in the number of nodes that can be visited.

In this competition, we consider a more realistic version of the classical TSP, i.e., the time-dependent orienteering problem with stochastic weights and time windows (TD-OPSWTW). In this formulation, the stochastic travel times between locations are only revealed as the salesman travels in the network. The salesman starts from a depot and must return to the depot at the end of the tour. Moreover, each node (customer) in the network has its prize, representing how important it is to visit a given customer on a tour. Each node has associated time windows. We consider that a salesman may arrive earlier at a node without compromising its prize, but the salesman has to wait until the opening times to serve the customer. Lastly, the salesman must not violate a total travel time budget while collecting prizes in the network. The goal is to collect the most prizes in the network while respecting the time windows and the total travel time of a tour allowed to the salesman.

For ease of discussion, we have created a Slack chat workspace. You can join us here if you have any questions, or use our Github repository for issues.

### Dependencies

Python=3.8 (should be OK with v >= 3.6)

PyTorch=1.8 (track 2 only)

Numpy=1.20

bayesian-optimization=1.1.0 (track 1 only)

Pandas=1.2.4

Conda=4.8.4 (optional)

Please check `environment.yml`

### Competition Rules

Participants need to form

**teams**(up to 5 participants) to be able to participate.If a single person wants to participate, a team of one person must be formed.

Participants must

**register**their teams via the website.Submissions must be made through the website.

Submissions for track 1 must make use of a surrogate model in one way or another, and submissions for track 2 must make use of reinforcement learning in one way or another.

Submissions must contain the

**source code**of your submission (`.zip`

) and the**relevant output files**:`.out`

(track 1), and a`.zip`

file containing**a single**`.json`

file (track 2). The source code will not be made public without your consent but will be used to verify the method used to obtain the solution.The competition is composed of two phases:

**validation**and**test**.In the

**validation**phase, all participants are evaluated with the same validation instance (track 1) or validation set (track 2).The final submission deadline for the validation phase is on the

**5th of July, 2021**, Anywhere on Earth (AoE) time.In the

**test**phase, teams will be asked to submit a new submission file and codes for test instances.For the test phase, teams will have

**1 week**to submit their**codes**and**submission files**to the website.The deadline for the test phase is on the

**12th of July, 2021**(AoE).The organizers will thoroughly check the codes of the teams in the test phase regarding the

**solution approach**and**submission scores**.If discrepancies of more than

**5%**between the results submitted and the results obtained running the code are found, the team will be disqualified. Please make sure your results are reproducible.**Winners**will be contacted on the**9th of August, 2021**.Winners will be announced publicly during the (DATA SCIENCE MEETS OPTIMISATION)

**DSO workshop @IJCAI-2021 (21st of August, 2021)**.In the case of a tie (track 1), the number of Monte Carlo samples used to determine the score is increased by a factor of 10.

If a tie still persists, and

there is a tie between multiple teams for first place, both the prize money for first and second place will be divided among the winning teams. Teams that did not achieve the highest score do not receive any prize money.

there is only a tie between multiple teams for second place, the first team earns the prize money of the first place as normal. The prize money for second place will be divided among the second-placed teams.

We consider **two tracks** in the competition: Surrogate-based Optimization (track 1) and Reinforcement Learning (track 2).

### Track 1: Surrogate-based Optimization

In the surrogate-based track, the goal is to find the tour s that gives the highest reward for one instance i:

`s* = argmax E[f(s,i)]`

We use the expected value because the simulator is stochastic: it can give different rewards even if the same route is evaluated multiple times. The expected value for a route is approximated by evaluating the objective `f`

for that route ten thousand times and calculating the average reward. This computation takes multiple seconds on standard hardware. Therefore, the problem can be seen as an expensive optimization problem. Our goal is to solve this problem using a surrogate-based optimization method such as Bayesian optimization.

In the end, given an instance, participants should submit the route that they believe gives the highest reward for that instance.

#### Objective

For calculating the reward of a route, we first need to create the instance. This can be done by creating an environment.

This code also appears in `baseline_surrogate/demo_surrogate.py`

. When the main competition starts, the exact number of nodes and the exact random seed will be given, so the instance is known.

To evaluate a solution, we need to make sure the solution is in the correct format. A solution always has the form `[1,s_1,...,s_n]`

, with `n`

the number of nodes. The numbers `s_1,...,s_n`

need to contain all integers from 1 to n. This means that the number 1 will appear twice in the solution. As this number indicates the starting node, it means that the route consists of starting from the starting node, visiting any number of nodes, then returning to the starting node at some point. Any nodes that appear in the solution after returning to the starting node are ignored in the environment. A solution can be run in the environment with the following code:

The term `rewards+pen`

is the objective that needs to be optimized, where `pen`

is a negative reward that comes from violating constraints in the problem. However, as mentioned earlier, we evaluate the objective ten thousand times and take the average, as the objective can differ even for the same solution due to the random noise present in the environment. The exact objective is given in the function `objective`

, which requires a solution and an environment as input and provides the averaged objective as output:

In the competition, we provide the instance, and participants are expected to submit the solution that they believe gives the best (highest) objective for this instance. An example solution submission is given in the file `check_solution_surrogate.py`

. The file submitted on the website should simply be a `.out`

file that contains the numbers of the solution on separate lines, e.g.:

Example output files can be found in the folder `baseline_surrogate\`

.

#### Using continuous solvers: an example with Bayesian optimization with Gaussian processes

Though the optimization problem is discrete, it is possible to apply continuous optimization algorithms (such as Bayesian optimization with Gaussian processes) to this problem by rounding solutions. We provide code for this in `baseline_surrogate/demo_surrogate.py`

, using a similar transformation of the search space as in 'Black-box combinatorial optimization using models with integer-valued minima, Laurens Bliek, Sicco Verwer & Mathijs de Weerdt' . In this approach, we let the first variable denote any of the n nodes, the second variable any of the nodes that are remaining, and so on until only one node is remaining. A continuous solver then only needs to use the right upper and lower bounds, and then after rounding we can transform these values into a valid solution with the following code:

The objective for a continuous solver such as Bayesian optimization is then given by:

For the lower and upper bounds we need to be careful with the rounding, so instead of having the lower bound be 0 and the upper bound the remaining number of cities, we add and subtract a small number:

Finally, we can call the Bayesian optimization method as follows after installing it with `conda install -c conda-forge bayesian-optimization`

:

Other surrogate algorithms might be able to use the `objective`

function with its discrete input directly, or require to transform the objective in a different way than the function `objBO`

shown above. Participants are free to use their own objective function, even multi-fidelity ones or using different ways to incorporate constraints, but the one that will be used for evaluating a solution is the function `objective`

.

### Track 2: Reinforcement Learning

In the Reinforcement Learning (RL) track, we are interested in a **policy**.

**Policy**: A policy in the TD-OPSWTW selects the next node to be visited given a sequence of previously visited nodes. To cope with the stochastic travel times, a **policy must be adaptive**. Therefore, a policy needs to take as input the instance information to construct tours dynamically that respect the time windows of nodes and the total tour time allowed for the instance. Note that the competition's goal is to learn how to act when there is uncertainty in the travel times. Suppose you transform the problem into a deterministic one every time a new simulation is started (by having complete information of the sampled travel times). In that case, this **will not** be considered adaptive. Thus, the proposed solutions should take the instance features as input (including the maximum travel times between locations) and use the information revealed to the policy as you build the tour node-by-node. Note that unlike Track 1, we are interested in general policies applicable to any instance of the TD-OPSWTW in the training distribution. The following figure shows an example of a next node visitation decision that has to be made by a policy visiting `n=6`

nodes.

In the figure, a policy has visited nodes 1 (depot) and 6 with travel time `t_{1,6}`

revealed after visiting node 6. At this current decision epoch, the policy has to choose the next node to visit. The prizes `p_i`

and time window bounds `[l_i, u_i]`

are known and given in the instance, as well as the maximum allowed tour time `T`

. The decision should consider the prizes of each node, the time windows, and the total remaining travel time when selecting the next node (in this case, node 3). Note that your policy can utilise the information of the travel time `t_{1,6}`

for the next visitation decision.

To achieve a feasible solution, a policy needs to visit nodes respecting the upper bounds of the time windows, and the maximum tour times. When the policy decides to arrive early at a node, the travel time gets shifted to the beginning of the time window. For example, if the travel time between the depot (node 1) and node 6 is lower than `l_6`

, the salesman has to wait until `l_6`

to depart from that node. This information becomes available as soon as the salesman arrives at node 6. Lastly, a policy must always return to the depot, and this travel time is also included in the maximum allowed tour time.

#### Environment

The RL environment is implemented on an instance by instance basis. That is, to initiate the environment, one has to pass an instance file. If an instance file is not given, the environment will create an instance on the fly. This behaviour can be helpful for training. One can call the RL environment as follows

Note that the previous call initiates an environment generating an instance on the fly with 5 nodes. Each instance is assigned an `instance_name`

and a `name`

. Both are used in the submission file that is used to score submissions. Every time the environment is `reset()`

, new travel times are drawn at random (new Monte Carlo sample). We call each simulation, i.e., each randomness, a `tour000`

. During the evaluation, you will be given instances, seeds, and a number of tours to generate for each instance file.

The RL environment's default behaviour is `adaptive`

. This means that it expects a node as the input of each `step()`

call. However, we allow for the environment to be called in the same manner as Track 1 by setting `adaptive=False`

and calling `check_solution()`

as before.

To call the environment from an instance file:

Note that when the environment is initiated, the first simulation is already started, i.e., calling `reset()`

will create a second simulation, i.e., `tour002`

. You are not allowed to use `noisy_adj`

as input to your proposed solution as this would turn the problem into a deterministic one.

**Taking a step in the environment**

A tour in the environment always starts from the depot. Thus, the first call of the `step()`

method does not need to include the depot. By convention, node 1 is always the depot. Therefore, a tour is considered complete if node 1 is passed to the `step()`

method. After that, no other prizes, penalties, or travel times are incurred. Please see the example below:

**Travel times, prizes, penalties, and violations**

Each call of the `step()`

method returns several useful metrics for the task. In order: total travel time of a sequence, travel time between the previous node and the current node, prize collected at the current node, the penalty incurred at the current node, boolean of the feasibility after visiting the current node, type of violation, and a boolean of the completed tour.

Note that each violated constraint (not respecting the time windows from above or not respecting the maximum tour time) incur penalties. In other words, it is possible to submit only infeasible tours. However, these will be penalised in the final scores. Each type of violation is also identified as `0: no violation, 1: time window violation, 2: maximum tour time violation`

. The maximum tour time violation takes precedence over the time window violation and incurs the highest penalties. The prizes are between `[0, 1]`

and depend on the maximum travel time to the depot, with nodes farther from the depot assigned higher prizes. Penalties are as follows:

Violating a time window:

`-1.0`

Violating the maximum tour time:

`-1.0*n`

**Instance features**

To aid learning, one can make use of instance and (maximum) travel times between locations. We point out that these travel times are **not** exactly the travel times experienced in the simulations. One can recover the instance features containing: node coordinates in 2D (used to generate the simulations), time windows, prizes, and maximum tour time. See the example below:

In the above example the first row of `inst`

represents:

node:1,

coordinates: 65.0, 27.0

tw: [0, 339]

prize: 0

max tour time: 354

#### Baseline

We provide a baseline to the RL competition based on 'Neural Combinatorial Optimization with Reinforcement Learning, Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, Samy Bengio' . Note that the above approach does not adapt to the travel times observed in a given Monte Carlo simulation and will not perform well in the given task. This baseline is just a reference as to how RL can be used. Moreover, it only uses the coordinates and prizes to make decisions on complete tours.

**Calling the baseline:**

If you would like a reference to an adaptive method, consider 'Attention, Learn to Solve Routing Problems!, Wouter Kool, Herke van Hoof, Max Welling' (Stochastic PCTSP).

### Data

#### Training

Participants are free to use any form of training as long it includes a Reinforcement Learning method. At the end of the competition, we will invite the winners to submit their codes for compliance with the competition rules. If it becomes clear that the proposed method does not use any form of RL or the submitted results cannot be reproduced with the code provided, these teams will be disqualified.

#### Validation

We provide **1000 instances** as a validation set `instance0000.csv`

, `adj-instance0000.csv`

. Note that this validation set will be used throughout the competition to evaluate your submissions on the website. For each validation instance, participants have to generate **100** simulations and use their methods to generate tours. Each tour (Monte Carlo simulation) is assigned a name following `tour000`

. This naming scheme is already present in the environment. The instances have different nodes: 20, 50, 100, 200 (250 instances each).

**Validation instances:**

Participants can use any instances for training, but validation will always be done on the same instance sizes. Based on the validation instances, we will select the best performing teams. These teams will be given a final test dataset. This final test dataset will be used to compute the final scores and define the competition winners. Note that this test dataset will follow the same generation process as the validation dataset. The details about the test dataset will be revealed to the qualifying participants.

### Submission

A submission file example can be found in the `baseline_rl`

folder, named `example_output_rl.json`

. The submission file **is a .zip file containing a single **`.json`

file with the instance name, followed by the number of nodes, seed, and 100 simulated tours. Thus `tour001`

until `tour100`

.

For the validation data, all seeds are the same `seed: 12345`

. Each tour name must be followed by an array of integers of size `n+1`

. For a solution to be considered valid, the depot `node: 1`

must appear twice in every array. Moreover, the depot **must** appear in the first position. If the solutions do not comply with these standards, the submission will be invalid. When submitting to the website you must upload **a single zip** file containing a **single** `.json`

file.

### Scoring Submissions

We will score submissions considering the sum of prizes and penalties. That is, for each instance and tour (Monte Carlo simulation) we will compute `score = prizes+penalties`

. The scores of all Monte Carlo simulations will be averaged, and the submission with the highest average `score`

will be ranked highest. Note that all participants can compute their validation scores before submission. We encourage you to do so as this will minimise the submission errors. You can score a solution by calling

Last updated