Refactor neighbourhood exploration

2021-04-15 20:12:05 +02:00
11 changed files with 142 additions and 455 deletions
--- a/docs/Summary.org
+++ b/docs/Summary.org
@@ -1,155 +0,0 @@
 #+TITLE: Assignment 1
 #+SUBTITLE: Metaheuristics
 #+AUTHOR: Amin Kasrou Aouam
 #+DATE: 2021-04-19
 #+PANDOC_OPTIONS: template:~/.pandoc/templates/eisvogel.latex
 #+PANDOC_OPTIONS: listings:t
 #+PANDOC_OPTIONS: toc:t
 #+PANDOC_METADATA: lang=es
 #+PANDOC_METADATA: titlepage:t
 #+PANDOC_METADATA: listings-no-page-break:t
 #+PANDOC_METADATA: toc-own-page:t
 #+PANDOC_METADATA: table-use-row-colors:t
 #+PANDOC_METADATA: colorlinks:t
 #+PANDOC_METADATA: logo:/home/coolneng/Photos/Logos/UGR.png
 #+LaTeX_HEADER: \usepackage[ruled, lined, linesnumbered, commentsnumbered, longend]{algorithm2e}
 * Assignment 1
 ** Introduction
 In this assignment, we use different search algorithms to solve the maximum diversity problem (MDP). We implement:
 - /Greedy/ algorithm
 - Local search algorithm
 ** Algorithms
 *** Greedy
 The /greedy/ algorithm adds one point per iteration until it reaches a solution of size m.
 First, we select the element furthest from all the others (centroid) and add it to our set of selected elements. At each subsequent step, we add the element chosen by the /MaxMin/ measure. The algorithm is illustrated below:
 \begin{algorithm}
    \KwIn{A list $[a_i]$, $i=1, 2, \cdots, m$, that contains the chosen point and the distance}
    \KwOut{Processed list}
    $centroid \leftarrow getFurthestElement()$
    $Sel = [centroid]$
    \For{$i \leftarrow 1$ \KwTo $m - 1$}{
        $closestElements = [\ ]$
        \For{$element$ in $Sel$}{
            $closestPoint \leftarrow getClosestPoint(element)$
            $closestElements.append(closestPoint)$
        }
        $maximum \leftarrow max(closestElements)$
        $Sel.append(maximum)$
    }
    \KwRet{$Sel$}
 \end{algorithm}
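 The greedy construction described above can be sketched with NumPy. This is only an illustrative sketch, not the code in src/greedy.py: it assumes the distances are available as a dense symmetric matrix, it uses one common reading of the /MaxMin/ rule (add the candidate whose distance to its nearest selected point is largest), and the name =greedy_maxmin= is hypothetical.
 #+begin_src python
import numpy as np


def greedy_maxmin(distances, m):
    """Pick m points from a symmetric distance matrix (hypothetical helper)."""
    # Start from the point whose summed distance to all the others is largest.
    selected = [int(np.argmax(distances.sum(axis=1)))]
    while len(selected) < m:
        # Distance from every point to its nearest already-selected point.
        nearest = distances[:, selected].min(axis=1)
        # Points that are already selected must not be picked again.
        nearest[selected] = -np.inf
        selected.append(int(np.argmax(nearest)))
    return selected


# Toy data (assumption): four points on a line at positions 0, 1, 4 and 10.
positions = np.array([0.0, 1.0, 4.0, 10.0])
toy = np.abs(positions[:, None] - positions[None, :])
greedy_result = greedy_maxmin(toy, m=3)
print(greedy_result)
 #+end_src
 Working on a matrix avoids one DataFrame query per candidate, at the cost of holding all n x n distances in memory.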
 *** Local search
 The local search algorithm starts from a random solution of size /m/ and explores neighbouring solutions for a maximum number of iterations.
 To make the algorithm more efficient, we use the first-improvement heuristic (we select the first neighbouring solution that improves on the current one). The algorithm is illustrated below:
 \begin{algorithm}
    \KwIn{A list $[a_i]$, $i=1, 2, \cdots, m$, the solution}
    \KwOut{Processed list}
    $Solutions = [\ ]$
    $firstSolution \leftarrow getRandomSolution()$
    $Solutions.append(firstSolution)$
    $lastSolution \leftarrow getLastElement(Solutions)$
    $maxIterations \leftarrow 1000$
    \For{$i \leftarrow 0$ \KwTo $maxIterations$}{
        $neighbour \leftarrow getNeighbouringSolution(lastSolution)$
        \While{$neighbour \leq lastSolution$}{
            $neighbour \leftarrow getNeighbouringSolution(lastSolution)$
        }
        $Solutions.append(neighbour)$
        $lastSolution \leftarrow getLastElement(Solutions)$
    }
    $finalSolution \leftarrow getLastElement(Solutions)$
    \KwRet{$finalSolution$}
 \end{algorithm}
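 The first-improvement strategy can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in src/local_search.py: distances are held in a dense symmetric matrix, a neighbour is obtained by swapping one selected point for one unselected point, and the names =fitness= and =first_improvement_search= are hypothetical.
 #+begin_src python
import numpy as np


def fitness(distances, solution):
    """Sum of pairwise distances between the selected points."""
    idx = np.asarray(solution)
    # Every pair appears twice in the symmetric submatrix, hence the halving.
    return distances[np.ix_(idx, idx)].sum() / 2.0


def first_improvement_search(distances, m, max_iterations=1000, rng_seed=42):
    rng = np.random.default_rng(rng_seed)
    n = distances.shape[0]
    solution = [int(p) for p in rng.choice(n, size=m, replace=False)]
    best = fitness(distances, solution)
    for _ in range(max_iterations):
        improved = False
        outside = [p for p in range(n) if p not in solution]
        # Visit swap neighbours in random order.
        for i in rng.permutation(m):
            for candidate in rng.permutation(outside):
                neighbour = solution.copy()
                neighbour[i] = int(candidate)
                value = fitness(distances, neighbour)
                if value > best:
                    # First improvement: accept it and stop scanning.
                    solution, best, improved = neighbour, value, True
                    break
            if improved:
                break
        if not improved:
            # No swap improves the solution: local optimum reached.
            break
    return sorted(solution), best


# Toy data (assumption): four points on a line at positions 0, 1, 4 and 10.
positions = np.array([0.0, 1.0, 4.0, 10.0])
toy = np.abs(positions[:, None] - positions[None, :])
search_solution, search_value = first_improvement_search(toy, m=2)
print(search_solution, search_value)
 #+end_src
 Accepting the first improving swap keeps each iteration cheap, while the early exit stops the search once no swap increases the total distance.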
 ** Implementation
 The assignment is implemented in /Python/, using the following libraries:
 - NumPy
 - Pandas
 *** Installation
 Running the program requires Python, along with the *Pandas* and *NumPy* libraries.
 The shell.nix file is provided to simplify installing the dependencies with the [[https://nixos.org/][Nix]] package manager. Once Nix is installed, run the following command from the project root:
 #+begin_src shell
 nix-shell
 #+end_src
 *** Usage
 The program is run with the following command:
 #+begin_src shell
 python src/main.py <dataset> <algorithm>
 #+end_src
 The accepted parameters are:
 | dataset                        | algorithm |
 |--------------------------------+-----------|
 | Any file in the data directory | greedy    |
 |                                | local     |
 A script is also provided that runs 1 iteration of the greedy algorithm and 3 iterations of the local search on each of the /datasets/ and saves the results to a spreadsheet. It can be run with the following command:
 #+begin_src shell
 python src/execution.py
 #+end_src
 *Note*: the [[https://xlsxwriter.readthedocs.io/][XlsxWriter]] library must be installed to export the results to an Excel file.
 ** Analysis of the results
 The results are stored in the /algorithm-results.xlsx/ file. We analyse each algorithm separately.
 *** Greedy algorithm
 #+CAPTION: Greedy algorithm
 [[./assets/greedy.png]]
 The greedy algorithm is deterministic and is run only once per dataset, so the standard deviation is zero. The running time varies considerably with the dataset:
 - Dataset with n=500: 7-10 seconds
 - Dataset with n=2000: 5-12 minutes
 The total distance obtained is generally lower than that of the local search algorithm, although the difference is not significant.
 *** Local search algorithm
 #+CAPTION: Local search algorithm
 [[./assets/local.png]]
 The local search algorithm is stochastic, because each solution is obtained using a pseudorandom number generator. The running time varies considerably with the dataset:
 - Dataset with n=500: 1-2 minutes
 - Dataset with n=2000: 20-25 minutes
 The total distance obtained is generally higher than that of the greedy algorithm, which indicates that local search obtains better results at the expense of running time.
 Because of our computational limitations, this algorithm was run with a maximum of 100 iterations.
--- a/docs/Summary.pdf
+++ b/docs/Summary.pdf
--- a/docs/algorithm-results.xlsx
+++ b/docs/algorithm-results.xlsx
--- a/docs/assets/greedy.png
+++ b/docs/assets/greedy.png
--- a/docs/assets/local.png
+++ b/docs/assets/local.png
--- a/shell.nix
+++ b/shell.nix
@@ -2,11 +2,4 @@
 with pkgs;
-mkShell {
-  buildInputs = [
-    python39
-    python39Packages.numpy
-    python39Packages.pandas
-    python39Packages.XlsxWriter
-  ];
-}
+mkShell { buildInputs = [ python39 python39Packages.pandas ]; }
--- a/src/execution.py
+++ b/src/execution.py
@@ -1,98 +0,0 @@
 from glob import glob
 from subprocess import run
 from sys import executable
 from numpy import mean, std
 from pandas import DataFrame, ExcelWriter
 def file_list(path):
    file_list = []
    for fname in glob(path):
        file_list.append(fname)
    return file_list
 def create_dataframes():
    greedy = DataFrame()
    local = DataFrame()
    return greedy, local
 def process_output(results):
    distances = []
    time = []
    for element in results:
        for line in element:
            if line.startswith(bytes("Total distance:", encoding="utf-8")):
                line_elements = line.split(sep=bytes(":", encoding="utf-8"))
                distances.append(float(line_elements[1]))
            if line.startswith(bytes("Execution time:", encoding="utf-8")):
                line_elements = line.split(sep=bytes(":", encoding="utf-8"))
                time.append(float(line_elements[1]))
    return distances, time
 def populate_dataframes(greedy, local, greedy_list, local_list, dataset):
    greedy_distances, greedy_time = process_output(greedy_list)
    local_distances, local_time = process_output(local_list)
    greedy_dict = {
        "dataset": dataset.removeprefix("data/"),
        "media distancia": mean(greedy_distances),
        "desviacion distancia": std(greedy_distances),
        "media tiempo": mean(greedy_time),
        "desviacion tiempo": std(greedy_time),
    }
    local_dict = {
        "dataset": dataset.removeprefix("data/"),
        "media distancia": mean(local_distances),
        "desviacion distancia": std(local_distances),
        "media tiempo": mean(local_time),
        "desviacion tiempo": std(local_time),
    }
    greedy = greedy.append(greedy_dict, ignore_index=True)
    local = local.append(local_dict, ignore_index=True)
    return greedy, local
 def script_execution(filenames, greedy, local, iterations=3):
    script = "src/main.py"
    for dataset in filenames:
        print(f"Running on dataset {dataset}")
        greedy_cmd = run(
            [executable, script, dataset, "greedy"], capture_output=True
        ).stdout.splitlines()
        local_list = []
        for _ in range(iterations):
            local_cmd = run(
                [executable, script, dataset, "local"], capture_output=True
            ).stdout.splitlines()
            local_list.append(local_cmd)
        greedy, local = populate_dataframes(
            greedy, local, [greedy_cmd], local_list, dataset
        )
    return greedy, local
 def export_results(greedy, local):
    dataframes = {"Greedy": greedy, "Local search": local}
    writer = ExcelWriter(path="docs/algorithm-results.xlsx", engine="xlsxwriter")
    for name, df in dataframes.items():
        df.to_excel(writer, sheet_name=name, index=False)
        worksheet = writer.sheets[name]
        for index, column in enumerate(df):
            series = df[column]
            max_length = max(series.astype(str).str.len().max(), len(str(series.name)))
            worksheet.set_column(index, index, width=max_length + 5)
    writer.save()
 def main():
    datasets = file_list(path="data/*.txt")
    greedy, local = create_dataframes()
    populated_greedy, populated_local = script_execution(datasets, greedy, local)
    export_results(populated_greedy, populated_local)
 if __name__ == "__main__":
    main()
--- a/src/greedy.py
+++ b/src/greedy.py
@@ -1,58 +0,0 @@
 from pandas import DataFrame, Series
 def get_first_solution(n, data):
    distance_sum = DataFrame(columns=["point", "distance"])
    for element in range(n):
        element_df = data.query(f"source == {element} or destination == {element}")
        distance = element_df["distance"].sum()
        distance_sum = distance_sum.append(
            {"point": element, "distance": distance}, ignore_index=True
        )
    furthest_index = distance_sum["distance"].astype(float).idxmax()
    furthest_row = distance_sum.iloc[furthest_index]
    furthest_row["distance"] = 0
    return furthest_row
 def get_different_element(original, row):
    if row.source == original:
        return row.destination
    return row.source
 def get_closest_element(element, data):
    element_df = data.query(f"source == {element} or destination == {element}")
    closest_index = element_df["distance"].astype(float).idxmin()
    closest_row = data.loc[closest_index]
    closest_point = get_different_element(original=element, row=closest_row)
    return Series(data={"point": closest_point, "distance": closest_row["distance"]})
 def explore_solutions(solutions, data, index):
    closest_elements = solutions["point"].apply(func=get_closest_element, data=data)
    furthest_index = closest_elements["distance"].astype(float).idxmax()
    solution = closest_elements.iloc[furthest_index]
    solution.name = index
    return solution
 def remove_duplicates(current, previous, data):
    duplicate_free_df = data.query(
        "(source != @current or destination not in @previous) and \
        (source not in @previous or destination != @current)"
    )
    return duplicate_free_df
 def greedy_algorithm(n, m, data):
    solutions = DataFrame(columns=["point", "distance"])
    first_solution = get_first_solution(n, data)
    solutions = solutions.append(first_solution, ignore_index=True)
    for iteration in range(m - 1):
        element = explore_solutions(solutions, data, index=iteration + 1)
        solutions = solutions.append(element)
        data = remove_duplicates(
            current=element["point"], previous=solutions["point"], data=data
        )
    return solutions
--- a/src/local_search.py
+++ b/src/local_search.py
@@ -1,75 +0,0 @@
 from numpy.random import choice, seed, randint
 from pandas import DataFrame
 def get_row_distance(source, destination, data):
    row = data.query(
        """(source == @source and destination == @destination) or \
        (source == @destination and destination == @source)"""
    )
    return row["distance"].values[0]
 def compute_distance(element, solution, data):
    accumulator = 0
    distinct_elements = solution.query(f"point != {element}")
    for _, item in distinct_elements.iterrows():
        accumulator += get_row_distance(
            source=element,
            destination=item.point,
            data=data,
        )
    return accumulator
 def get_first_random_solution(n, m, data):
    solution = DataFrame(columns=["point", "distance"])
    seed(42)
    solution["point"] = choice(n, size=m, replace=False)
    solution["distance"] = solution["point"].apply(
        func=compute_distance, solution=solution, data=data
    )
    return solution
 def element_in_dataframe(solution, element):
    duplicates = solution.query(f"point == {element}")
    return not duplicates.empty
 def replace_worst_element(previous, n, data):
    solution = previous.copy()
    worst_index = solution["distance"].astype(float).idxmin()
    random_element = randint(n)
    while element_in_dataframe(solution=solution, element=random_element):
        random_element = randint(n)
    solution["point"].loc[worst_index] = random_element
    solution["distance"].loc[worst_index] = compute_distance(
        element=solution["point"].loc[worst_index], solution=solution, data=data
    )
    return solution
 def get_random_solution(previous, n, data):
    solution = replace_worst_element(previous, n, data)
    while solution["distance"].sum() <= previous["distance"].sum():
        solution = replace_worst_element(previous=solution, n=n, data=data)
    return solution
 def explore_neighbourhood(element, n, data, max_iterations=100000):
    neighbourhood = []
    neighbourhood.append(element)
    for _ in range(max_iterations):
        previous_solution = neighbourhood[-1]
        neighbour = get_random_solution(previous=previous_solution, n=n, data=data)
        neighbourhood.append(neighbour)
    return neighbour
 def local_search(n, m, data):
    first_solution = get_first_random_solution(n, m, data)
    best_solution = explore_neighbourhood(
        element=first_solution, n=n, data=data, max_iterations=100
    )
    return best_solution
--- a/src/main.py
+++ b/src/main.py
@@ -1,61 +0,0 @@
 from preprocessing import parse_file
 from greedy import greedy_algorithm
 from local_search import local_search, get_row_distance
 from sys import argv
 from time import time
 from itertools import combinations
 def execute_algorithm(choice, n, m, data):
    if choice == "greedy":
        return greedy_algorithm(n, m, data)
    elif choice == "local":
        return local_search(n, m, data)
    else:
        print("The valid algorithm choices are 'greedy' and 'local'")
        exit(1)
 def get_fitness(solutions, data):
    accumulator = 0
    comb = combinations(solutions.index, r=2)
    for index in list(comb):
        elements = solutions.loc[index, :]
        accumulator += get_row_distance(
            source=elements["point"].head(n=1).values[0],
            destination=elements["point"].tail(n=1).values[0],
            data=data,
        )
    return accumulator
 def show_results(solutions, fitness, time_delta):
    duplicates = solutions.duplicated().any()
    print(solutions)
    print(f"Total distance: {fitness}")
    if not duplicates:
        print("No duplicates found")
    print(f"Execution time: {time_delta}")
 def usage(argv):
    print(f"Usage: python {argv[0]} <file> <algorithm choice>")
    print("algorithm choices:")
    print("greedy: greedy algorithm")
    print("local: local search algorithm")
    exit(1)
 def main():
    if len(argv) != 3:
        usage(argv)
    n, m, data = parse_file(argv[1])
    start_time = time()
    solutions = execute_algorithm(choice=argv[2], n=n, m=m, data=data)
    end_time = time()
    fitness = get_fitness(solutions, data)
    show_results(solutions, fitness, time_delta=end_time - start_time)
 if __name__ == "__main__":
    main()
--- a/src/processing.py
+++ b/src/processing.py
@@ -0,0 +1,141 @@
 from preprocessing import parse_file
 from numpy.random import choice, randint, seed
 from pandas import DataFrame, Series
 from sys import argv
 from time import time
 def get_first_solution(n, data):
    distance_sum = DataFrame(columns=["point", "distance"])
    for element in range(n):
        element_df = data.query(f"source == {element} or destination == {element}")
        distance = element_df["distance"].sum()
        distance_sum = distance_sum.append(
            {"point": element, "distance": distance}, ignore_index=True
        )
    furthest_index = distance_sum["distance"].astype(float).idxmax()
    furthest_row = distance_sum.iloc[furthest_index]
    return furthest_row
 def get_different_element(original, row):
    if row.source == original:
        return row.destination
    return row.source
 def get_closest_element(element, data):
    element_df = data.query(f"source == {element} or destination == {element}")
    closest_index = element_df["distance"].astype(float).idxmin()
    closest_row = data.loc[closest_index]
    closest_point = get_different_element(original=element, row=closest_row)
    return Series(data={"point": closest_point, "distance": closest_row["distance"]})
 def explore_solutions(solutions, data):
    closest_elements = solutions["point"].apply(func=get_closest_element, data=data)
    furthest_index = closest_elements["distance"].astype(float).idxmax()
    return closest_elements.iloc[furthest_index]
 def remove_duplicates(current, previous, data):
    data = data.query(
        f"(source != {current} or destination not in @previous) and (source not in @previous or destination != {current})"
    )
    return data
 def greedy_algorithm(n, m, data):
    solutions = DataFrame(columns=["point", "distance"])
    first_solution = get_first_solution(n, data)
    solutions = solutions.append(first_solution, ignore_index=True)
    for _ in range(m):
        element = explore_solutions(solutions, data)
        solutions = solutions.append(element)
        data = remove_duplicates(
            current=element["point"], previous=solutions["point"], data=data
        )
    return solutions
 def get_first_random_solution(m, data):
    seed(42)
    random_indexes = choice(len(data.index), size=m)
    return data.iloc[random_indexes]
 def replace_worst_element(previous, data):
    solution = previous.copy()
    worst_index = previous["distance"].astype(float).idxmin()
    random_candidate = data.loc[randint(low=0, high=len(data.index))]
    solution.loc[worst_index] = random_candidate
    return solution
 def get_random_solution(previous, data):
    solution = replace_worst_element(previous, data)
    while solution["distance"].sum() <= previous["distance"].sum():
        if solution.equals(previous):
            break
        solution = replace_worst_element(previous=solution, data=data)
    return solution
 def explore_neighbourhood(element, data, max_iterations=100000):
    neighbourhood = []
    neighbourhood.append(element)
    for _ in range(max_iterations):
        previous_solution = neighbourhood[-1]
        neighbour = get_random_solution(previous=previous_solution, data=data)
        if neighbour.equals(previous_solution):
            break
        neighbourhood.append(neighbour)
    return neighbour
 def local_search(m, data):
    first_solution = get_first_random_solution(m=m, data=data)
    best_solution = explore_neighbourhood(element=first_solution, data=data)
    return best_solution
 def execute_algorithm(choice, n, m, data):
    if choice == "greedy":
        return greedy_algorithm(n, m, data)
    elif choice == "local":
        return local_search(m, data)
    else:
        print("The valid algorithm choices are 'greedy' and 'local'")
        exit(1)
 def show_results(solutions, time_delta):
    distance_sum = solutions["distance"].sum()
    duplicates = solutions.duplicated().any()
    print(solutions)
    print("Total distance: " + str(distance_sum))
    if not duplicates:
        print("No duplicates found")
    print("Execution time: " + str(time_delta))
 def usage(argv):
    print(f"Usage: python {argv[0]} <file> <algorithm choice>")
    print("algorithm choices:")
    print("greedy: greedy algorithm")
    print("local: local search algorithm")
    exit(1)
 def main():
    if len(argv) != 3:
        usage(argv)
    n, m, data = parse_file(argv[1])
    start_time = time()
    solutions = execute_algorithm(choice=argv[2], n=n, m=m, data=data)
    end_time = time()
    show_results(solutions, time_delta=end_time - start_time)
 if __name__ == "__main__":
    main()