
sudoku-solver

πŸ”’ Sudoku Solutions Enumerator (Sequential and Parallel)
Coded by Simone Primarosa and Q. Matteo Chen.

Introduction to Sudoku

Example Sudoku Board

Sudoku is a popular puzzle game usually played on a 9x9 board of numbers between 1 and 9.

The goal of the game is to fill the board so that each row, each column, and each 3x3 box contains each of the numbers between 1 and 9 exactly once. This makes for an engaging and challenging puzzle game.

A well-formed Sudoku puzzle has a unique solution. More generally, though, a Sudoku puzzle can have more than one solution, and our goal is to enumerate them all; this task, however, is not always feasible. Indeed, given an empty Sudoku board, we would have to enumerate 6,670,903,752,021,072,936,960 solutions, which would take thousands of years.

Definitions

In the following sections, we will use some symbols or words to refer to specific aspects of the Sudoku problem.

The table below summarizes the most important.

Term Description
S A perfect square indicating the number of columns, rows, and boxes of a Sudoku board.
N The total number of cells of a board, given by N = S * S.
B The side length of a single box of the matrix, given by B = √S.
Board An instance of Sudoku represented with an S x S matrix.
Row A row of a board's matrix that can only contain one of each of the numbers in [1, S].
Column A column of a board's matrix that can only contain one of each of the numbers in [1, S].
Box A B x B sub-matrix of a board's matrix that can only contain one of each of the numbers in [1, S].
Cell A single entry of a board's matrix, either empty or with a legal assignment.
Empty Cell A cell whose assignment has still to be found.
Cell's Candidates A list of values in [1, S] which can be legally placed in a particular cell.
Search Space The product of the numbers of candidates of all the empty cells.
Solution An assignment of values for all the empty cells of a board that satisfies the constraints.

Solving Algorithm

A typical algorithm to solve Sudoku boards is called backtracking. This algorithm is essentially a depth-first search of the tree of all possible guesses for the empty cells of the Sudoku board.

Sequential Backtracking

The sequential algorithm is implemented iteratively, and it simply explores the tree of all the legal candidate assignments for each empty cell.

The pseudo-code that follows highlights its core parts.

def seq_sol_counter(board):
  stack = []

  if board.is_full(): return 1
  (row, col) = board.get_empty_cell()
  stack.push((row, col, EMPTY_CELL_VALUE))
  for val in board.get_candidates(row, col): stack.push((row, col, val))

  count = 0
  while len(stack) > 0:
    (row, col, val) = stack.pop()
    board.set_cell(row, col, val)
    if val == EMPTY_CELL_VALUE: continue

    if board.is_full(): count += 1; continue
    (row, col) = board.get_empty_cell()
    stack.push((row, col, EMPTY_CELL_VALUE))
    for val in board.get_candidates(row, col): stack.push((row, col, val))

  return count

The actual implementation can be found in src/main/java/sudoku/SequentialSolver.java.

It's important to notice that the strategy adopted by the get_empty_cell function to pick the empty cell can affect the search space and thus the time needed to enumerate all the solutions.

Another notable thing to consider is that the time and space complexity of all the operations on the board inside the while loop (is_full, set_cell, get_empty_cell, get_candidates) can significantly impact the overall performance of the backtracking and thus have to be kept as efficient as possible.

More details about the computational complexity of the operations and the idea behind their implementation can be found in the "implementation details" section.
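To make the scheme above concrete, here is a runnable Python sketch of the same iterative counter. It is illustrative only: the candidates are computed naively on the fly and the first empty cell is picked in row-major order, whereas the actual Java implementation uses the optimized board operations described in the "implementation details" section.

```python
from math import isqrt

EMPTY = 0

def candidates(board, row, col):
    # Values in [1, S] not already used in the cell's row, column, or box.
    S, B = len(board), isqrt(len(board))
    used = set(board[row]) | {board[r][col] for r in range(S)}
    br, bc = row - row % B, col - col % B
    used |= {board[r][c] for r in range(br, br + B) for c in range(bc, bc + B)}
    return [v for v in range(1, S + 1) if v not in used]

def first_empty(board):
    # Naive get_empty_cell: first empty cell in row-major order, or None.
    for r, line in enumerate(board):
        for c, v in enumerate(line):
            if v == EMPTY:
                return r, c
    return None

def seq_sol_counter(board):
    # Iterative backtracking: an EMPTY marker pushed under each cell's
    # candidates undoes the assignment once every candidate is explored.
    cell = first_empty(board)
    if cell is None:
        return 1
    stack = [cell + (EMPTY,)]
    stack += [cell + (v,) for v in candidates(board, *cell)]
    count = 0
    while stack:
        row, col, val = stack.pop()
        board[row][col] = val
        if val == EMPTY:
            continue
        cell = first_empty(board)
        if cell is None:
            count += 1
            continue
        stack.append(cell + (EMPTY,))
        stack += [cell + (v,) for v in candidates(board, *cell)]
    return count
```

On a 4x4 board (S = 4), an empty board yields 288 solutions, the known number of complete 4x4 grids.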

Parallel Backtracking

The parallel algorithm is implemented by parallelizing the recursive guesses of each empty cell using the fork/join model.

The pseudo-code that follows highlights its core parts.

def par_sol_counter(board, move):
  if move is not null:
    (row, col, val) = move
    board = board.clone()
    board.set_cell(row, col, val)

  if board.is_full(): return 1

  space = board.get_search_space_size()
  if space == 0: return 0
  if space <= SEQUENTIAL_CUTOFF:
    return seq_sol_counter(board)

  count = 0
  tasks = []
  (row, col) = board.get_empty_cell()
  for val in board.get_candidates(row, col):
    task = new recursive_task(par_sol_counter, board, (row, col, val))
    tasks.push(task)
  for i in range(1, len(tasks)):
    tasks[i].fork()
  count += tasks[0].compute()
  for i in range(1, len(tasks)):
    count += tasks[i].join()

  return count

The actual implementation can be found in src/main/java/sudoku/ParallelSolver.java.

The considerations that have been given about the sequential backtracking also hold for the parallel version. In addition to those, it's worth mentioning that two new methods (get_search_space_size and clone) and a new class (recursive_task) have been introduced.

More details about the computational complexity of the operations and the idea behind their implementation can be found in the "implementation details" section.

Implementation details

In this section, we discuss the purpose of the methods mentioned in the previous sections, providing, where appropriate, some insight into how we made them efficient.

Check if the board is completed

The operation is_full consists in checking whether the board still contains at least one empty cell.

Clearly, a simple approach is to loop through the whole board and check whether or not there is an empty cell, but this would cost us O(N) each time. Instead of doing so, we keep a count of the number of filled cells of the board (also called clues) and compare it against the total number of cells of the board.

At the cost of constant additional work inside set_cell to keep the count updated, this lowers the time complexity of the operation to O(1).
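As a minimal Python sketch of the counting trick (class and method names are illustrative; the project itself is written in Java):

```python
class Board:
    # Minimal sketch of the O(1) is_full trick; names are illustrative.
    EMPTY = 0

    def __init__(self, size):
        self.size = size
        self.cells = [[Board.EMPTY] * size for _ in range(size)]
        self.filled = 0  # running count of non-empty cells

    def set_cell(self, row, col, val):
        old = self.cells[row][col]
        # Constant extra work: adjust the count when a cell changes state.
        if old == Board.EMPTY and val != Board.EMPTY:
            self.filled += 1
        elif old != Board.EMPTY and val == Board.EMPTY:
            self.filled -= 1
        self.cells[row][col] = val

    def is_full(self):
        # O(1): compare the running count against N = S * S.
        return self.filled == self.size * self.size
```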

Check if a value is legal for a cell

The get_candidates operation has the job of returning the, possibly empty, list of valid values which can be legally placed in a particular empty cell.

To accomplish this, one could simply iterate on the row, column, and box of the given cell searching for unused values. Doing this would cost us O(3*S) + O(S) = O(S), and it's almost the best we can aim for with this particular operation. "Almost", because we can remove the O(3*S) addend by keeping track of the used values on each row, column, and box of the board.

To do so, we keep a bit-set of S bits for each row, column, and box and each time a specific cell's value v is set we also set the bit at position v - 1 of the 3 bit-sets for the particular row, column, and box of the given cell. To check whether a value v is valid or not we just check if the bit at position v - 1 is not set in any of the 3 bit-sets for the particular row, column, and box of the given cell. Since each check is constant and we have S values to check, the overall time complexity is O(S).

This optimization costs us constant additional work inside the set_cell method to keep the bit-sets updated, and an additional per-instance memory usage of O(3*S) = O(S).
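A Python sketch of the bit-set bookkeeping (class and method names are illustrative, not the project's API):

```python
from math import isqrt

class CandidateTracker:
    # One bit-set per row, column, and box: bit v-1 set means value v is used.
    def __init__(self, size):
        self.S, self.B = size, isqrt(size)
        self.rows = [0] * size
        self.cols = [0] * size
        self.boxes = [0] * size

    def box_of(self, row, col):
        return (row // self.B) * self.B + col // self.B

    def set_value(self, row, col, val):
        # Constant extra work done on each assignment to keep the sets updated.
        bit = 1 << (val - 1)
        self.rows[row] |= bit
        self.cols[col] |= bit
        self.boxes[self.box_of(row, col)] |= bit

    def get_candidates(self, row, col):
        # O(S): each of the S membership tests is a constant-time bit check.
        used = self.rows[row] | self.cols[col] | self.boxes[self.box_of(row, col)]
        return [v for v in range(1, self.S + 1) if not used & (1 << (v - 1))]
```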

Count the number of candidates of a cell

The get_search_space_size computes the search space as defined in the "definitions" section.

Intuitively, we can do something similar to get_candidates, counting the number of candidates instead of creating a list; this would cost us O(N) * O(S) = O(N * S). However, there's a tricky, memory-hungry approach that allows us to reduce the cost of the operation to just O(N) * O(1) = O(N).

Let's say that for a particular cell we want to count the candidates; the state of the 3 bit-sets could then be the following.

value 9 8 7 6 5 4 3 2 1
row 1 0 1 1 0 0 0 0 0
column 0 1 0 0 0 0 0 0 0
box 0 0 0 0 0 1 1 0 0

If we compute the bitwise or of the 3 bit-sets, we obtain a new bit-set that has a 1 on every invalid candidate, as shown below.

value 9 8 7 6 5 4 3 2 1
invalid 1 1 1 1 0 1 1 0 0

Thus, counting the number of candidates has been reduced to the problem of counting the zeros of a bit-set.

If we use fixed-size integers (e.g., 32-bit ints) to represent our bit-sets, then we can use the integer given by the binary representation of the bit-set to access a pre-computed table that tells us in constant time how many zeros or ones that particular number has. The pre-computed table has to be built only once, can be shared by all the boards instantiated, and implies an additional memory usage of O(2^S).
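In Python, the pre-computed table and the resulting constant-time candidate count can be sketched as follows (assuming S = 9, so the table has 2^9 = 512 entries):

```python
S = 9
# Built once and shared by all boards: ONES[m] = number of 1 bits in m.
ONES = [bin(m).count("1") for m in range(1 << S)]

def count_candidates(row_bits, col_bits, box_bits):
    # Bitwise or marks every invalid candidate; its zeros are the legal ones.
    invalid = row_bits | col_bits | box_bits
    return S - ONES[invalid]
```

With the bit-sets of the example above (row = 101100000, column = 010000000, box = 000001100), the merged mask is 111101100 and the cell has 3 candidates.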

Find an empty cell

One possible strategy we can use for the get_empty_cell is simply to pick the first empty cell we find on the board.

We could do this in O(N) by just iterating on the board and returning the first empty cell, if present. Instead, we managed to do it in O(1) by keeping track of the following two pieces of information, using O(S) additional per-instance memory:

  • The row index of the first row having an empty cell.
  • The column index of the first empty cell of each row.

To keep this information up to date, on each set_cell operation we update the column index for the particular row in O(S) and then, if the row has no more empty cells, we update the row index accordingly.
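A Python sketch of this bookkeeping (names are illustrative; the real Java board combines it with the other per-cell updates in set_cell):

```python
class FirstEmptyIndex:
    # Sketch of the O(1) get_empty_cell bookkeeping; names are illustrative.
    EMPTY = 0

    def __init__(self, size):
        self.S = size
        self.cells = [[FirstEmptyIndex.EMPTY] * size for _ in range(size)]
        self.first_col = [0] * size  # first empty column of each row (S if none)
        self.first_row = 0           # first row with an empty cell (S if none)

    def set_cell(self, row, col, val):
        self.cells[row][col] = val
        # O(S): recompute the first empty column of the touched row.
        self.first_col[row] = next(
            (c for c in range(self.S)
             if self.cells[row][c] == FirstEmptyIndex.EMPTY), self.S)
        if self.first_col[row] < self.S:
            self.first_row = min(self.first_row, row)
        # Advance the row index past rows with no empty cells left.
        while self.first_row < self.S and self.first_col[self.first_row] == self.S:
            self.first_row += 1

    def get_empty_cell(self):
        # O(1): both indices are already known.
        if self.first_row == self.S:
            return None  # the board is full
        return self.first_row, self.first_col[self.first_row]
```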

Find the empty cell with the lowest number of candidates

As we mentioned in the sections above, the strategy used to pick the empty cell can significantly impact the size of the search space. Indeed, the higher the number of legal candidates for a cell, the lower the probability that our guess for that cell will lead to a Sudoku solution.

Thus, it is intuitively better to always guess values for the cells that have the lowest number of candidates.

To do this without affecting the current complexity of get_empty_cell, similarly to what we did for the strategy above, we keep track of the following two pieces of information, using O(S) additional per-instance memory:

  • The row index of the row that has the cell with the lowest number of candidates.
  • The column index of the cell with the lowest number of candidates of each row.

To keep this information up to date, on each set_cell operation we do three things:

  • For the current row, we search the column with the lowest number of candidates in O(S) * O(1) = O(S).
  • For each row of the current box, we compare the number of candidates at the saved column index with the new number of candidates of all the columns of that box, and we update the saved column index for that row if needed in O(B) * O(B) * O(1) = O(S) * O(1) = O(S).
  • For each row, we compare the number of candidates at the saved column index with the new number of candidates of the current column, and we update the saved column index for that row and the saved row index if needed in O(S) * O(1) = O(S).
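For comparison, here is a naive O(N * S) Python sketch of the heuristic itself, without the incremental index described above (the function name is illustrative):

```python
from math import isqrt

def min_candidates_cell(board, empty=0):
    # Naive version of the heuristic: scan every empty cell and return the
    # one with the fewest legal candidates (None if the board is full).
    S, B = len(board), isqrt(len(board))
    best = None
    for r in range(S):
        for c in range(S):
            if board[r][c] != empty:
                continue
            used = set(board[r]) | {board[i][c] for i in range(S)}
            br, bc = r - r % B, c - c % B
            used |= {board[i][j]
                     for i in range(br, br + B) for j in range(bc, bc + B)}
            n = sum(1 for v in range(1, S + 1) if v not in used)
            if best is None or n < best[0]:
                best = (n, r, c)
    return None if best is None else (best[1], best[2])
```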

Optimized addition and multiplication with BigInteger

In Java, BigInteger objects are immutable and thus every time an operation is executed on them a new object is instantiated.

We implemented two modified versions of the BigInteger class, namely BigIntSum and BigIntProd, that are mutable BigIntegers and allow us to do sums and products in constant amortized time.

In general, the libraries we wrote don't guarantee constant amortized time operations, but they do in our particular use case, in which we mostly sum and multiply together small numbers.

Parallelize branches using the fork/join framework

Java's fork/join framework is easy to reason about, and its work-stealing scheduler provides convenient theoretical guarantees.

A so-called RecursiveTask class (recursive_task in the pseudo-code above) can model a backtracking solver fairly well: each backtracking choice can be tested concurrently by forking on each possible candidate.

To minimize the overhead of task creation we employ two common strategies in the fork/join realm:

  • Reusing the same task rather than creating a new fork to evaluate the first choice.
  • Choosing a sequential cut-off in a way that the workload is distributed evenly among the tasks.

Parallelize board copy

Since the board object has to be modified to fill a cell, each parallel task has to have its own local instance of the board.

Duplicating a board is an expensive operation, so instead of doing it eagerly in the parent task, we supply the RecursiveTask constructor with the change that we want to try, and we make the copy of the board in its compute method, which will be executed in a separate thread.

In this way we offload an expensive computation onto the forked task, decreasing the span.
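The overall structure (fork on the candidates of one cell, copy the board inside each forked task) can be sketched in Python. This is only an illustration of the structure under a thread pool; the project itself uses Java's fork/join framework with a sequential cut-off rather than forking only at the top level:

```python
from concurrent.futures import ThreadPoolExecutor
from math import isqrt

EMPTY = 0

def candidates(board, row, col):
    S, B = len(board), isqrt(len(board))
    used = set(board[row]) | {board[r][col] for r in range(S)}
    br, bc = row - row % B, col - col % B
    used |= {board[r][c] for r in range(br, br + B) for c in range(bc, bc + B)}
    return [v for v in range(1, S + 1) if v not in used]

def first_empty(board):
    return next(((r, c) for r in range(len(board))
                 for c in range(len(board)) if board[r][c] == EMPTY), None)

def count_solutions(board):
    # Plain recursive counter used below the parallel layer.
    cell = first_empty(board)
    if cell is None:
        return 1
    r, c = cell
    total = 0
    for v in candidates(board, r, c):
        board[r][c] = v
        total += count_solutions(board)
        board[r][c] = EMPTY
    return total

def par_sol_counter(board):
    # Fork one task per candidate of the first empty cell; the board copy
    # happens inside the task, not eagerly in the parent.
    cell = first_empty(board)
    if cell is None:
        return 1
    r, c = cell

    def branch(val):
        clone = [line[:] for line in board]  # per-task local board
        clone[r][c] = val
        return count_solutions(clone)

    with ThreadPoolExecutor() as pool:
        return sum(pool.map(branch, candidates(board, r, c)))
```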

Choice of the appropriate sequential cut-off

Due to the overhead involved with the creation of parallel tasks, it's faster to switch to the sequential algorithm when the size of the problem becomes small enough. We will call that threshold problem size the sequential cut-off, or just cut-off.

The way we choose this value mostly depends on what we define as the size of the problem.

We experimented with the two following approaches:

  • The size of the problem is the number of empty cells to fill.
  • The size of the problem is the search space.

We measured limited differences between the two approaches, provided that the value of the cut-off is optimized accordingly, but we settled on the second definition.

The correlation graph in the "benchmark results" section readily explains this result: the search space and the number of empty cells have an almost perfect linear correlation, so one can be used to estimate the other.

Concretely, the optimal sequential cut-off can be found by looking at processor utilization patterns: utilization should be full during most of the computation, and all the CPUs should complete their tasks at roughly the same time. The minimum sequential cut-off also has to account for the task creation overhead.

Performance comparison

We have built a benchmarking tool with which we have evaluated different KPIs of the two implementations.

The data collected during the benchmarks can be found in data/benchmark. Note that all the times are in ΞΌs.

Benchmarking environment

To benchmark our implementation we used a handful of test cases differing mainly in the number of legal solutions.

The table that follows summarizes the main characteristics of the tests.

test name total cells empty cells filling factor solutions search space
1a 81 53 34.57% 1 10^25
1b 81 59 27.16% 4,715 10^36
1c 81 61 24.69% 132,271 10^40
1d 81 62 23.46% 587,264 10^41
1e 81 63 22.22% 3,151,964 10^43
1f 81 64 20.99% 16,269,895 10^45
2a 81 58 28.40% 1 10^31
2b 81 60 25.93% 276 10^35
2c 81 62 23.46% 32,128 10^39
2d 81 64 20.99% 1,014,785 10^43
2e 81 65 19.75% 738,836 10^45
2f 81 66 18.52% 48,794,239 10^47

The test files can be found in src/benchmark/boards.

To get a scalable and homogeneous environment, we leveraged the Google Cloud Infrastructure.

The table that follows shows the specs of the machine we used.

os core cpu
Linux 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64 Intel(R) Xeon(R) CPU @ 2.30GHz

Benchmark results

The benchmark revealed a speed-up that grows almost linearly with the number of cores.

The table that follows shows the speed-up values grouped by test case and core count.

Speedup for test series 1 Speedup for test series 2

Unfortunately, due to the overheads introduced in the parallel algorithm, the speed-up obtained is sometimes slightly smaller than 1, but this happens only on the smaller test cases or with a very low number of cores.

To understand the root cause of the values obtained, we computed a correlation matrix over all the data we gathered.

While the sequential time depends linearly on the number of solutions and is uncorrelated with the number of cores, the factors that make up the parallel time and the speed-up are more varied.

The correlation matrix is muddled by the core count, so we plotted several correlation matrices, grouping the data by core count. On the right is an example: the correlation matrix for the test data obtained on the 64-core machine.

Correlation matrix for all data Correlation matrix for 64 cores

We can clearly see that the speed-up correlates strongly with the search space and the number of empty cells, rather than only with the number of solutions. This is expected, because the search space and the number of empty cells are almost linearly dependent, and the search space determines how much branching our parallel solver can exploit. The same trend can also be seen in the speed-up plot.

The correlation matrices also show us that even though both the parallel and the sequential times are linearly dependent on the number of solutions, the speed-up is not strongly correlated with it.

We can conclude that the speed-up comes from being able to explore the whole search space faster and that it could be improved a little by tweaking the fork/join parameters.

Usage

Sudoku solution enumerator CLI

The project is provided with a CLI that allows you to run the solver on your machine with ease.

If you want to run it locally, you need to run the following commands.

git clone https://github.com/simonepri/sudoku-solver.git
cd sudoku-solver

./sudoku

NB: This will also trigger the build process, so be sure to have the Java JDK installed on your machine before launching it.

Benchmarking suite

Sudoku solution enumerator benchmarking CLI

The project is provided with a CLI that allows you to reproduce the tests results on your machine.

If you want to run it locally, you need to run the following commands.

git clone https://github.com/simonepri/sudoku-solver.git
cd sudoku-solver

./bench

NB: This will also trigger the build process, so be sure to have the Java JDK installed on your machine before launching it.

TIP: You can stop a test by hitting CTRL+C or Command+C.

Development

Clone the repository to your local machine then cd into the directory created by the cloning operation.

git clone https://github.com/simonepri/sudoku-solver.git
cd sudoku-solver

The source code for the sudoku solver can be found in src/main/java/sudoku, while the source code for the unit tests and the benchmarking suite can be found in src/test/java/sudoku and src/benchmark respectively.

Build the project, run the unit tests and run the CLI.

# On Linux and Darwin
./gradlew build
./gradlew test
./gradlew run

# On Windows
./gradlew.bat build
./gradlew.bat test
./gradlew.bat run

NB: You will need the Java JDK installed on your machine to build the project.

Authors

License

This project is licensed under the MIT License - see the license file for details.