Skip to content

SaswatPadhi/ExcelSynth

Repository files navigation

ExcelSynth

 

An enumerative synthesizer for recovering Excel formulas from CSVs.

Input: CSV File (samples/example.csv) Output: Formula Mask
                
Col 1  ,  Col 2  ,  Col 3  ,  Col 4  ,  Col 5
Row 2  ,  1.     ,  10.    ,  9.5    ,  24.
Row 3  ,  23.    ,  12.    ,  0.5    ,  35.
Row 4  ,  22.    ,  2.     ,  -9.    ,  24.
Row 5  ,  -1.    ,  6.     ,  6.5    ,  5.
Row 6  ,  59.    ,  0.     ,  -29.5  ,  41.
Row 7  ,  11.    ,  -2.    ,  -7.5   ,  9.
Row 8  ,  115.   ,  14.    ,  -43.5  ,  23.
                
            
                
 ,             ,                       ,                   ,
 ,             ,                       , =(C2-(B2/(1.+1.)) ,
 ,             ,                       , =(C3-(B3/(1.+1.)) ,
 ,             ,                       , =(C4-(B4/(1.+1.)) ,
 ,             ,                       , =(C5-(B5/(1.+1.)) ,
 ,             ,                       , =(C6-(B6/(1.+1.)) ,
 ,             ,                       , =(C7-(B7/(1.+1.)) ,
 , =SUM(B2:B7) , =(SUM(C2:C7)/(1.+1.)) , =(C8-(B8/(1.+1.)) , =AVERAGE(E2:E7)
                
            

Installation

  1. Get docker for your OS.
  2. Pull the docker image#: docker pull padhi/excelsynth.
  3. Run a container over the image: docker run -it padhi/excelsynth.
    This would give you a bash shell within ExcelSynth directory.
  4. To run ExcelSynth on samples/unit_test.csv, execute: dune exec bin/App.exe -- samples/unit_test.csv
  5. To run the unit tests, execute: dune runtest

# Alternatively, you could also build the Docker image locally:

docker build -t padhi/excelsynth github.com/SaswatPadhi/ExcelSynth

Usage

Formula Synthesis from CSV

$ dune exec bin/App.exe -- -h
Synthesize Excel formulas for a CSV file.

  App.exe [flag] ... FILENAME

=== flags ===

  [-check-last-col-aggregations BOOLEAN]     synthesize aggregation formulas for
                                             cells in the last column
  [-check-last-row-aggregations BOOLEAN]     synthesize aggregation formulas for
                                             cells in the last row
  [-check-pointwise-col-operations BOOLEAN]  synthesize pointwise
                                             transformations for columns
  [-check-pointwise-row-operations BOOLEAN]  synthesize pointwise
                                             transformations for rows
  [-constant FLOAT] ...                      additional Boolean/numeric/string
                                             constants
  [-disable-constant-solutions BOOLEAN]      disable constant formulas (e.g.
                                             =0.0) for cells
  [-enable-2d-aggregation BOOLEAN]           use 2D ranges in aggregation
                                             operations
  [-enable-booleans BOOLEAN]                 enable Boolean and conditional
                                             expressions
  [-log-path FILENAME]                       enable logging and output to the
                                             specified path
  [-mask-path FILENAME]                      a known formula mask for the CSV
                                             file
  [-max-expr-size INTEGER]                   maximum cost (AST size) of
                                             expressions to explore
  [-max-threads INTEGER]                     maximum number of threads to create
  [-range STRING]                            a range (in RC:R'C' format) that
                                             bounds the synthesis space
  [-relative-error FLOAT]                    the fractional relative error
                                             allowed in float comparisons
  [-restrict-to-top-left-data BOOLEAN]       only use data to the top left of a
                                             cell in formulas
  [-type-error-threshold FLOAT]              maximum fraction of cells that may
                                             be ignored due to type errors
  [-value-error-threshold FLOAT]             maximum fraction of cells that may
                                             be ignored due to value errors

Bulk Processing (scripts/evaluate.sh)

The following input directory structure is required:

<data>
 |
 +-- table_ranges.csv              <--- Contains table ranges for CSV files
 |
 +-- evaluated_csvs                <--- Contains fully evaluated CSV files
 |    |
 |    +-- <a>.csv
 |    |
 |    `-- <b>.csv
 |
 +-- formula_csvs                  <--- Contains CSV files with formulas
      |
      +-- <a>.csv
      |
      `-- <b>.csv

The following output data is generated within this directory:

<data>
 |
 : · · ·
 |
 : · · ·
 |
 +-- extracted_masks               <--- Contents generated by scripts/extract_mask.py
 |    |
 |    +-- <a>.csv                  <--- Ground truth mask from `../formula_csvs/<a>.csv`
 |    |
 |    `-- <b>.csv
 |
 +-- recovered_masks               <--- Contents generated by scripts/recover_mask.py
 |    |
 |    +-- Baseline                 <--- Unrestricted synthesis (over whole sheet)
 |    |    |
 |    |    +-- <a>.csv             <--- Synthesized mask from `../evaluated_csvs/<a>.csv`
 |    |    |
 |    |    `-- <b>.csv
 |    |
 :    : · · ·
 |    |
 |    `-- <table_detector_n>       <--- Synthesis restricted to tables from <table_detector_n>
 |         |
 |         +-- <a>.csv             <--- Synthesized mask from `../evaluated_csvs/<a>.csv`
 |         |
 |         `-- <b>.csv
 |
 `-- comparison_masks              <--- Contents generated by scripts/compare_masks.py
      |
      +-- full                     <--- All cells within a recovered mask are checked
      |    |
      |    +-- Baseline            <--- Evaluation of unrestricted-synthesis masks
      |    |    |
      |    |    +-- <a>.csv        <--- Evaluation of `../recovered_masks/Baseline/<a>.csv`
      |    |    |
      |    |    `-- <b>.csv
      |    |
      :    : · · ·
      |    |
      |    `-- <table_detector_n>  <--- Evaluation of restricted-synthesis masks
      |         |
      |         +-- <a>.csv
      |         |
      |         `-- <b>.csv
      |
      `-- in-table                 <--- Only in-table cells are checked
           |
           +-- Baseline
           |    |
           |    +-- <a>.csv
           |    |
           |    `-- <b>.csv
           |
           : · · ·
           |
           `-- <table_detector_n>
                |
                +-- <a>.csv
                |
                `-- <b>.csv

Extract A Formula Masks from CSVs

$ python3 scripts/extract_mask.py -h
usage: extract_mask.py [-h] -i INPUT_DIR -o OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit

  -i INPUT_DIR, --input-dir INPUT_DIR
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR

$ python3 scripts/extract_mask.py -i data/formula_csvs -o data/extracted_masks

Recover Formula Masks from CSVs

$ python3 scripts/recover_mask.py -h
usage: recover_mask.py [-h] -e EVAL_CSV_DIR -o OUTPUT_DIR -c TABLE_RANGE_COLUMN
                       tables_data_csv

positional arguments:
  tables_data_csv

optional arguments:
  -h, --help            show this help message and exit

  -e EVAL_CSV_DIR, --eval-csv-dir EVAL_CSV_DIR
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
  -c TABLE_RANGE_COLUMN, --table-range-column TABLE_RANGE_COLUMN

$ python3 scripts/recover_mask.py -e data/evaluated_csvs -o data/recovered_masks \
                                  -c 1 data/table_ranges.csv

tables_data_csv has extracted table ranges in TABLE_RANGE_COLUMN (0-indexed).

Compare Formula Masks

$ python3 scripts/compare_masks.py -h
usage: compare_masks.py [-h] -g GROUND_TRUTH_DIR -p PREDICTION_DIR -o OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit

  -g GROUND_TRUTH_DIR, --ground-truth-dir GROUND_TRUTH_DIR
  -p PREDICTION_DIR, --prediction-dir PREDICTION_DIR
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR

$ python3 scripts/compare_mask.py -g data/extracted_masks \
                                  -p data/recovered_masks/table_detector_1 \
                                  -o data/comparison_masks/table_detector_1