==============================================================================
================== File-tests - File regressions detection ===================
==============================================================================

Author: Jan Kaluza <jkaluza@redhat.com>



1) WHY IS IT USEFUL?
====================

It is not feasible to check for regressions between File releases manually.
Unfortunately, File cannot be 100% accurate, and even a small change in one
magic pattern can cause a big regression.

This project aims to make comparing the outputs of different File versions
easier and provides a database of files in different formats.



2) HOW TO USE IT?
=================

This howto presumes that the current working directory (CWD) is the root of
the git repository (the directory that contains this README file).

2.1: Prepare the Magdir:
------------------------

You have to provide the Magdir directory with magic patterns from the File
sources of exactly the same File version as the one you want to test. This is
needed to identify the particular magic pattern responsible for a particular
file detection.

You can just copy Magdir into the CWD:

$ cp -pr /path/to/file/magic/Magdir ./

2.2: Update the database with the current/old File version:
-----------------------------------------------------------

Note that this whole procedure needs 3.5 GB of free space, because the
patterns are split and compiled by File individually. See the
HOW DOES IT WORK section for more details.

$ python update-db.py <unique-name> [path_to_magdir]

For example, in our case:

$ python update-db.py file-5.07 Magdir

2.3: Install the new File version you want to test for regressions:
-------------------------------------------------------------------

$ sudo rpm -U my-file.rpm my-file-libs.rpm ...

2.4: Run fast-regression-test:
------------------------------

The fast regression test runs the currently installed File and compares its
output to the output stored in the db/*/*.json files. If there is a
significant difference (see HOW DOES IT WORK) between the two outputs, the
test is shown as FAIL.

$ python fast-regression-test.py

If you run this test with the "exact" argument, a test will FAIL even if
there is only a small change in the File output.

2.5: Identify patterns causing the regressions:
-----------------------------------------------

With the compare-db.py script, you can identify the patterns causing a
regression. First, prepare the Magdir from the current File version as
described in step "2.1: Prepare the Magdir".

Note that this whole procedure needs 3.5 GB of free space, because the
patterns are split and compiled by File individually. See the
HOW DOES IT WORK section for more details.

$ python compare-db.py <unique-name> [path_to_magdir]

For example, in our case:

$ python compare-db.py file-5.10 Magdir

If you run this test with the "exact" argument, a test will FAIL even if
there is only a small change in the File output:

$ python compare-db.py file-5.10 Magdir exact
or
$ python compare-db.py file-5.10 exact



3) HOW DOES IT WORK?
====================

There is a database of files in the "db" directory.

3.1: update-db.py script:
-------------------------

First, this script splits all the patterns from the Magdir directory into
separate files, so that there is one magic pattern per file. Those files are
stored in ./.mgc_temp/<file_name_hash>/output.
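
The splitting step could look roughly like the sketch below. It is not taken
from update-db.py. It assumes that a new pattern starts at every line that is
not blank, not a comment ("#"), not a continuation (">") and not a "!:"
directive. The function name split_patterns and the sample entries are made
up for illustration.

```python
def split_patterns(magic_text):
    """Split the text of a magic file into one string per top-level pattern."""
    patterns = []
    current = []
    for line in magic_text.splitlines():
        stripped = line.strip()
        # Blank lines and comments do not belong to any pattern.
        if not stripped or stripped.startswith("#"):
            continue
        # ">" lines and "!:" directives extend the pattern opened before them.
        is_continuation = line.startswith(">") or line.startswith("!:")
        if not is_continuation and current:
            patterns.append("\n".join(current) + "\n")
            current = []
        current.append(line)
    if current:
        patterns.append("\n".join(current) + "\n")
    return patterns


# Simplified, illustrative magic entries (not copied from a real Magdir).
SAMPLE = (
    "# Zip archives\n"
    "0\tstring\tPK\\003\\004\tZip archive data\n"
    ">4\tbyte\t0x14\t, v2.0\n"
    "\n"
    "0\tstring\t%PDF-\tPDF document\n"
    "!:mime\tapplication/pdf\n"
)

if __name__ == "__main__":
    for number, pattern in enumerate(split_patterns(SAMPLE)):
        print("pattern %d:" % number)
        print(pattern)
```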

After that, patterns are compiled using this algorithm:

N = 0
pattern_to_compile = ""
for every pattern:
	pattern_to_compile += pattern
	compile pattern_to_compile
	save it as ./.mgc_temp/<file_name_hash>/.find-magic.tmp.N
	N++
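
The loop above can be sketched in Python as follows. This is only an
illustration, not the code of update-db.py: the function name
write_cumulative_files is hypothetical, and the compile step itself is left
out (it could, for instance, call "file -C -m <path>" on each written file).
File N contains patterns 0..N, so the total size grows quickly with the
number of patterns.

```python
import os
import tempfile


def write_cumulative_files(patterns, out_dir):
    """Write one file per prefix: file N holds patterns[0..N] concatenated."""
    paths = []
    pattern_to_compile = ""
    for n, pattern in enumerate(patterns):
        pattern_to_compile += pattern
        path = os.path.join(out_dir, ".find-magic.tmp.%d" % n)
        with open(path, "w") as handle:
            handle.write(pattern_to_compile)
        # A real implementation would compile `path` here.
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    for written in write_cumulative_files(["a\n", "b\n", "c\n"], demo_dir):
        print(written)
```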

Then File is run on every file in the db directory. During detection, the
particular pattern responsible for each result is found using the bisection
method.
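
A minimal sketch of such a bisection, assuming that the output of File stays
the same for every N at or above the responsible pattern. The names
find_responsible_pattern and output_with_prefix are hypothetical, and the
callable stands in for running File with the compiled file number N.

```python
def find_responsible_pattern(count, output_with_prefix):
    """Return the smallest N whose cumulative file reproduces the full output.

    count: number of cumulative files (N runs from 0 to count - 1).
    output_with_prefix: callable mapping N to the File output obtained with
        the cumulative file N (hypothetical helper, supplied by the caller).
    """
    if count <= 0:
        raise ValueError("at least one cumulative file is required")
    # The last file contains every pattern, so it gives the reference output.
    target = output_with_prefix(count - 1)
    low, high = 0, count - 1
    while low < high:
        mid = (low + high) // 2
        if output_with_prefix(mid) == target:
            high = mid       # responsible pattern is at mid or earlier
        else:
            low = mid + 1    # responsible pattern comes after mid
    return low


if __name__ == "__main__":
    # Fake detector: the pattern with index 5 is the one that matters.
    def fake_output(n):
        return "PDF document" if n >= 5 else "data"

    print(find_responsible_pattern(10, fake_output))
```

With this approach, each tested file needs only about log2(count) runs of
File instead of one run per pattern.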

All metadata are stored in ./db/*/*.json files.

3.2: fast-regression-test.py script:
------------------------------------

This script simply reads the metadata stored in the ./db/*/*.json files and
compares them with the output of the currently installed File version.

In non-exact mode, regressions are detected using the Ratcliff/Obershelp
pattern matching implemented by the SequenceMatcher class of Python's difflib
module. A regression is reported if the ratio returned by SequenceMatcher is
lower than 0.7.

In exact mode, the old and new output have to match exactly.
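
The two modes can be sketched like this. It is an illustration built on the
description above, not the script's actual code. The function name
is_regression and the sample outputs are made up.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.7  # ratio below which a change counts as a regression


def is_regression(old_output, new_output, exact=False):
    """Return True if new_output differs too much from old_output."""
    if exact:
        # Exact mode: any difference is a regression.
        return old_output != new_output
    ratio = SequenceMatcher(None, old_output, new_output).ratio()
    return ratio < THRESHOLD


if __name__ == "__main__":
    old = "PDF document, version 1.4"
    new = "PDF document, version 1.5"
    print(is_regression(old, new))              # small change, non-exact mode
    print(is_regression(old, new, exact=True))  # same change, exact mode
    print(is_regression(old, "data"))           # very different output
```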

3.3: compare-db.py script:
--------------------------

This script does the same work as the update-db.py script, but instead of
storing metadata in the ./db directory, it compares them with the metadata
already stored there by a previous run of the update-db.py script.

The comparison is done the same way as in the fast-regression-test.py script.