Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cython implementation of minimum_hash is 3x faster #98

Open
alexjc opened this issue Sep 25, 2023 · 1 comment
Open

Cython implementation of minimum_hash is 3x faster #98

alexjc opened this issue Sep 25, 2023 · 1 comment

Comments

@alexjc
Copy link

alexjc commented Sep 25, 2023

I know this is a reference implementation, but it's the one that uses the iscc package name on PyPI and as such will be the most used. Thus, here's a big performance suggestion.

The bottleneck in hashing long texts is minimum_hash. This is a much faster version. Compile with

CFLAGS="-O3" cythonize -a -i utils.pyx

This is the code:

# cython: language_level=3

from cython.view cimport array

MINHASH_PERMUTATIONS = [
    # Copy values from iscc/const.py
]

cpdef unsigned int[:] cython_minimum_hash(unsigned int[:] features, int n=64):
    cdef unsigned long max_int64 = (1 << 64) - 1
    cdef unsigned long mersenne_prime = (1 << 61) - 1
    cdef unsigned long max_hash = (1 << 32) - 1
    cdef unsigned long a, b, min_val
    cdef unsigned int[:] result = array(shape=(n,), itemsize=sizeof(unsigned int), format="I")

    cdef int i, j
    for i in range(n):
        a, b = MINHASH_PERMUTATIONS[i]        
        min_val = max_int64
        for j in range(features.shape[0]):
            v = ((a * features[j] + b) & max_int64) % mersenne_prime
            min_val = min(min_val, v & max_hash)
        result[i] = <unsigned int>min_val
    return result

You can also use setup.py to build the module:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "utils",
        ["utils.pyx"],
        extra_compile_args=['-O3'],
        language='c',
    )
]

setup(
    ext_modules = cythonize(
        ext_modules,
        compiler_directives={
            'language_level' : '3',
            'linetrace': False,
            'binding': False
        }
    )
)

Until it's integrated, I monkey patch it like so:

from utils import cython_minimum_hash

def fast_minimum_hash(generator, n=64):
    array = np.array(list(generator), dtype=np.uint32)
    result_fast = cython_minimum_hash(array, n)
    return result_fast

iscc.iscc.minimum_hash = fast_minimum_hash

The last function is a good place to insert comparison of the results with the slow version. I hashed hundreds of long documents to check they are identical.

@titusz
Copy link
Member

titusz commented Oct 5, 2023

The iscc package on PyPI is out of date. It will eventually be updated when the final ISO standard is published.

The current reference implementation can be found at https://github.com/iscc/iscc-core.

We have added some basic optional Cython support there. Pull requests for performance improvement are very welcome. An up-to-date higher level lib with content extraction support can be found here: https://github.com/iscc/iscc-sdk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants