multi-leven

_{Nicholas Kyriakides (@nicholaswmin)

Euthor LTD

Nicosia, Cyprus

2019.}

multi-leven

The Levenshtein-distance on multiple processes, in NodeJS

A parallelisation of leven.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

This is a working example of a special case of parallelization, called Data Parallelism. In single-threaded runtimes (such as NodeJS), short of messing with native C++ NodeJS modules, it is in effect the only viable way to perform parallelization of typical fuzzy search algorithms.

This experiment was created to speed up the search process of a proprietary AML (Anti-Money Laundering) system that fuzzily links customers of financial organisations to entries in OFAC's Politically-Exposed Persons, Specially Designated Nationals And Blocked Persons/Entities.

Warning: If you're a designer/architect/engineer of an AML/KYC system, please ensure you read the Warning section below.

Caution

If you're a designer/architect/engineer of an AML/KYC system, please ensure you read the Warning section below.

Installation

TBD

Usage

const MultiLeven = require('multi-leven')

// Ideally, `workersNum` reflects the number of physical cores on your
// system.
const searcher = new MultiLeven(['John', 'Mary', 'Foo', 'Bar', 'Baz'], {
  workersNum: 4
})

searcher.init()
  .then(() => searcher.search('Jonn'))
  .then(results => {
    console.log(results) // logs `['John']`
  })

Note: Ideally, you will invoke init() only once, on startup. From then on you can call .search(arg) as many times as you want.

You should use this tactic because initialisation of very large input arrays is slow due to IPC serialisation overhead.

How it works

The passed input array is split in chunks, equal to the numbers of requested workers. Each chunk is loaded in it's own worker.

When you .search(arg), it simply fires simultaneous requests to search for the passed name in each worker.

The workers are not terminated when the search completes. They stay up with their own chunk of the input array waiting for the next search.

Tests

$ npm test

Performance testing

These are some preliminary performance results. More workers predict better performance.

Creating 4000000 input items,
equalling 198 MB's of test data.


  Process on (1) worker
    #init
      ✓ initializes (4488ms)
    #search
      ✓ finds a result (14411ms)

  Process on (4) workers
    #init
      ✓ initializes (3938ms)
    #search
      ✓ finds a result (3940ms)


  4 passing (27s)

Tests were run on an iMac 2013, 3.2GHz i5, 8GB RAM running MacOS 10.12.6

Authors

Nicholas Kyriakides, @nicholaswmin

Todos

Add teardown methods
Include difference in each returned match (or pass difference tolerance to filter out irrelevant matches)

Warning for use in AML systems

The following warning pertains only for designers/architects/engineers of AML (Anti-Money Laundering) and KYC (Know Your Customer) systems.

The Levenshtein-distance, in particular, is very widely used in AML systems. This code is experimental and not fit for use as-is in such systems. Even if it was, a robust AML/KYC system should NOT rely on one type of matching algorithm to cross-match persons/entities to blacklisted persons/entities.

This code, as-is, is NOT appropriate for use of fuzzy-matching in such systems. First and foremost, it's method of operation is publicly available. Secondly, it's an experiment, it is not designed to adhere to formal matching performance requirements nor does it have a complete and automated test suite to validate its results.

Because of the very sensitive nature of the goals of such systems, you should use a multi-attribute matching mechanism, including the use of extra approximate-string matching algorithms (Soundex, Jaro–Winkler distance, etc).

There are multiple known instances of bad actors tricking such systems simply by knowing how the screening/matching mechanism of a financial organisation works, in some cases even the financial organisations themselves trick their own system intentionally to allow blacklisted persons and entities; access to the financial system without legal repercussions.

The internal workings of matching mechanisms of AML/KYC systems should be kept confidential and on a need-to-know basis.

For more information on this topic, the FATF has published a paper that lists and describes various methods of designing robust AML/KYC systems. You can find the paper here: Opportunities and Challenges of New Technologies for AML/CFT

License

MIT License

Massachusets Institute of Technology, SPDX: MIT

Copyright 2019 Nikolas Kyriakides (@nicholaswmin)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
index.js		index.js
package.json		package.json
worker.js		worker.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test

test

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

index.js

index.js

package.json

package.json

worker.js

worker.js

Repository files navigation

multi-leven

Installation

Usage

How it works

Tests

Performance testing

Authors

Todos

Warning for use in AML systems

License

MIT License

About

Releases

Packages

Languages

License

nicholaswmin/multi-leven

Folders and files

Latest commit

History

Repository files navigation

multi-leven

Installation

Usage

How it works

Tests

Performance testing

Authors

Todos

Warning for use in AML systems

License

MIT License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages