Search & scoring algorithm
The search and scoring algorithm works in two steps:
- Candidate generation
- Candidate scoring
Both steps matter when interpreting the output of a search.
1. Candidate generation
When we receive a search query — typically a name, optionally with other fields like date of birth, registration number, or country — we first run an initial query against a search engine (Elasticsearch, to be precise). This step searches primarily by name, in several ways:
- Fuzzy full name matching, normalized so that characters are simplified (e.g. é→e, case folding, etc.). These matches come up at the top of the results list.
- Token-based matching: the input name is cut into pieces, and search is done using both the unmodified (normalized) name parts and a phonetic transformation of those parts.
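The normalization and token splitting described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names are hypothetical, and `phonetic_key` is a deliberately simple stand-in, since the real phonetic transformation is not specified here.

```python
import unicodedata


def normalize(name: str) -> str:
    """Simplify characters: strip accents, fold case, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


def tokens(name: str) -> list:
    """Cut a name into its normalized parts."""
    return normalize(name).split()


def phonetic_key(token: str) -> str:
    """Toy phonetic transformation (an assumption, not the real one):
    keep the first letter, drop later vowels and h/w/y, and collapse
    consecutive repeats among the letters that remain."""
    if not token:
        return ""
    key = token[0]
    for ch in token[1:]:
        if ch in "aeiouhwy":
            continue
        if ch != key[-1]:
            key += ch
    return key
```

With this sketch, `normalize("José GARCÍA")` gives `"jose garcia"`, and `phonetic_key("john")` and `phonetic_key("jon")` produce the same key, which is the kind of equivalence the phonetic search relies on.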
If secondary search attributes are passed (e.g. birth date or country), they improve the ranking of the candidates that match them: all else being equal, candidates with a matching birth date rank higher.
This step typically generates a large number of candidates — usually in the hundreds or thousands.
At this stage, the score that the threshold is later applied to has not yet been computed. An internal score is used, but only for ordering — its absolute value is not meaningful.
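As an illustration of step 1, the function below builds an Elasticsearch request body that combines the name strategies and uses secondary attributes only to boost ranking. It is a sketch under stated assumptions: the field names (`name`, `name_parts`, `name_phonetic`, `birth_date`, `country`), the boost value, and the mapping of the "result limit" to `size` are all hypothetical and do not describe the real index or query.

```python
from typing import Optional


def build_candidate_query(
    name: str,
    birth_date: Optional[str] = None,
    country: Optional[str] = None,
    result_limit: int = 100,
) -> dict:
    """Build a request body for candidate generation (hypothetical fields)."""
    # At least one name clause must match for a document to be a candidate.
    # The fuzzy full-name clause is boosted so those hits rank first.
    name_clauses = [
        {"match": {"name": {"query": name, "fuzziness": "AUTO", "boost": 2.0}}},
        {"match": {"name_parts": {"query": name}}},
        {"match": {"name_phonetic": {"query": name}}},
    ]

    # Secondary attributes are optional clauses: they never exclude a
    # candidate, they only raise the rank of candidates that match.
    boosts = []
    if birth_date is not None:
        boosts.append({"term": {"birth_date": birth_date}})
    if country is not None:
        boosts.append({"term": {"country": country}})

    return {
        "size": result_limit,  # assumed link to the "result limit" setting
        "query": {
            "bool": {
                "must": [
                    {"bool": {"should": name_clauses, "minimum_should_match": 1}}
                ],
                "should": boosts,
            }
        },
    }
```

Because the outer `bool` query has a `must` clause, its `should` clauses only contribute to the internal ordering score, which matches the behavior described above.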
2. Candidate scoring
All candidates from step 1 are re-scored by a more precise algorithm. This score is normalized — or more precisely, it depends only on the candidate's attributes and the search query — which makes it meaningful in absolute terms. As a result, it is suitable for display to the user and for applying a threshold.
At a high level, the default algorithm works as follows:
- The main comparison is on the names. An exact literal match yields 100% similarity. Otherwise, the name is split into components, and we apply a combination of string edit distance per component and phonetic match per component. The score is the larger of the two, with a multiplier (by default 80% to 90%).
- Secondary search attributes (birth date, country, ...) can only decrease the score in this step. For example:
  - Query `name: john doe` against entity `{name: John Doe, birthDate: 1990-01-01}` → matches at 100%.
  - Query `name: john doe; birthDate: 1975-02-01` against the same entity → matches at 85%.
The final score is compared against the threshold; only candidates above the threshold are returned.
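The scoring logic can be approximated with the sketch below. It is not the production algorithm: the two multipliers, the mismatch penalty (chosen only so that the 85% example is reproduced), the way components are paired, and the toy `phonetic_key` are all assumptions made for illustration.

```python
EDIT_MULTIPLIER = 0.8      # assumed multiplier for edit-distance matches
PHONETIC_MULTIPLIER = 0.9  # assumed multiplier for phonetic matches
MISMATCH_PENALTY = 0.85    # assumed; reproduces the 85% example


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def phonetic_key(token: str) -> str:
    """Toy phonetic key (assumption): drop later vowels and h/w/y."""
    return token[:1] + "".join(ch for ch in token[1:] if ch not in "aeiouhwy")


def name_score(query_name: str, entity_name: str) -> float:
    q = query_name.casefold().split()
    e = entity_name.casefold().split()
    if not q or not e:
        return 0.0
    if q == e:
        return 1.0  # exact match after case folding
    # Per component: best edit-distance similarity against any entity part.
    edit = sum(
        max(1 - levenshtein(qp, ep) / max(len(qp), len(ep)) for ep in e)
        for qp in q
    ) / len(q)
    # Per component: does its phonetic key match any entity part?
    keys = {phonetic_key(ep) for ep in e}
    phonetic = sum(phonetic_key(qp) in keys for qp in q) / len(q)
    return max(edit * EDIT_MULTIPLIER, phonetic * PHONETIC_MULTIPLIER)


def score(query: dict, entity: dict) -> float:
    result = name_score(query["name"], entity["name"])
    for field, value in query.items():
        # Secondary attributes can only lower the score, never raise it.
        if field != "name" and field in entity and entity[field] != value:
            result *= MISMATCH_PENALTY
    return result
```

For the entity `{"name": "John Doe", "birthDate": "1990-01-01"}`, the query `{"name": "john doe"}` scores 1.0 and adding `"birthDate": "1975-02-01"` lowers it to 0.85, in line with the examples above. A query with a matching birth date leaves the name score unchanged.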
Remarks on the two-step process
- The most important thing to realize is that a search that is not precise enough can produce false negatives, not only excessive false positives: step 1 returns a limited number of candidates, so on a large dataset a common name can crowd the true match out before it is ever scored. Concretely, this means that PEP and Adverse Media searches in particular are expected to work much better with secondary search attributes such as birth date or country, and not only because there will be fewer false positives.
- Sanctions searches, which run against smaller datasets, are the only area where we expect name-only matching to be very reliable.
- The number of candidates produced in step 1 currently depends on the "result limit", which is awkward; we intend to change this and use an organization-level default setting instead. If a manual search does not return some results that were expected, raising the "result limit" can be a good workaround.
Recommendations on matching percentage
Below are example scores you may expect, scenario by scenario:
| Scenario | Expected score |
|---|---|
| Name-only search, exact match | 100% |
| Name exact match; country or birth-date mismatch | 80% |
| Company search, exact name match; registration-number mismatch | 80% |
| Name match in phonetic only (raw string mismatch); birth-date match or no birth date passed | 90% |
| Partial name match in string components or phonetic; birth-date match or no birth date passed | 60–80% |
| Partial name match in string components or phonetic; birth-date or country mismatch | 40–60% |
| Vessel (ship) search; IMO or MMSI exact match | 95% |
| Name-only search; phonetic or string similarity match on half the name (e.g. first name only) | 40–50% |
In most scenarios, matches scoring below 50% are almost irrelevant. Fine-tuning mostly happens in the 50–80% range, depending on risk sensitivity, but only to limit false positives. False negatives are addressed with more precise search filters (such as birth date or country), not by adjusting the threshold.
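To see how a given threshold interacts with the table, the sketch below groups a few of the scenarios by whether their expected score range clears the threshold. The score ranges are copied from the table; the shortened labels, the `classify` helper, and the choice to treat a score equal to the threshold as kept are assumptions for illustration.

```python
# (low, high) expected score per scenario, in percent, taken from the table.
SCENARIOS = {
    "name-only exact match": (100, 100),
    "phonetic-only name match": (90, 90),
    "exact name, country or birth-date mismatch": (80, 80),
    "partial name match, birth date matches or absent": (60, 80),
    "partial name match, birth-date or country mismatch": (40, 60),
    "half of the name matches, name-only search": (40, 50),
}


def classify(threshold: int) -> dict:
    """Group scenarios by how their score range compares to the threshold."""
    groups = {"kept": [], "partly kept": [], "dropped": []}
    for label, (low, high) in SCENARIOS.items():
        if low >= threshold:
            groups["kept"].append(label)         # whole range clears it
        elif high >= threshold:
            groups["partly kept"].append(label)  # only the top of the range
        else:
            groups["dropped"].append(label)      # whole range is below it
    return groups
```

For example, `classify(70)` keeps the three exact or phonetic-only scenarios, partly keeps partial name matches with a consistent birth date, and drops the two weakest scenarios.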
Technical appendix
The components of the algorithm currently used by default with Marble are documented here: https://www.opensanctions.org/matcher/#logic-v1.