String similarity scoring

Ratio algorithm

The ratio scoring is well adapted to compare to strings that are expected to be nearly equal (up to some character permutations).

It does not yield useful results to compare strings of greatly different lengths even one is included in the other, or to compare the word contents of longer strings composed of words.

Left string	Right string	Score
tea time	teatime	93
the dog was walking on the sidewalk	the dog was walking on the side walk	98
the dog was walking on the sidewalk	the d og as walkin' on the side alk	91
Mr John Apfelstrudel	John Apfelstrudel	92
Mr Mrs John Jane OR Doe Smith	John Doe	43
tree apple peach pear tomato salad	apple peach	49
banana	ananas	83
banana	strawberry	12

Use it if you have a high confidence that the two strings you want to match are similar in length and content, and just want to control for small input errors or such.

Even then, be careful that strings "accidentally" similar may have a high similarity score even though they have obviously different meanings ("banana" and "ananas").

Token set ratio

The token set ratio scoring works to match strings that are composed of the same words, allowing for minor differences between the two sides. In a banking context, it typically works best to match names or transactions labels, where the simple "ratio" usually fails.

However, it will easily return high scores if all the words on one side are included in the other, which can happen easily for small words, so be careful what you match against: with token set ratio, the risk is to overfit rather than underfit.

The one case where token set ratio will return a lower score than a simple ratio is for two strings that look broadly similarity from "high level" point of view, with lots of small errors on the words (see row 3 in the table below).

Note that the two will produce the same score when matching single-word strings.

Left string	Right string	Score
tea time	teatime	93
the dog was walking on the sidewalk	the dog was walking on the side walk	99
the dog was walking on the sidewalk	the d og as walkin' on the side alk	72
Mr John Apfelstrudel	John Apfelstrudel	100
Mr Mrs John Jane OR Doe Smith	John Doe	100
tree apple peach pear tomato salad	apple peach	100
tree apple peach pear tomato the salad	the	100

So which one should I use ?

Ultimately, it depends on your use case and we cannot decide for you. But you can follow the questions below to help you decide.

Do you want to have a very high level of confidence that only strings that are almost identical (up to a few insertions/deletions) return a high score, at the cost of rejecting medium similarity matches?
- Consider using the simple ratio score
Does one of the strings you want to match come from an unsanitized human input (e.g. a transaction label or an email object)?
- Yes: you should probably opt for token set ratio, unless you believe you know almost exactly the said human input (unlikely in most cases)
Do you want to match series of words without respect for their order?
- Definitely use token set ratio

In short, token set ratio is likely to be more versatile, you should just be careful with accidental matches, especially if one side of the comparison is really short.

What about the matching levels in Marble

Once you have chosen an algorithm, you still need to define what is a reasonable matching level.

In the Marble rule builder, we propose two predefined levels: HIGH which corresponds to a score of 85% or higher, and MEDIUM which corresponds to a score of 70% or higher.

We believe that matching scores below 70% are often hard to interpret as a match, because a score of 50-60% can easily occur "accidentally".

Other possibilities

There exist plenty of great articles on the different string similarity algorithms (in particular those derived from the python "rapidfuzz" library, as used in Marble). Here are some links link 1, link 2.

You will see that other similarity scores exist, which we have chosen not to include in the Marble rule builder for now, for the sake of simplicity. If you feel like you need to use one of those variants, feel free to reach out to us.