String comparison is important for topics such as natural language processing and record linkage. This post gives a few examples of string comparison techniques that you may wish to consider.
String Comparison Techniques
Each of these string comparison techniques makes different assumptions or simplifications. You may wish to try several techniques or use a hybrid approach which combines several metrics together to make an overall comparison.
Edit Distance – Levenshtein Distance
Levenshtein distance is an example of an ‘edit distance‘ metric. It shows you how many changes (deletions, insertions, or substitutions) you would need to make to string ‘A’ to get it to equal string ‘B’.
Try out this interactive example:
Similarity Coefficient – Jaccard Index
The Jaccard Index is also known as the ‘intersection over union’ measure and is applicable to any set, not just string comparison. For string comparison the words or individual characters in a string are represented as a ‘set’ before the index is computed.
The intersection is the number of words or characters that are the same in both strings (sets). The union is the combination of all words or characters in both strings (sets).
Try out this interactive example:
Phonetic Algorithm – Soundex
Soundex is an example of a ‘phonetic algorithm‘. It works be reducing words down to a simplified approximation of the essential sounds that make them up. Soundex and other phonetic algorithms are frequently used for comparing names where there may be different spellings for the same or similar names.
For example ‘Steven’ and ‘Stephen’ both become ‘S315’, so would be regarded as having an identical sound.
See the code for an implementation of the soundex algorithm