More SQLServer String Comparisons

There are a few methods of comparing strings in SQLServer. To start with, you can use normal comparison operators such as < and > and even =. I can not think of many useful reasons for doing this but let’s talk about what the results are of doing so. The exact results you will get are determined by the character set and collation you are using. There is also a dependency on the server you are using since some servers are case-sensitive. SQLServer however is not case-sensitive.

To be honest, the best thing to do is run the example provided and see how it behaves in whatever character set and collation you are using. I use the default collation SQL_Latin1_General_CP1_CI_AS and the results table I show reflects that.

CREATE TABLE ##table1 (
	  Id INT
	, text1 VARCHAR(50)
	, text2 VARCHAR(50)
);
INSERT INTO ##table1 (
	  Id
	, text1
	, text2
)
VALUES 
	  (1, 'This is a string to test against another string', 'This is a second string')
	, (2, 'a', 'b')
	, (3, 'b', 'a')
	, (4, 'abc', 'cba')
	, (5, 'abc', 'xyz')
	, (6, '', '123')
	, (7, '321', '')
	, (8, '', '')
	, (9, 'Twinkies', 'Twinkies')
	, (10, 'DingDongs', 'dingdongs')
	, (11, 'Smith', 'Smyth')
	, (12,'John', 'Patrick')
	, (13,'John', 'Johnathon')

SELECT
	text1 
	, CASE
		WHEN text1 = text2 THEN 'MATCH'
		WHEN text1 < text2 THEN 'LT'
		WHEN text1 > text2 THEN 'GT'
		ELSE NULL
		END
	, text2
	, SOUNDEX(text1) AS soundex1
	, SOUNDEX(text2) AS soundex2
	, DIFFERENCE(text1, text2) AS difference1and2
FROM ##table1

Text1	Equality	Text2	Soundex1	Soundex2	Difference
This is a string to test against another string	GT	This is a second string	T200	T200	4
a	LT	b	A000	B000	3
b	GT	a	B000	A000	3
abc	LT	cba	A120	C100	2
abc	LT	xyz	A120	X200	2
	LT	123	0000	0000	4
321	GT		0000	0000	4
	MATCH		0000	0000	4
Twinkies	MATCH	Twinkies	T522	T522	4
DingDongs	MATCH	dingdongs	D523	D523	4
Smith	LT	Smyth	S530	S530	4
John	LT	Patrick	J500	P362	0
John	LT	Johnathon	J500	J535	2

You will notice that I included columns for SOUNDEX and DIFFERENCE. Both of these are something I had never used before so I spent some time trying to understand them. Simple explanation on SOUNDEX is that it is an algorithm that a string is passed into and returns a character + 3 number respresentation of that string. The complicated answer is the algorithm itself.

Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.

Replace consonants with digits as follows (after the first letter):

b, f, p, v → 1

c, g, j, k, q, s, x, z → 2

d, t → 3

l → 4

m, n → 5

r → 6

If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter

Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

I will refer to a wiki article if you want the history and variants of SOUNDEX.

The second function, DIFFERENCE, is a count of how many of the SOUNDEX values match each other. It takes 2 strings and returns a value 0-4.

Where these methods come in handy is trying to provide similiar matches to a simple search string. For example, someone searches for the last name smith but you also want to return names that sound like smith. Such as smyth, smithey, or smythe. You will also get values such as shinoda, sneed, sandu, sandy, smith-mcneal, and sante-hunter.

Written on April 9, 2015