Example of SQL Server collation usage and problems

I was discussing with some developers this week why SQL Server collation mattered and what are typical things that can go wrong. This is a huge topic and is an often over looked feature of database and server design, especially when you want to build applications that run in multiple geographic locations and that should accept multiple language inputs. Being as I’ve worked extensively with the Finnish Swedish collation from my time working in Sweden, I have a reasonable level of understanding of some of the issues that can come up.

I wrote the following script to demonstrate an example of what can happen if you don’t get your collations correct. It’s probably the most common example I’ve seen over the years.

1. You have a SQL Server that is running one of the main Latin collations for English, be it UK, US, current windows collation or legacy SQL Server ones.

2. You have data from multiple languages stored in unicode columns such as nvarchar

3. You sort or search on the data and those people who have used data from non English languages that have special or extra characters in the alphabet, do not get the results that they expect.

This script only shows a fraction of the things that can actually go wrong, but it provides a simple demonstration of sorts and searches producing unexpected results. It creates its own database and tables on any test server you may choose to run it on (and then deletes them afterwards). It will work as a demo on any server as long as it’s not already running a Finnish Swedish collation.


use master
go

--i'm creating a non-default latin collation as per the legacy product I was looking at
--in this case however the results are the same as if you're running the more common default windows server collation 'Latin1_General_CI_AS'
create database collation_test collate SQL_Latin1_General_CP1_CI_AS
go

use collation_test
go

select SERVERPROPERTY('collation') as 'server collation'

select collation_name as database_collation from sys.databases where name = 'collation_test'

if exists
(
select name from sys.objects where name = 'collation_test' and type = 'u'
)

begin
 drop table collation_test
end

create table collation_test
(
test nvarchar(10)
)

set nocount on
go

insert collation_test (test) values ('aaaa')
insert collation_test (test) values ('ääää')
insert collation_test (test) values ('åååå')
insert collation_test (test) values ('öööö')
insert collation_test (test) values ('bbbb')
insert collation_test (test) values ('zzzz')

set nocount off
go

print 'select the results from the tables in differing collations'
print 'in Swedish alphabet the characters åäö follow after Z as characters 27 to 29'
print''
select test as "order in latin collation"
from collation_test
order by test

select test as "order in Finnish_Swedish_CI_AS collation"
from collation_test
order by test collate Finnish_Swedish_CI_AS

print 'do searches on the table'
print ''

select test as "search in latin collation"
from collation_test
where test > 'b'

select test as "search in Finnish_Swedish_CI_AS collation"
from collation_test
where test > 'b'
collate Finnish_Swedish_CI_AS

--clean up
use master
go
drop database collation_test
go

The bottom line is that if you want to do collations and mixed languages properly, you need to think about this stuff carefully at the design stage.