MSSQL 2000 to PostgreSQL 8 Migration Project

The following are notes on a migration from Microsoft SQL Server 2000 to PostgreSQL 8. The old database has 60 tables, some with a few million rows, and about 7 GB of data in total. There were several issues to deal with:

  • The order of the columns in the tables isn't consistent between the two schemas. The old database has had columns added over time, while the new database schema was created from the most current SQL scripts.
  • The data exported using bcp will need reformatting to be imported using psql.
  • Fields with NULL values will need special consideration. The bcp utility is not consistent in the use of 0x00 (the null character) to indicate a null value.
  • The MSSQL database has a mix of character encodings; the PostgreSQL database uses UTF8.
  • The import must be automated so it can be performed after the existing database is taken offline, and before the new one is put into production. This ensures that all user information is consistent.
  • The data will not be imported in a single transaction. Therefore, the order of the import must be correct for the foreign key relationships that exist.
  • The indexes and sequences of the new database must be set. (TODO)

The MSSQL database is hosted on a Windows server in a remote facility. I can open a VNC connection through an SSH tunnel, but the responsiveness is unacceptable. Instead, I use the Cygwin port of the OpenSSH daemon, running as a service, and providing a bash shell.

The first step in preparing to export the data is to get a list of table names. I used the isql utility to connect to the server:

    isql -Q "select name from sysobjects where type = 'U'" \
        -S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD \
        | grep -Eo 'tbl_[a-z_]+' \
        | sed -e 's/tbl_//g' \
        | sort > tables

The result is a file containing one table name (with the tbl_ prefix removed) per line. I copied the file to tables_to_export and removed the tables that did not need to be exported. Incidentally, I don't know whether the Microsoft documentation on these utilities will remain accessible in the future, so to be safe I've created PDF archives of the relevant MSDN pages. The following bit of shell script loops over each line in the list of tables to be exported:

    TEMPFILE=`mktemp`
    for T in `cat tables_to_export`; do
        bcp "tbl_${T}" format $TEMPFILE -f "${EXPORT_DIR}/${T}.fmt" -c -k \
            -S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD
        bcp "tbl_${T}" out "${EXPORT_DIR}/${T}" -o "${EXPORT_DIR}/${T}.out" -c -k \
            -r '_~R~_' -t '_~F~_' \
            -S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD
    done

There are two runs of bcp for each table: the first documents the format of the table being exported, and the second performs the actual data export. The -c and -k switches specify character data and keep null values, rather than exporting column defaults. I selected _~R~_ and _~F~_ as the row and field terminators, rather than \n and \t, because the data itself contains newlines and tabs. After completing the export, I created a tarball and transferred it to an administration shell account on the same network as the new server. This little detail is important -- the new PostgreSQL server does not have file system access to the data files. See the documentation on the COPY command for more information.
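To make the terminator choice concrete, here is a tiny sketch with invented sample data (not the real export) showing why the custom terminators survive embedded newlines and tabs:

```shell
# Hypothetical sample of the intermediate format: two rows, fields
# separated by _~F~_ and rows terminated by _~R~_, with literal newlines
# and tabs inside the field values.
printf 'first line\nsecond line_~F~_plain field_~R~_' >  sample.dat
printf 'col a\tcol b_~F~_last field_~R~_'             >> sample.dat

# The terminators never collide with the embedded characters, so the
# row count can be recovered by matching the row terminator alone.
grep -o '_~R~_' sample.dat | wc -l
```

Had \n been the row terminator, the embedded newline in the first field would have split one row into two.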

The tarball is extracted to the local /tmp directory, rather than the NFS-mounted home directory for speed -- an important consideration when processing large files. The first task is to escape the embedded backslash, new line, and tab characters, convert the nulls, and then strip out the row and field terminators. A fairly straightforward Gawk program does the work:

    BEGIN {
        RS = "_~R~_"
        FS = "_~F~_"
        OFS = "\t"
    }
    {
        gsub(/\\/, "\\\\")
        gsub(/\000/, "\\N")
        gsub(/\r\n/, "\\n")
        gsub(/\n/, "\\n")
        gsub(/\t/, "\\t")
        $1 = $1
        print $0
    }

The data in some files required extra massaging, such as inserting missing null values or removing obsolete fields. For each data file that requires extra processing, I created a file of the same name containing additional gawk rules. The processing script checks for the extra rule files and applies them when found. The new files are created with .tab extensions, indicating that the file is ready for import.
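The driver that applies the extra rules might look something like the sketch below. The file and directory names are my own stand-ins (and plain awk stands in for gawk so the sketch runs anywhere); the one real subtlety is that the table-specific rules must precede the base program on the command line, so field fixes happen before the base rules print the record.

```shell
# Stand-in inputs so the sketch is self-contained; the real files come
# from the export and the reformatting program above.
mkdir -p data extra_rules
echo foo > tables_to_export
printf 'old\tvalue\n' > data/foo
printf '{ print $0 }\n' > base.awk                   # stand-in base program
printf '{ gsub(/old/, "new") }\n' > extra_rules/foo  # stand-in extra rule

# Apply the base rules plus any table-specific rules file of the same
# name, producing a .tab file ready for import.
for T in `cat tables_to_export`; do
    if [ -f "extra_rules/${T}" ]; then
        awk -f "extra_rules/${T}" -f base.awk "data/${T}" > "data/${T}.tab"
    else
        awk -f base.awk "data/${T}" > "data/${T}.tab"
    fi
done
```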

The result is a set of data files that can be used by psql. However, the column order must still be dealt with. For that, I wrote a script to loop over all of the exported table names (bundled into the tarball) and compare each table's column order to the new database:

    # Create a column list using bcp format files
    for T in `cat ${DBOLD_DIR}/tables_to_export`; do
        gawk 'BEGIN { FS="[ ]+" } /[ ]+/ { print $7 }' \
            "${DBOLD_DATADIR}/${T}.fmt" > "${DBOLD_WORK}/${T}.columns"
    done

    # Create a list of all the tables in the new database
    psql -c '\dt *' | gawk 'BEGIN { FS=" " } /public/ { print $3 }' \
        | sed -e 's/^ll_//' > $DBNEW_TABLES

    # Create a column list for each table in the new database
    for T in `cat $DBNEW_TABLES`; do
        psql -c "\\d ll_${T}" \
            | gawk 'BEGIN { FS=" " } /Column/ { next } /^[ ][a-z]/ { print $1 }' \
            > "${DBNEW_WORK}/${T}.columns"
    done

    # Compare each exported table to the new table
    for T in `cat ${DBOLD_DIR}/tables_to_export`; do
        echo "Table: $T"
        if [ ! -f ${DBNEW_WORK}/${T}.columns ]; then
            echo "Does not exist in new database."
        else
            # Quietly check for differences
            diff -w -i -q "${DBNEW_WORK}/${T}.columns" "${DBOLD_WORK}/${T}.columns" > /dev/null
            if [ $? != 0 ]; then
                # Show differences side-by-side
                diff -w -i -y "${DBNEW_WORK}/${T}.columns" "${DBOLD_WORK}/${T}.columns"
            else
                echo "No differences."
            fi
        fi
        echo
    done

I went to the effort of programmatically comparing the old tables to the new tables for two reasons: 1) it's highly likely that I would have made errors doing it by visual inspection, and 2) the resulting output is used to create the column lists consumed during the import. These column-list files are placed into a directory that is checked by the import script. If a file is found matching the name of the table, the column list is used instead of a straight copy.
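A column-list file is just a one-line, comma-separated list of column names that gets spliced into the COPY statement. Here is a sketch with an invented table and column names showing how the import loop consumes it:

```shell
# Hypothetical column-list file for a table whose exported column order
# differs from the new schema (table and column names are invented).
mkdir -p import_info
printf 'id, name, created_at\n' > import_info/users

# The import loop reads the first line and splices it into COPY.
Q=`head -n 1 import_info/users`
echo "COPY ll_users (${Q}) FROM STDIN"
```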

I don't yet have a tool for examining the SQL scripts, and determining the dependency order. Fortunately, this only needs to be done once (for this project). I created a file containing the table names in the order that they must be imported.
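For the record, the ordering could be derived mechanically from the foreign keys using tsort from coreutils. This is not what I did (I ordered the file by hand), and the table names below are invented, but the idea is: feed tsort one "referenced-table referencing-table" pair per foreign key, and it prints a valid load order.

```shell
# Invented foreign-key pairs: parents first, children second.
cat > fk_pairs <<'EOF'
users orders
users addresses
orders order_items
EOF

# tsort prints the tables in a dependency-respecting order, so every
# referenced table is loaded before the tables that reference it.
tsort fk_pairs
```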

Finally, the data can be imported!

    for T in `cat ${DB_IMPORT_INFO_DIR}/table_import_order`; do
        echo "Processing $T"
        # If the column order is special, use the column list
        if [ -f ${DB_IMPORT_INFO_DIR}/${T} ]; then
            Q=`head -n 1 ${DB_IMPORT_INFO_DIR}/${T}`
            psql -c "COPY ll_${T} (${Q}) FROM STDIN" < ${DB_DATA_DIR}/${T}.tab
        # Otherwise, perform a straight copy
        else
            psql -c "COPY ll_${T} FROM STDIN" < ${DB_DATA_DIR}/${T}.tab
        fi
    done

I should note that I set the PGCLIENTENCODING environment variable to LATIN1 in the import script. This makes the conversion to UTF8 explicit and eliminates any byte sequence errors. Initially, I tried setting \encoding LATIN1 in the ~/.psqlrc file, as well as specifying -v encoding=LATIN1 on the psql command line -- neither of these methods caused the correct character encoding to be set during the import.
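The working approach amounts to a single line near the top of the import script; every psql invocation started afterwards inherits it and converts its LATIN1 input to the server's UTF8:

```shell
# Declare the client-side encoding once for the whole import run; psql
# reads PGCLIENTENCODING and performs the LATIN1-to-UTF8 conversion.
export PGCLIENTENCODING=LATIN1
```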

As indicated above, the SQL to set the indexes and sequences needs to be completed. Other than that, the project is nearly complete.
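One way the sequence fix-up might be generated (an untested sketch; the ll_ prefix matches the import, but the table name, key column, and sequence name here are hypothetical) is to point each sequence at the current maximum key with setval:

```shell
# Hypothetical table and key column; in practice these would come from a
# list like table_import_order.
T=users
K=id

# Emit the SQL that advances the sequence past the imported rows.
echo "SELECT setval('ll_${T}_${K}_seq', (SELECT max(${K}) FROM ll_${T}));"
```

The emitted statements could then be piped into psql once the import has finished.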