
Database setup for ContentBox using Amazon RDS PostgreSQL

I've been working most of the weekend on getting ContentBox working with a PostgreSQL database. I'm using Amazon RDS on this project, which is new to me, and SQLPro Studio over an Amazon VPN connection (IPsec between pfSense and Amazon VPC -- more on that in another post). While trying to get Hibernate ORM to create the tables properly, I wanted to drop the database and recreate it. Unfortunately, when I tried to do that, I got this error message:

ERROR: database "cbblog" is being accessed by other users
DETAIL: There are 3 other sessions using the database.

Sup, bro?! It's just me, and I shut Tomcat down, so who's "using the database"? While connected with the simpler command-line PostgreSQL client, psql, I can list them out with this statement:

SELECT pid, usename AS username, application_name AS appname, datname AS dbname, client_addr
FROM pg_stat_activity
WHERE NOT pid = pg_backend_pid();

That's helpful. It was me running SQLPro Studio; it makes a bunch of connections when rendering the object tree. If I just close the app, the connections go away. If it weren't something local, I could aggressively kick them off like this:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'cbblog';

With the blocking connections gone, I can prep a clean database for ContentBox to use:

-- Make sure it's really gone
SELECT datname AS dbname
FROM pg_database
WHERE datistemplate = false;
-- Get rid of the role for a fresh start
DROP ROLE cbblog;
-- Set up the new user and its database
CREATE ROLE cbblog WITH LOGIN PASSWORD 'secret';
GRANT cbblog TO postgres;
CREATE DATABASE cbblog OWNER cbblog;

No, that's not the actual password. Hopefully I can get ContentBox to play nice with PostgreSQL so I can migrate my crusty old blog. It'll be nice to get off Open BlueDragon and Tomcat 6. ;)

In retrospect, this is a silly post. Of course my SQL client is going to cause the database it's using to be in use. But maybe the diagnostic tricks will be helpful to the next sucker. I immediately assumed this was an Amazon RDS problem. Turns out, it was a me problem.

Resetting PostgreSQL Sequences After Import

In the last blog post about migrating from MSSQL to PostgreSQL, I left one detail for later -- resetting all the sequences after importing data. I created two PL/pgSQL functions to do the work of selecting the maximum value in the primary key and updating the sequence.

CREATE OR REPLACE FUNCTION get_pk_max(table_name VARCHAR(100), pk_name VARCHAR(100))
RETURNS INTEGER AS $$
DECLARE
    i INTEGER;
    q TEXT;
BEGIN
    q := 'SELECT max(' || pk_name || ') FROM ' || table_name;
    EXECUTE q INTO i;
    IF i IS NULL THEN
        RETURN 0;
    ELSE
        RETURN i;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION reset_seq(table_name VARCHAR(100), pk_name VARCHAR(100), seq_name VARCHAR(100))
RETURNS INTEGER AS $$
DECLARE
    i INTEGER;
BEGIN
    SELECT get_pk_max(table_name, pk_name) INTO i;
    RETURN setval(seq_name, i + 1, false);
END;
$$ LANGUAGE plpgsql;

The reset_seq() function is called from the PostgreSQL client, once for each table, by a shell script:

for T in `cat ${DB_INFO_DIR}/tables`; do
    if [ -f ${DB_INFO_DIR}/${T}.sequence ]; then
        echo "Processing $T"
        PK=`gawk 'BEGIN { RS="\000"; FS="\n" }; { print $1 }' ${DB_INFO_DIR}/${T}.sequence`
        SEQ=`gawk 'BEGIN { RS="\000"; FS="\n" }; { print $2 }' ${DB_INFO_DIR}/${T}.sequence`
        psql -c "SELECT reset_seq('ll_${T}', '${PK}', '${SEQ}');"
    else
        echo "No sequence information for $T"
    fi
done

This loops over every table name in the new database. The table name, prefixed with ll_, is the first argument to reset_seq(). The second and third arguments are read from a file containing two lines; the first line contains the name of the table's primary key, and the second line contains the name of the sequence.
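To make the file layout concrete, here is a hypothetical .sequence file for a table named users (the table, column, and sequence names are invented for illustration), pulled apart the same way the loop above does it:

```shell
# hypothetical .sequence file: line 1 is the primary key column,
# line 2 is the sequence name
printf 'user_id\nll_users_user_id_seq\n' > users.sequence

# same extraction the loop performs (awk standing in for gawk)
PK=$(awk 'BEGIN { RS="\000"; FS="\n" }; { print $1 }' users.sequence)
SEQ=$(awk 'BEGIN { RS="\000"; FS="\n" }; { print $2 }' users.sequence)
echo "$PK $SEQ"   # prints: user_id ll_users_user_id_seq
```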

The shell script executes very quickly, resetting all the sequences in the database in a few seconds.

MSSQL 2000 to PostgreSQL 8 Migration Project

The following are notes on a migration from Microsoft SQL Server 2000 to PostgreSQL 8. The old database has 60 tables, some with a few million rows, and about 7 GB of data in total. There were several issues to deal with:

  • The order of the columns in the tables of each schema isn't consistent. The older database has had columns added over time, while the new database schema has been created from the most current SQL scripts.
  • The data exported using bcp will need reformatting to be imported using psql.
  • Fields with NULL values will need special consideration. The bcp utility is not consistent in the use of 0x00 (the null character) to indicate a null value.
  • The MSSQL database has a mix of character encodings; the PostgreSQL database uses UTF8.
  • The import must be automated so it can be performed after the existing database is taken offline, and before the new one is put into production. This ensures that all user information is consistent.
  • The data will not be imported in a single transaction. Therefore, the order of the import must be correct for the foreign key relationships that exist.
  • The indexes and sequences of the new database must be set. (TODO)

The MSSQL database is hosted on a Windows server in a remote facility. I can open a VNC connection through an SSH tunnel, but the responsiveness is unacceptable. Instead, I use the Cygwin port of the OpenSSH daemon, running as a service, and providing a bash shell.

The first step in preparing to export the data is to get a list of table names. I used the isql utility to connect to the server:

isql -Q "select name from sysobjects where type = 'U'" \
    -S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD \
    | grep -Eo 'tbl_[a-z_]+' \
    | sed -e 's/tbl_//g' \
    | sort > tables

The result is a file containing one table name (with the tbl_ prefix removed) per line. I copied the file as tables_to_export, and removed the tables that were not necessary to export. Incidentally, I don't know if the Microsoft documentation on these utilities will be accessible in the future. To be safe, I've created PDF archives of the pages on MSDN. The following bit of shell script loops over each line in the list of tables to be exported:

TEMPFILE=`mktemp`
for T in `cat tables_to_export`; do
    bcp "tbl_${T}" format $TEMPFILE -f "${EXPORT_DIR}/${T}.fmt" -c -k \
        -S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD
    bcp "tbl_${T}" out "${EXPORT_DIR}/${T}" -o "${EXPORT_DIR}/${T}.out" -c -k \
        -r '_~R~_' -t '_~F~_' \
        -S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD
done

There are two runs of bcp for each table; the first documents the format of the table being exported, and the second performs the actual data export. The command line switches -c and -k specify character data, and to keep null values, rather than export with column defaults. I selected _~R~_ and _~F~_ for row and field terminators, rather than \n and \t, because the data itself contains new lines and tabs. After completing the export, I created a tarball and transferred it to an administration shell account on the same network as the new server. This little detail is important -- the new PostgreSQL server does not have file system access to the data files. See the documentation on the COPY command for more information.

The tarball is extracted to the local /tmp directory, rather than the NFS-mounted home directory, for speed -- an important consideration when processing large files. The first task is to escape the embedded backslash, newline, and tab characters, convert the nulls, and then strip out the row and field terminators. A fairly straightforward Gawk program does the work:

BEGIN {
    RS = "_~R~_"
    FS = "_~F~_"
    OFS = "\t"
}
{
    gsub(/\\/, "\\\\")
    gsub(/\000/, "\\N")
    gsub(/\r\n/, "\\n")
    gsub(/\n/, "\\n")
    gsub(/\t/, "\\t")
    $1 = $1
    print $0
}

The data in some files required extra massaging, such as inserting missing null values or removing obsolete fields. For each data file that requires extra processing, I created a file of the same name containing additional gawk rules. The processing script checks for the extra rule files and applies them when found. The new files are created with .tab extensions, indicating that the file is ready for import.
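To see the transformation in action, here is a tiny made-up export: one row of two fields, where the second field contains an embedded newline. The awk program (the same rules as above) folds it into a single tab-separated line with the newline escaped:

```shell
# hypothetical one-row export using the _~F~_ / _~R~_ terminators
printf 'hello\tworld_~F~_line one\nline two_~R~_' > sample.raw

awk 'BEGIN { RS = "_~R~_"; FS = "_~F~_"; OFS = "\t" }
{
    gsub(/\\/, "\\\\")
    gsub(/\000/, "\\N")
    gsub(/\r\n/, "\\n")
    gsub(/\n/, "\\n")
    gsub(/\t/, "\\t")
    $1 = $1
    print $0
}' sample.raw > sample.tab

# sample.tab is now a single line: fields joined by a real tab,
# the embedded newline and tab replaced with literal \n and \t
cat sample.tab
```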

The result is a set of data files that can be used by psql. However, the order of the columns must be dealt with. For that, I wrote a script to loop over all of the exported table names (bundled into the tarball), and compare the column order to the new database:

# Create a column list using bcp format files
for T in `cat ${DBOLD_DIR}/tables_to_export`; do
    gawk 'BEGIN { FS="[ ]+" } /[ ]+/ { print $7 }' \
        "${DBOLD_DATADIR}/${T}.fmt" > "${DBOLD_WORK}/${T}.columns"
done

# Create a list of all the tables in the new database
psql -c '\dt *' | gawk 'BEGIN { FS=" " } /public/ { print $3 }' \
    | sed -e 's/^ll_//' > $DBNEW_TABLES

# Create a column list for each table in the new database
for T in `cat $DBNEW_TABLES`; do
    psql -c "\\d ll_${T}" \
        | gawk 'BEGIN { FS=" " } /Column/ { next } /^[ ][a-z]/ { print $1 }' \
        > "${DBNEW_WORK}/${T}.columns"
done

# Compare each exported table to the new table
for T in `cat ${DBOLD_DIR}/tables_to_export`; do
    echo "Table: $T"
    if [ ! -f ${DBNEW_DIR}/${T}.columns ]; then
        echo "Does not exist in new database."
    else
        # Quietly check for differences
        diff -w -i -q "${DBNEW_DIR}/${T}.columns" "${DBOLD_DIR}/${T}.columns" > /dev/null
        if [ $? != 0 ]; then
            # Show differences side-by-side
            diff -w -i -y "${DBNEW_DIR}/${T}.columns" "${DBOLD_DIR}/${T}.columns"
        else
            echo "No differences."
        fi
    fi
    echo
done

I went to the effort to programmatically compare the old tables to the new tables for two reasons: 1) it's highly likely that I would have made errors doing it by visual inspection, and 2) the resulting output is used to create a column list used during the import of the data. These column list files are placed into a directory that will be checked by the import script. If a file is found matching the name of the table, the column list is used instead of a straight copy.
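Producing a column-list file from a corrected .columns file is a one-liner; paste can flatten one column name per line into the single comma-separated line the import script reads. A sketch with hypothetical column names:

```shell
# hypothetical .columns file: one column name per line
printf 'user_id\nusername\ncreated_at\n' > users.columns

# flatten to the single comma separated line the import loop
# later reads with head -n 1
paste -sd, users.columns > users
cat users   # prints: user_id,username,created_at
```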

I don't yet have a tool for examining the SQL scripts, and determining the dependency order. Fortunately, this only needs to be done once (for this project). I created a file containing the table names in the order that they must be imported.
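If this needed to be repeated, one lightweight option (not what I used here) is coreutils tsort: feed it "parent child" pairs extracted from the FOREIGN KEY clauses, and it emits a valid import order. A sketch, with invented table names and a hand-made dependency file:

```shell
# deps: each line is "parent child" -- the child table has a
# foreign key referencing the parent, so parent must import first
printf 'users orders\nusers sessions\norders order_items\n' > deps

# topological sort: output is a valid table_import_order
tsort deps > table_import_order
cat table_import_order
```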

Finally, the data can be imported!

for T in `cat ${DB_IMPORT_INFO_DIR}/table_import_order`; do
    echo "Processing $T"
    # If the column order is special, use the column list
    if [ -f ${DB_IMPORT_INFO_DIR}/${T} ]; then
        Q=`head -n 1 ${DB_IMPORT_INFO_DIR}/${T}`
        psql -c "COPY ll_${T} (${Q}) FROM STDIN" < ${DB_DATA_DIR}/${T}.tab
    # Otherwise, perform a straight copy
    else
        psql -c "COPY ll_${T} FROM STDIN" < ${DB_DATA_DIR}/${T}.tab
    fi
done

I should note that I set the PGCLIENTENCODING environment variable to LATIN1 in the import script. This makes the conversion to UTF8 explicit, and eliminates any byte sequence errors. Initially, I tried setting \encoding LATIN1 in the ~/.psqlrc file, as well as specifying -v encoding=LATIN1 on the psql command line -- neither of these methods caused the correct character encoding to be set during the import.
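The conversion the server performs is equivalent to running the data through iconv yourself, which also makes for a quick sanity check that a file really is LATIN1 (the sample bytes here are invented):

```shell
# 0xE9 is "é" in LATIN1, but is an invalid byte sequence
# on its own in UTF-8
printf 'caf\351\n' > sample.latin1

# the explicit conversion; fails loudly on bytes that are not LATIN1
iconv -f LATIN1 -t UTF-8 sample.latin1 > sample.utf8
```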

As indicated above, the SQL to set the indexes and sequences needs to be completed. Other than that, the project is nearly complete.

PostgreSQL Autocommit

This is a note to myself, and any other PostgreSQL user that might also be bitten by this circumstance. I like to have my psql client configured without AUTOCOMMIT enabled. This allows me to run long scripts that don't change anything unless it all succeeds. So, I set it to false in my ~/.psqlrc file.
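For reference, the setting is a single line; appending it to ~/.psqlrc makes every new interactive session start with autocommit off:

```shell
# append to ~/.psqlrc; psql reads this file at startup
printf '%s\n' '\set AUTOCOMMIT off' >> ~/.psqlrc
```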

Recently I was making some changes to a database table and got sidetracked before committing the ALTER TABLE. Hours later, I wondered why my web app was completely unresponsive for some requests. It turns out that the JDBC driver was locked (and doesn't time out) because of my open psql connection. D'oh! How easy it is to overlook the simple things.