Lessons learned porting from SQL Server to SQLite

After a sudden reduction in force at the day job I inherited a crufty old reporting system from a team of DBAs. We're a company full of Linux and a few legacy Solaris systems but the DBAs who built this thing almost 15 years ago must have insisted on SQL Server 2005 so I was the proud owner of one of the few Windows servers on our side of the business. After three months of wrangling with the report I was able to get it off of the Microsoft stack just before the Windows sysadmins decided it was time to pull the plug on this long forgotten and neglected Windows Server 2003 system. I chose Python and sqlite3 as the replacements for a mix of SQL Server queries and jobs and Visual Basic macros embedded in Excel templates for a bunch of reasons that aren't relevant here but I do have some notes on problems I ran into while moving everything over:

Collations

The default collation for SQL Server 2005 is something called SQL_Latin1_General_CP1_CI_AS. It is case insensitive. SQLite is case sensitive and the only way to change this is by declaring every text column as COLLATE NOCASE. Welcome to hell.

Clustered indexes

Some background information: in DBA speak a clustered index/table/column refers to the order of data on disk. If a table is clustered on an index or column it means that the underlying table on disk has its data in the collation order for that index or column and updates to the table will keep that order. SQL Server supports clustered indexes in a normal way. In SQLite there is no way to cluster on a given column. The underlying storage of the table does whatever it wants.

Unfortunately my reporting database relied heavily on SQL Server’s clustered indexes. Most of the data was in two tables with around 50 sparsely populated columns each, a single clustered index on the leading 7 columns or so and no primary keys or UNIQUE constraints. This sounds strange but the report was synthesizing data across multiple internal company databases whose data was populated by devices from many different vendors so it needed the flexibility while it was normalizing data from all of these sources. A typical query might have a WHERE clause that matched on the first 4 or 5 indexed columns and the remaining data would be read in a partial table scan. Thanks to the on-disk ordering provided by the clustering this was sequential, predictable I/O and SQL Server did a pretty good job with it, often beating or tying SQLite which was running on a modern i7 with an SSD.

The first, naive port over to SQLite had quite a few performance problems which I will get into shortly. The original database’s reliance on its single, giant clustered index thrashed the performance of virtually all queries queries across the entire run so I wound up rewriting many queries to use SQLite’s hidden rowid feature. Internally SQLite keeps a b-tree index to store and retrieve the physical rows on disk and the key to this index is exposed as the hidden rowid column. It is very fast to read and write queries keyed on the rowid and I made extensive use of it. As an example, here’s what a typical SQL Server query looked like:

UPDATE big_table
SET blade = tmp_table.blade, port = tmp_table.port
FROM big_table
INNER JOIN tmp_table
        ON cmts_name = tmp_table.deviceName
        AND snmp_index = tmp_table.IndexNum
        AND time_stamp = ?

This becomes two SQLite queries[1]:

SELECT blade, port, rowid
FROM big_table
INNER JOIN tmp_table
        ON cmts_name = tmp_table.deviceName
        AND snmp_index = tmp_table.IndexNum
        AND time_stamp = ?

and

UPDATE big_table
SET blade = ?, interface_port = ?
WHERE rowid = ?

Python object performance

Python's sqlite3 module has the ability to convert values to and from Python objects as they are read and written to the database. By default it ships with adapters for converting timestamps to the Python datetime class and converting arbitrary precision decimal types to Decimal objects. Both reading and writing adapters for both types benefitted from memoization decorators that cached the instantiation of those types. The increased memory usage was negligible for our data set because it had lots of identical values. Memoization successfully moved the bottleneck in some parts of the report such as loading data out of an Oracle database into SQLite from being limited on CPU usage in Python to being limited by the Oracle database's I/O speed.

One consequence of memoization and Python's reference types is that a single instance of a value will be used for every instance of the value. You must take care not to mutate that object in Python code or you will break every object and every future object that tries to reference that value. It's best to keep as much work in SQLite as possible as both a performance optimization and to minimize errors like this one.

The default adapter for the Python datetime class stores an ISO 8601 formatted string in the database. This string's lexicographical order gives sane results with sorts and inequalities but takes up lots of space on disk and prevents you from doing math on dates in queries. I replaced the default datetime adapter/converters with ones that used time.mktime to get UNIX timestamps out of a datetime and stored the resulting integer (I didn't have any fractional seconds in any timestamps). This let me replace the report's frequent usage of the DATEADD function with arithmetic on integer seconds in all but one case. For the one place where date math was required I implemented the function in Python, registered it as a function in SQLite and even got to take advantage of memoization on the Python side.

General performance issues

SQLite's query planner is not nearly as sophisticated as SQL Server's planner and most queries with large, multi-table joins had to be rewritten. The naive SQLite port of these queries would spin for hours and never complete. I had success with decomposing joins into a chain of smaller queries that would do a single join each and write their results to a temporary table, eventually ending with one last join to update the main table. The main data populating query in SQL Server joined five subquery tables onto a single large table in a ~100 line query. It was decomposed into four main temporary tables that held intermediate join results and several smaller temporary tables for the results of subquery processing which broke the giant query into manageable bits. You must take care to make sure that functions whose results depend on the order of arguments (such as coalesce) are kept in order across the smaller queries.

SQLite puts out a new release every few months. There's usually a few new features, minor performance optimizations and bug fixes in each release. Our report spends at least an hour and a half of its time in parts where SQLite is the bottleneck which is enough time to warrant keeping our own SQLite built locally and up to date. If you're on a Linux distribution that ships a few years old version of SQLite I would highly recommend you use something modern. Updating to a new version has always been painless, SQLite's code is well-tested and release management is conservative.

At some point I switched the project over to WAL logging. Although I no longer have the numbers I believe this was a win for us as it speeds up the initial bulk insert of data at the start of the report. Up until version 3.11 (released February 2016) WAL mode performed very poorly with large log and transaction sizes. Our logs for the bulk import are gigabytes in size requiring us to use the newer releases of SQLite and gave us another win from keeping SQLite up to date. The original design of the report carved its queries up into a series of about 40 high-level steps and I kept this design with the SQLite port. These steps gave us clean locations to commit the database which keeps the log size down and lets SQLite run its WAL checkpointing at regular intervals.

Database functions and features

SQL Server ships with tons of built-in functions and the original authors of the report made good use of them. I wrote a case-insensitive regular expression to match the functions and syntax that had to be changed and that helped a lot with fixing everything while porting the queries over. Here are some of the things that I had to change:

SUBSTRING

It's called substr in SQLite and there's no change in behavior.

LEN

It's called length in SQLite and there's no change in behavior. I'm kind of sad that SQLite was inconsistent with substr and length.

CHARINDEX

It's called instr in SQLite and the arguments are flipped. If you forget to flip the needle and haystack the query will still work but will return incorrect results. This was a huge pain.

RIGHT and LEFT

In SQL Server RIGHT(string, N) returns the rightmost N characters of a string. I replaced this with substr(string, -N) in SQLite. For LEFT, just use substr and keep the same N argument.

SELECT…INTO

This syntax doesn't exist in SQLite but it's not too hard to change it to INSERT INTO…SELECT.

CONVERT

SQL Server's CONVERT(type, value) is cast(value AS type) in SQLite.

ISNULL

This is just COALESCE with two values. Replace it with COALESCE in SQLite.

TOP 10

SQL Server doesn't have the normal LIMIT functionality yet (or at least it didn't in SQL Server 2005). Convert these to a LIMIT, it's easy to match in your regex.

INSERT

The INTO after INSERT is optional in SQL Server but required in SQLite. Doesn't hurt to throw insert\s+(?!into) into your regex to catch this.

+ (string concatenation)

Strings are concatenated with the SQL standard || in SQLite. There were enough string concatenations in the SQL Server code that it made sense to put + in my regex.

Temporary tables

SQL Server has a neat syntax for creating temporary tables: prepend the table name with a #. You can create temporary tables explicitly with it (CREATE TABLE #asdf) or implicitly (INSERT INTO #asdf SELECT …). SQLite requires all temporary tables to be explicitly declared with CREATE TEMPORARY TABLE asdf

. If you have any text fields don't forget to declare them as COLLATE NOCASE in the temporary definitions as well! I had # in my regex and that worked well to find all the temporary table declarations.

Numeric types

SQL Server lets you declare columns as a numeric type and choose a precision and scale. See the docs for a thorough explanation. The database will enforce the precision by rounding numbers if you insert them into a column with a smaller precision. SQLite supports a NUMERIC type that preserves precision but does not allow fine-grained control over rounding like SQL Server does. The rules for the NUMERIC types are documented here. If you are not familiar with SQLite's type affinities I strongly urge you to read that documentation as SQLite's type system is unlike the ones in any other relational database.

openpyxl

The finished product is an Excel spreadsheet. It has enough formatting to look nice but nothing insanely complex. I used the openpyxl library to write the spreadsheet and it got the job done. The library supports lots of XLSX features but is not totally feature complete: development is ongoing and there are frequent new releases. There has been lots of minor API changes as new features are added so be prepared to read source code and the documentation to figure out how to use the library. StackOverflow is often out of date. The openpyxl code is not fast, taking over 45 minutes to create a ~20 megabyte spreadsheet. In all fairness to the openpyxl authors I did not attempt to optimize this code, once my naive attempt was working correctly it was good enough for my needs. If you have the lxml library installed it'll be autodetected by openpyxl and give you a measurable but minor speedup.

Conclusion and recommendations

At the end of the day the project worked: SQLite was a capable replacement for the much more sophisticated SQL Server and the aging Windows server hardware could be decommissioned. The Python code was vastly more reliable and debuggable than the mess of SQL Server jobs and Excel macros. I was happy with SQLite's performance and flexibility in this exotic situation and I'd use it again. If I had to recommend anything to make it easier I'd say that SQLite should support setting collation across an entire database. It is very tedious to catch every single text column to make sure it's functioning correctly. I agree that the SQL Server default is braindead but needing a custom collation is not so exotic that it can't be supported. I also think it would be neat if SQLite (which supports loading libraries at runtime to extend the database) shipped with an optional library with compatible implementations of functions available in 3rd party databases.


1. SQLite does have the ability to use a SELECT in an UPDATE/DELETE although not with the exact UPDATE…FROM schema that SQL Server has. I don’t remember exactly what issues I ran into with it, some may have been performance/functionality and lack of full JOIN support. I also believe I had to rewrite queries to get an accurate count of how many rows we were mutating which was an important feature for us. At any rate I did make use of rowid in many places to optimize deletes and updates. And I swear that the SQLite documentation had a whole bunch of stuff on semi-joins (the EXISTS operator) but it looks they've ripped that out of the documentation sometime in the past few months and you can just use regular joins everywhere.