Wednesday, December 5, 2012

Don't concatenate URL parts in your code directly

I know this might seem very basic, but I often see the following constructs in the code:

var url = base_url + url_part;

I recently found a whole bunch of these in my own code :O

This can easily break, since one doesn't know if the variables contain starting or trailing slashes. Relying on their presence or absence can be dangerous, especially if the parts come from configuration files or any other source where they were entered by the user.

Consider the following cases:
base_url url_part base_url + url_part
1. http://host/abc/ /def/ http://host/abc//def/
2. http://host/abc /def/ http://host/abc/def/
3. http://host/abc/ def/ http://host/abc/def/
4. http://host/abc def/ http://host/abcdef/

As you see, case 4 is completely broken and case 1 is possibly broken depending on your routing mechanism.

Instead of plain concatenation I would recommend either using whatever method is available in your language, or write your own url_combine(part1, part2, ...) method. This method should insert missing slashes and remove extraneous ones (consider also remove padding spaces, if there are any).

As a result you will be able to combine the URL parts like this:

var url = url_combine(base_url, url_part);

All four of the above cases would result in "http://host/abc/def/" assigned to url. This would keep the code protected against the modification of the URL parts upstream.

The same approach can be applied to combining parts of local file system paths.

Wednesday, March 28, 2012

Performance Quirks of MySQL Information Schema

I decided to start documenting my experiences related to what I do. And what I do, is all kinds of tech related stuff at Enkora. I was considering starting a blog for a while, but somehow I could never get around to it. I was never into writing and always hated literature classes, but one's gotta learn!

Anyways, here goes the first post!

MySQL tends to provide an infinite source of entertainment. This time it was the information schema. Making queries against the information schema can be incredibly slow, and our application is of course caching them all. However, once the cache expires, it could take up to 30 seconds or more to repopulate the cache.

First, I found out that one needs to specify the table schema very explicitly, since if MySQL decides to go through all the schemas, it will take a while. Adding a schema table condition to all the tables did help, however there was one remaining query that stayed slow.

Dima Tisnek (my colleague and a local Python guru) was doing occasional testing/development against his own copy of the system, which meant that the cache was pretty empty most of the time. This annoyed him greatly. He asked me yet again to do something about the terrible waiting times and suggested more caching.

Here is the offending query:

SELECT
 r.table_name,
 r.column_name,
 r.referenced_table_name,
 r.referenced_column_name,
 c.column_key = 'PRI' as is_primary
FROM information_schema.KEY_COLUMN_USAGE r
LEFT JOIN information_schema.COLUMNS c
 USING (table_name, column_name, table_schema)
WHERE r.table_schema = 'some_schema' AND
 c.table_schema = 'some_schema' AND
 r.referenced_table_name = 'some_table'

After running EXPLAIN it became clear that the second table is joined with no keys! Apparently MySQL cannot use indexes for joins of information schema tables. This means that if one has a lot of table schemas running on one MySQL instance, the query would take a while.

I did not manage to make MySQL use the correct key (or actually any key) for information_schema.COLUMNS table and almost gave up on it, when it came to me: temporary tables! The solution was to copy the relevant rows of the  information_schema.COLUMNS into a temporary table, and join information_schema.KEY_COLUMN_USAGE with the temporary table.

Here is the final query:

DROP TEMPORARY TABLE IF EXISTS information_schema_tmp_columns;

CREATE TEMPORARY TABLE information_schema_tmp_columns
SELECT * FROM information_schema.COLUMNS c
WHERE c.table_schema = :table_schema;

SELECT
 r.table_name,
 r.column_name,
 r.referenced_table_name,
 r.referenced_column_name,
 c.column_key = 'PRI' as is_primary
FROM information_schema.KEY_COLUMN_USAGE r
LEFT JOIN information_schema_tmp_columns c
 USING (table_name, column_name, table_schema)
WHERE r.table_schema = 'some_schema' AND
 r.referenced_table_name = 'some_table'

This SQL works orders of magnitude faster! Just because it seems like there's a bug in the way MySQL is joining the information schema tables.

For the record I was using MySQL 5.5.14 .