SQL Azure just got some better pricing! Here are the details:

Database Size                      Price Per Database Per Month
0 to 100 MB                        Flat $4.995
Greater than 100 MB to 1 GB        Flat $9.99
Greater than 1 GB to 10 GB         $9.99 for first GB, $3.996 for each additional GB
Greater than 10 GB to 50 GB        $45.954 for first 10 GB, $1.998 for each additional GB
Greater than 50 GB to 150 GB       $125.874 for first 50 GB, $0.999 for each additional GB

Notice the new 0 to 100 MB tier – finally, a good option for small databases, utility databases, blogs, etc. Note, however, that when setting up a database, there is a maxsize property – currently, maxsize can be set to 1 GB, 5 GB, 10 GB, and then in 10 GB increments up to 150 GB. (The 1 GB and 5 GB sizes belong to the Web Edition; the larger sizes are part of the Business Edition. Both offer the same availability/scalability.) So, if a database has a maxsize of 1 GB, the reduced pricing will be in effect as long as the size stays at or below 100 MB. The price is calculated daily based on the peak size of the database for that day, and amortized over the month.

This is a breakdown of the changes from the previous pricing model:

GB     Previous Pricing    New Pricing    New Price/GB    Total % Decrease
5      $49.95              $25.99         $5.20           48%
10     $99.99              $45.99         $4.60           54%
25     $299.97             $75.99         $3.04           75%
50     $499.95*            $125.99        $2.52           75%
100    $499.95*            $175.99        $1.76           65%
150    $499.95*            $225.99        $1.51           55%

*Previous prices 50 GB and larger reflect the price cap of $499.95 announced December 12, 2011.

For more information, check out the Accounts and Billing in SQL Azure page. Also, my colleague Peter Laudati has a nice write-up on the changes!
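Since the tiers stack, working out a bill takes a little arithmetic. Here's a minimal C# sketch of the tier math (my own helper, not an official calculator; it uses the per-GB rates from the first table and assumes whole-GB increments, so results can differ by a few cents from the rounded figures in the comparison table):

public static decimal MonthlyPrice(double peakSizeGB)
{
    // Flat tiers
    if (peakSizeGB <= 0.1) return 4.995m;               // 0 to 100 MB
    if (peakSizeGB <= 1.0) return 9.99m;                // 100 MB to 1 GB

    // Per-GB tiers (assumption: billed in whole-GB increments)
    int gb = (int)Math.Ceiling(peakSizeGB);
    if (gb <= 10) return 9.99m + (gb - 1) * 3.996m;     // first GB + $3.996/GB
    if (gb <= 50) return 45.954m + (gb - 10) * 1.998m;  // first 10 GB + $1.998/GB
    return 125.874m + (gb - 50) * 0.999m;               // first 50 GB + $0.999/GB, up to 150 GB
}

And because the price is computed daily from that day's peak size, a single day's charge is roughly MonthlyPrice(peakSizeGB) divided by the number of days in the month.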
While building the back end to host our “Rock, Paper, Scissors in the cloud” game, we had to decide where and how to store the log files for the games that are played. In my last post, I explained a bit about the idea; in the game, log files are essential for tuning your bot to play effectively. Just to give a quick example of what the top of a log file might look like: in this match, I (bhitney) was playing a house team (HouseTeam4). Each match is made up of potentially thousands of games, with one game per line. From the game's perspective, we only care about the outcome of the entire match, not the individual games within the match – but we need to store the log for the user. There's no right or wrong answer for storing data – but like everything else, understanding the pros and cons is the key.

Azure Tables

We immediately ruled out storing each log file as a single Azure Table entity, simply because the log files are too big for the entity size limit. But what if we stored each game (each line of the log) in an Azure Table? After all, Azure Tables shine at large, unstructured data. This would be ideal because we could ask specific questions of the data – such as, “show me all games where…”. Additionally, size is really not a problem we'd face – tables can scale to TBs.

But storing individual games isn't a realistic option. The number of matches played in a 100-player round is 4,950 (each player plays every other player once: 100 × 99 / 2). Each match has around 2,000 games, so that means we'd be looking at 9,900,000 rows per round. At a few hundred milliseconds per insert, it would take almost a month to insert that kind of info. Even if we could get latency down to a blazing 10 ms, it would still take over a day to insert that amount of data. Cost-wise, it wouldn't be too bad: about $10 per round for the transaction costs.

Blob Storage

Blob storage is a good choice as a file repository. Latency-wise, we'd still be looking at 15 minutes per round. We almost went this route, but since we're using SQL Azure anyway for players/bots, it seemed excessive to insert metadata into SQL Azure and then the log files into Blob Storage. If we were playing with tens of thousands of people, that kind of scalability would be really important. But what about Azure Drives? We ruled drives out because we wanted the flexibility of multiple concurrent writers.

SQL Azure

Storing binary data in a database (even if that binary data is a text file) typically falls under the “guilty until proven innocent” rule. Meaning: assume it's a bad idea. Still, though, this is the option we decided to pursue. By using gzip compression on the text, the resulting binary was quite small and didn't add significant overhead to the original query used to insert the match results to begin with. Additionally, connection pooling makes those base inserts incredibly fast – much, much faster than blob/table storage.

One other side benefit to this approach is that we can serve the GZip stream without decompressing it. This saves processing power on the web server, and also takes a 100–200k log file down to typically less than 10k, saving a great deal of latency and bandwidth costs.

Here's a simple way to take some text (in our case, the log file) and get a byte array of the compressed data. This can then be inserted into a varbinary(max) (or deprecated image) column in a SQL database:

// requires: using System.IO; using System.IO.Compression; using System.Text;
public static byte[] Compress(string text)
{
    byte[] data = Encoding.UTF8.GetBytes(text);
    var stream = new MemoryStream();
    using (Stream ds = new GZipStream(stream, CompressionMode.Compress))
    {
        ds.Write(data, 0, data.Length);
    }
    // the GZipStream must be closed (here, by the using block) before
    // reading the result, or the compressed data won't be fully flushed
    byte[] compressed = stream.ToArray();
    return compressed;
}
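For completeness, here's a minimal sketch of pushing that byte array into a varbinary(max) column (the table and column names here are hypothetical, made up for illustration):

// requires: using System.Data; using System.Data.SqlClient;
// MatchLogs/MatchId/LogData are hypothetical names.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "INSERT INTO MatchLogs (MatchId, LogData) VALUES (@MatchId, @LogData)", conn))
{
    cmd.Parameters.AddWithValue("@MatchId", matchId);
    // length -1 means varbinary(max)
    cmd.Parameters.Add("@LogData", SqlDbType.VarBinary, -1).Value = Compress(logText);
    conn.Open();
    cmd.ExecuteNonQuery();
}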
And to get that string back:
public static string Decompress(byte[] compressedText)
{
    if (compressedText.Length == 0)
    {
        return string.Empty;
    }

    // gzip stores the uncompressed length (mod 2^32) in its last 4 bytes,
    // so we can size the output buffer exactly
    int msgLength = BitConverter.ToInt32(compressedText, compressedText.Length - 4);
    byte[] buffer = new byte[msgLength];

    using (MemoryStream ms = new MemoryStream(compressedText))
    using (GZipStream zip = new GZipStream(ms, CompressionMode.Decompress))
    {
        int read = 0;
        // Read may return fewer bytes than requested, so loop until done
        while (read < buffer.Length)
        {
            int bytes = zip.Read(buffer, read, buffer.Length - read);
            if (bytes == 0) break;
            read += bytes;
        }
        return Encoding.UTF8.GetString(buffer, 0, read);
    }
}
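A quick round trip to sanity-check the pair:

string log = "sample log text";   // stand-in for a real log file
byte[] packed = Compress(log);
System.Diagnostics.Debug.Assert(Decompress(packed) == log);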
In our case, though, we don’t really need to decompress the log file on the server, because we can let the client browser do that! We have an HTTP handler that serves the raw bytes, and quite simply it looks like:
context.Response.AddHeader("Content-Encoding", "gzip");
context.Response.ContentType = "text/plain";
context.Response.BinaryWrite(data.LogFileRaw); // the byte array
Naturally, the downside of this approach is that if a browser doesn’t accept GZip encoding, we don’t handle that gracefully. Fortunately it’s not 1993 anymore, so that’s not a major concern.
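If you did want to handle those clients gracefully, a sketch of the fallback might check the Accept-Encoding request header and decompress server-side (reusing the Decompress helper from above):

string acceptEncoding = context.Request.Headers["Accept-Encoding"] ?? string.Empty;
context.Response.ContentType = "text/plain";
if (acceptEncoding.Contains("gzip"))
{
    context.Response.AddHeader("Content-Encoding", "gzip");
    context.Response.BinaryWrite(data.LogFileRaw);         // serve the stored bytes as-is
}
else
{
    context.Response.Write(Decompress(data.LogFileRaw));   // decompress on the server
}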
I’ve done a number of talks lately on Worldmaps, and in side conversations and emails, people are often curious about the databases and about converting IP addresses to geographic locations. When you dive into using the data, there are a number of performance considerations, so I thought I’d share my input on these topics.

First up, the data. Worldmaps uses two databases for IP resolution. The primary/production database is Ip2Location. I’ve found this database to be very accurate. For development/demo purposes, I use IPinfoDB. I haven’t had much time to play with this database yet, but so far it seems accurate as well. The latter is free, whereas Ip2Location is not. In either case, the schema is nearly identical: the BeginIp and EndIp columns form a clustered primary key. In the case of IPinfoDB, there is no EndIp field (and it’s not really needed).

When performing a resolution, a string IP address is converted into a 64-bit integer (a sketch of that conversion appears after the sample queries below) and then used in searching the table. That’s why having a clustered key on BeginIp (and optionally EndIp) is crucial to performance. But it doesn’t stop there. The examples posted on the databases’ respective home pages are accurate and simple, but need to be refactored for performance. For example, to do a simple resolution on Ip2Location, according to their docs:
SELECT * FROM dbo.Ip2Location WHERE @IpNum BETWEEN BeginIp and EndIp
And for IPInfoDB:
SELECT TOP 1 * FROM IPInfoDB where BeginIp <= @IpNum ORDER BY BeginIp DESC
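Each query takes @IpNum, the 64-bit integer form of the address mentioned earlier. Here's a minimal sketch of that conversion (the helper is mine, not from either vendor's docs):

// Converts a dotted IPv4 address (e.g. "203.0.113.25") into the 64-bit
// integer used by the BeginIp/EndIp columns:
// a.b.c.d -> a*16777216 + b*65536 + c*256 + d
public static long IpToNumber(string ipAddress)
{
    string[] octets = ipAddress.Split('.');
    if (octets.Length != 4)
        throw new ArgumentException("Expected a dotted IPv4 address.");

    long result = 0;
    foreach (string octet in octets)
    {
        result = (result << 8) + byte.Parse(octet);  // shift in each octet
    }
    return result;
}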
Both of these queries are perfectly fine, particularly as generic samples. The second one is on the right track, but it doesn’t work in joins, so if you’re querying over a range, you’d need to refactor. And in the first example, using a BETWEEN operator forces a clustered index scan when joining, killing the performance. If we run the first example across my minified Ip2LocationSmall table (and this is running off of SQL Azure – the perf is pretty great compared to localhost!), the execution plan shows the scan, and the time: ouch! Now, it doesn’t seem too bad, but imagine doing thousands of these requests per minute, or doing large joins. The goal, then, is to provide some hints that will optimize the query, particularly for joins. Our indexes are correct, so we can rework the query to get rid of the BETWEEN operator – we can sacrifice a little readability and do something like:
SELECT *
FROM (
    SELECT (SELECT MAX(BeginIp)
            FROM dbo.Ip2LocationSmall
            WHERE BeginIp <= @IpNum) AS IP_Begin
) AS foo
INNER JOIN dbo.Ip2LocationSmall iploc ON iploc.BeginIp = foo.IP_Begin
The result: a much better plan, and the time shows some improvement. But the REAL benefit comes when we need to join. Suppose I’d like to get a list of the countries for a given map (which is a parameter called MapId):
SELECT DISTINCT(ip.CountryCode)
FROM MapHits hits
INNER JOIN dbo.Ip2LocationSmall ip
    ON hits.IpNum BETWEEN ip.BeginIp AND ip.EndIp
WHERE MapId = @MapId
The query returns 95 rows, and executes in 16 seconds. In this case, we can refactor it using the method above to something like:
SELECT DISTINCT(CountryCode)
FROM (
    SELECT IpNum,
           (SELECT MAX(BeginIp)
            FROM Ip2LocationSmall
            WHERE BeginIp <= IpNum) AS IP_Begin
    FROM dbo.MapHits AS hits
    WHERE MapId = @MapId
) AS foo
INNER JOIN Ip2LocationSmall iploc ON iploc.BeginIp = foo.IP_Begin
Again, not as pretty looking, but boy what a difference: we went from 16,500 milliseconds to 260 – over 60x the performance! Mike @AngryPets would be proud. The reason for the perf gain is that we were able to eliminate the nested loop, which was (in this case) scanning the entire clustered index for the matching rows. The second benefit is the ability to switch schemas easily between Ip2Location and IPinfoDB, and we can additionally lose the EndIp column, which trims the size of the table.
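To tie it together, here's a rough sketch of resolving a single address from code, using the refactored query and the IpToNumber helper above (the table and column names match the examples; the method itself is mine):

public static string ResolveCountry(SqlConnection conn, string ipAddress)
{
    const string sql = @"
        SELECT iploc.CountryCode
        FROM (SELECT (SELECT MAX(BeginIp)
                      FROM dbo.Ip2LocationSmall
                      WHERE BeginIp <= @IpNum) AS IP_Begin) AS foo
        INNER JOIN dbo.Ip2LocationSmall iploc ON iploc.BeginIp = foo.IP_Begin;";

    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@IpNum", IpToNumber(ipAddress));
        return (string)cmd.ExecuteScalar();  // null if nothing matched
    }
}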
SQL Azure currently has fairly limited management capabilities. When you create a database, you receive an administrator account that is tied to your login (you can change the SQL Azure password, though). Because there is no GUI for user management, there’s a temptation to use this account in all your applications, but I highly recommend you create users for your applications that have limited access.

If you limit access to only stored procedures, you need to grant execute permissions. Assuming you want your connection to have execute permissions on all stored procedures, I recommend a new role that holds execute permission. That way, you can simply add users to this role, and as you add more stored procedures, it simply works. To create this role, you can do something like this:

CREATE ROLE db_executor
GRANT EXECUTE TO db_executor
Now in the master database (currently, you need to do this in a separate connection – just saying ‘use master’ won’t work) you can create your login for the database:
CREATE LOGIN MyUserName
WITH PASSWORD = 'Password';
In your application database, you need to create a user – in this case, we’ll just create a user with the same name as the login:
CREATE USER MyUserName FOR LOGIN MyUserName;
Next, we’ll add the user to the appropriate roles. Depending on your needs, you may only need db_datareader. I recommend db_owner only if absolutely necessary.
-- read/write/execute permissions
EXEC sp_addrolemember N'db_datareader', N'MyUserName'
EXEC sp_addrolemember N'db_datawriter', N'MyUserName'
EXEC sp_addrolemember N'db_executor', N'MyUserName'
-- only if you need dbo access:
EXEC sp_addrolemember N'db_owner', N'MyUserName'
You can continue to customize as necessary, as long as you are familiar with the appropriate T-SQL.
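From there, your application connects with the limited login instead of the admin account. A sketch of what that might look like (server and database names are placeholders; SQL Azure logins typically use the user@server form):

// Placeholder server/database names – adjust for your own account.
const string connectionString =
    "Server=tcp:yourserver.database.windows.net,1433;" +
    "Database=MyAppDb;" +
    "User ID=MyUserName@yourserver;" +
    "Password=Password;" +
    "Encrypt=True;";

using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
{
    conn.Open();  // runs with only the roles granted above
}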