Wednesday, November 12, 2008

Analytical Approach To Solving Programming Problems

In the six months since I last updated this blog, I've been working on various web application projects and learning a lot about ASP.NET AJAX and the Web Client Software Factory. Nevertheless, this post isn't about any particular technology. In my opinion, software developers already have way too many technologies, frameworks, programming languages, and APIs available to them; it's a challenge just to keep up with all the new stuff that comes out. What I want to discuss instead are the benefits of an analytical approach to programming problems.

Here is a sample problem. Imagine a virus spreading through the cells of a very large two-dimensional matrix. We start with a relatively healthy matrix in which only 10 random cells are infected. Every minute, the virus spreads from each infected cell to its 4 adjacent cells. For example, if "." represents a healthy cell, this is how the epidemic progresses:

Start

.........
.........
....0....
.........
.........


After the first minute

.........
....1....
...101...
....1....
.........


After the second minute

.........
....2....
...212...
..21012..
...212...
....2....
.........


Of course, the virus starts spreading from 10 different places on the surface, so depending on where these cells are located, the time it takes to infect the entire matrix can vary. Our task is to find that time, given the 10 initial locations.

It may be tempting to rely on the raw processing power of modern computers and concoct a solution that looks like this:

while (!matrix_is_fully_infected)
{
    infect_next_set_of_cells();
}

The model above simply recreates the behavior of the virus. The obvious drawback is the sheer inefficiency of the algorithm: we end up scanning the entire matrix an unknown number of times, and as the matrix grows, the inefficiency becomes ever more evident. Still, this may be a valid approach in cases where there is no easy analytical solution. Fortunately, our virus has a primitive DNA and lends itself to a mathematical definition.

For simplicity, let's assume that we begin with a single infected cell with coordinates (a, b). The number of minutes it takes to infect an arbitrary cell (x, y) can be expressed with this simple formula: |a - x| + |b - y| (the Manhattan distance between the two cells). Now let's assume we had a second infected cell at the beginning: (c, d). We can use a similar formula to find out how many minutes it needs to infect our arbitrary cell (x, y): |c - x| + |d - y|.

Depending on whether (a, b) or (c, d) is located closer to (x, y), one of the above expressions will produce a smaller number of minutes. That smaller number is the answer to the question "how long does it take to infect a single arbitrary cell?". Going from 2 initial cells back to our original 10, we can write the answer as a function of (x, y):

min(|a_i - x| + |b_i - y|), where the minimum is taken over 1 <= i <= 10 and (a_i, b_i) are the initial infected cells

Of course, our job is not done yet - the virus doesn't stop until all cells are infected. What we need to find is how many minutes it takes to infect the last cell. Evidently, this is the maximum time across the matrix, so our solution is to take the maximum of the above function:

max( min(|a_i - x| + |b_i - y|) ), where the maximum is taken over all cells (x, y) of the matrix and the minimum, as before, over 1 <= i <= 10

As you can easily see, the analytical approach provides a significant performance improvement - we now only need to scan the matrix once, computing at most 10 distances per cell.
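
To make this concrete, here is a minimal C# sketch of the single-pass solution. The method name, parameter layout, and seed representation are my own illustrative choices, not part of the original problem statement:

using System;

static class VirusSolver
{
    // One pass over the matrix: for each cell, take the minimum Manhattan
    // distance to any of the initially infected cells ("seeds"), then keep
    // the maximum of those minimums - the minute the last cell gets infected.
    public static int MinutesToFullInfection(int rows, int cols, int[,] seeds)
    {
        int worst = 0;
        for (int x = 0; x < rows; x++)
        {
            for (int y = 0; y < cols; y++)
            {
                int nearest = int.MaxValue;
                for (int i = 0; i < seeds.GetLength(0); i++)
                {
                    int minutes = Math.Abs(seeds[i, 0] - x) + Math.Abs(seeds[i, 1] - y);
                    if (minutes < nearest) nearest = minutes;
                }
                if (nearest > worst) worst = nearest;
            }
        }
        return worst;
    }
}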

Thursday, May 01, 2008

Automatic Deployment of SQL Scripts

1. Background

In my article "Introduction To Change Management" I wrote about the fundamental flaw in the deployment process:

Developers have first-hand knowledge of the deployment artifacts, but they rarely have security privileges on servers and databases outside of the development environment. DBAs and operations, on the other hand, have the permissions and are usually in charge of performing deployments, but they rarely have a good understanding of the deployment artifacts and their relationships.

Suppose we need to deploy a new build of assembly Foo.dll. A method inside it relies on a specific version of stored procedure dbo.Bar, so the updated SQL script must be deployed together with the assembly. This is where our process becomes vulnerable to human error, since IT, the DBA, the build engineer, and the release coordinator all have an opportunity to break the deployment. Take into consideration the sheer number of components and database objects, and you can understand why it usually takes several iterations to make a successful push.

The standard two-track deployment process has other drawbacks, too. For example, it turns DBAs into very expensive clerks as they mindlessly combine scripts together and press the "F5" button. Another problem is that it creates the illusion that database scripts can be safely deployed as a hotfix (while binary code requires regression testing). In reality, application and database code must be treated as a single entity.


2. Solution

My solution is to bundle all SQL scripts that are required by an assembly with the assembly itself, using embedded resources. During deployment, an IT engineer only needs to invoke the standard .NET installation utility:

Installutil.exe <assemblyname>.dll

This invokes a custom installer class, which enumerates the script resources in the assembly and executes them in a predefined order.

Although it sounds pretty simple, the above approach has the potential to eliminate a lot of deployment issues, because it gives the developer - someone intimately familiar with the implementation details - full control over the database deployment process.

It also encourages a modular approach to software design by reducing external dependencies of the assembly.


3. How To Bundle Scripts With Assembly

3.1 Adding resources.

Adding SQL scripts to an assembly is a very straightforward procedure. In Visual Studio Solution Explorer, right-click the project name and choose "Add Existing Item" from the context menu (it is even better to create a subfolder for these files). After the file has been added, open its properties and change Build Action from "None" to "Embedded Resource".

3.2 Marking resources as SQL scripts.

When deploying SQL scripts, order is very important. For example, if you have a stored procedure which selects records from table dbo.Customers, you have to create the table before you can create the stored procedure. So, in order to maintain the order, I created a new custom attribute, SqlScriptResourceAttribute:

[AttributeUsage(AttributeTargets.Assembly, Inherited = false, AllowMultiple = true)]
public sealed class SqlScriptResourceAttribute : Attribute
{
    public SqlScriptResourceAttribute(string scriptName, int scriptSequence)
    {
        this.ScriptName = scriptName;
        this.ScriptSequence = scriptSequence;
    }

    // Properties the installer reads when locating and ordering scripts.
    public string ScriptName { get; private set; }
    public int ScriptSequence { get; private set; }
}

When using the attribute, it is important to specify the fully qualified name of the script, which is the assembly's default namespace (usually the same as the assembly name) + subfolder name (if any) + file name. In the example below, the table creation script t_Customer.sql will be executed before the other scripts.

[assembly: SqlScriptResource("MyAssembly.pr_Customer_s.sql", 2)]
[assembly: SqlScriptResource("MyAssembly.pr_Customer_i.sql", 3)]
[assembly: SqlScriptResource("MyAssembly.t_Customer.sql", 1)]

3.3 Adding Custom Installer

In order to be able to run installutil.exe against your assembly, you need to add a custom installer to it. This is done using a few lines of code:

using System.ComponentModel;
using System.Configuration.Install;

namespace MyAssembly
{
    [RunInstaller(true)]
    public class MyAssemblyInstaller : Installer
    {
        public MyAssemblyInstaller()
        {
            ScriptInstaller installer = new ScriptInstaller(this.GetType().Assembly);
            this.Installers.Add(installer);
        }
    }
}

All we need to do is initialize an instance of the ScriptInstaller class with a reference to the current assembly and add it to the Installers collection.


4. How To Implement ScriptInstaller

ScriptInstaller is a subclass of System.Configuration.Install.Installer. It overrides the standard Install(IDictionary stateSaver) method and implements the SQL script deployment logic.
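
As a rough sketch (the details follow in sections 4.1 - 4.3), the class might look like the code below; the constructor parameter is simply the assembly whose embedded scripts we are going to deploy:

using System.Collections;
using System.Configuration.Install;
using System.Reflection;

public class ScriptInstaller : Installer
{
    private readonly Assembly _currentAssembly;

    public ScriptInstaller(Assembly assembly)
    {
        _currentAssembly = assembly;
    }

    public override void Install(IDictionary stateSaver)
    {
        base.Install(stateSaver);

        // Enumerate the embedded scripts in sequence order, load each one,
        // and execute it against the target database (sections 4.1 - 4.3).
    }
}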

4.1 Enumerating Scripts.

I use Reflection to retrieve the embedded resource names and the list of SqlScriptResourceAttribute instances, then cross-check them to ensure integrity. Finally, I add all script resource names to a SortedList, ordered by execution sequence.

private SortedList<int, string> BuildScriptList()
{
    SortedList<int, string> scripts = new SortedList<int, string>();
    List<string> resources = new List<string>(_currentAssembly.GetManifestResourceNames());

    object[] attributes = _currentAssembly.GetCustomAttributes(typeof(SqlScriptResourceAttribute), false);
    if (attributes != null)
    {
        foreach (object item in attributes)
        {
            SqlScriptResourceAttribute attrib = (SqlScriptResourceAttribute)item;
            if (resources.Contains(attrib.ScriptName))
            {
                scripts.Add(attrib.ScriptSequence, attrib.ScriptName);
            }
            else
            {
                Context.LogMessage(string.Format("## Script {0} not found in the current assembly.", attrib.ScriptName));
            }
        }
    }
    return scripts;
}

4.2 Loading Scripts.

Loading scripts from the assembly is accomplished using the GetManifestResourceStream method of the Assembly class:

foreach (string scriptName in BuildScriptList().Values)
{
    Context.LogMessage(string.Format("## Installing script: {0}", scriptName));
    Stream resourceStream = _currentAssembly.GetManifestResourceStream(scriptName);
    string sqlScript;
    using (StreamReader sr = new StreamReader(resourceStream))
    {
        sqlScript = sr.ReadToEnd();
    }
    DeployScript(sqlScript);
}


4.3 Deploying Scripts.

Although this is essentially a standard ADO.NET ExecuteNonQuery() operation, there are two potential caveats. The first is database security. However your code normally builds its database connection string, it is very unlikely you will be able to use that string with DDL scripts: at runtime you commonly get only "execute" (and possibly "select") permissions. So, in order to successfully execute deployment scripts, you need to tweak the connection string to establish a more privileged security context. The best approach, in my opinion, is to use SQL Server integrated security and assume that the person executing the installutil command has the relevant SQL Server permissions.
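
For illustration, here is a minimal sketch of such a connection string. The helper method and its parameters are hypothetical - they are not part of ScriptInstaller as described above:

using System.Data.SqlClient;

// Builds a deployment-time connection string that authenticates with the
// Windows identity of whoever runs installutil.exe (integrated security).
static string BuildDeploymentConnectionString(string server, string database)
{
    SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder();
    builder.DataSource = server;        // e.g. "localhost"
    builder.InitialCatalog = database;  // e.g. "Northwind"
    builder.IntegratedSecurity = true;  // no SQL login; use the caller's Windows account
    return builder.ConnectionString;
}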

The second potential pitfall is related to the fact that most deployment scripts contain multiple batches of T-SQL separated by the "GO" command, e.g.:

IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = OBJECT_ID(N'[dbo].pr_Customer_s') AND OBJECTPROPERTY(id,N'IsProcedure') = 1)
DROP PROCEDURE [dbo].pr_Customer_s
GO

CREATE PROCEDURE dbo.pr_Customer_s


Keyword "GO" is not part of TSQL language definition, so if you try to call ExecuteNonQuery on the above script directly, you will get an exception. What you need to do is split the script into individual batches using "GO" keyword as a guide, and then execute each batch individually:

// Assumes "GO" appears on its own line with CRLF line endings; a more robust
// version would also handle lowercase "go" and surrounding whitespace.
string[] batches = sqlScript.Split(new string[] { "\r\nGO\r\n" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sqlBatch in batches)
{
    DbCommand cmd = db.GetSqlStringCommand(sqlBatch);
    db.ExecuteNonQuery(cmd);
}

5. Wrapping Up

There are situations where the solution described in this article is either not applicable (for instance, when a company ships its software on a CD with a special setup application) or doesn't add value (for example, when database engineers - not .NET programmers - develop the database scripts). But if your change management process separates database scripts from binary code deployment, you may have an opportunity to streamline it. Automated script installation will reduce the number of deployment issues, and DBAs will surely feel relieved that they no longer need to perform this mundane task.

Tuesday, March 11, 2008

WCSF 2.0 And Multiple Websites

The Patterns and Practices team has just released version 2.0 of the Web Client Software Factory. Among the new features are an MVP pattern implementation for user controls and master pages, and a few AJAX extender controls.

PageFlow is not present in this release, and its future seems murky. I spent some time trying to implement PageFlow in my project and was disappointed by its complexity. The benefit-to-effort ratio was definitely too low, and I eventually gave up on it.

I followed the steps outlined in the documentation and migrated my solution from the June 2007 (1.1) release of the factory without much trouble. Unfortunately, there is still no recipe to add a new website to the solution. So, if you need more than one, make a copy of the existing website and add it. Make sure the little file called vwd.webinfo is present in the root of the website; otherwise WCSF will not "recognize" it - the new site name will not appear in drop-down lists, or will be grayed out, when you execute various recipes from the guidance package. I don't believe this tip is documented.

Monday, January 28, 2008

LINQ To SQL Performance

In the interest of full disclosure, I must say that I like LINQ. I think the declarative approach to data manipulation is just wonderful! It makes code much cleaner and code maintenance much easier. There is an overhead, of course, but with LINQ to Objects all calls are in-process, so it's still going to be fast.

When it comes to LINQ to SQL, though, the game changes somewhat: not only is database access out-of-process, it is also notoriously tricky. Poorly designed queries can take minutes instead of seconds and drain server resources, so it is very important to know what T-SQL is being generated for a given LINQ query. But that is a topic for another post...

What I really wanted to know is - all other things being equal - how much performance overhead LINQ to SQL adds on top of ADO.NET. So, I built my test harness around one simple query against the Northwind database:

SELECT [t0].[OrderID], [t0].[OrderDate], [t0].[CustomerID], [t1].[CompanyName], (
SELECT COUNT(*)
FROM [dbo].[Order Details] AS [t2]
WHERE [t0].[OrderID] = [t2].[OrderID]
) AS [ProdCount]
FROM [dbo].[Orders] AS [t0]
INNER JOIN [dbo].[Customers] AS [t1] ON [t0].[CustomerID] = [t1].[CustomerID]
WHERE [t0].[OrderID] = @OrderID

Basically, for a given order ID, we are retrieving a single row with information from the Orders, Customers, and Order Details tables:

OrderID OrderDate CustomerID CompanyName ProdCount
----------- ----------------------- ---------- ---------------------------------------- -----------
10248 1996-07-04 00:00:00.000 VINET Vins et alcools Chevalier 3

(1 row(s) affected)

In order to get measurable results, I ran this query for every one of the 830 orders (independently, to simulate a multi-user application). For benchmarking, I used three alternative approaches: dynamic SQL, parameterized dynamic SQL, and a stored procedure. Because SQL Server caches query execution plans, I restarted it before switching to a different approach.

Dynamic SQL

In the method below, I am concatenating the order ID value to the end of the WHERE clause. This technique is known to perform poorly, because SQL Server doesn't recognize that we are issuing the same query and has to compile a new plan every time. Indeed, the initial run took approximately 4100 ms on my laptop. SQL Server caches even these ad hoc plans, though, so subsequent executions of the test yielded a much better result of roughly 200 ms.

private TimeSpan RunADOTest()
{
    using (SqlConnection conn = new SqlConnection(ConfigurationManager.ConnectionStrings["Northwind"].ConnectionString))
    {
        conn.Open();
        SqlCommand cmd = conn.CreateCommand();
        cmd.CommandType = CommandType.Text;
        DateTime dtStart = DateTime.Now;
        for (int orderId = 10248; orderId < 11078; orderId++)
        {
            cmd.CommandText = @"SELECT [t0].[OrderID], [t0].[OrderDate], ...
                WHERE [t0].[OrderID] = " + orderId.ToString();
            SqlDataReader dr = cmd.ExecuteReader();
            dr.Close();
        }
        return DateTime.Now.Subtract(dtStart);
    }
}

Parameterized Dynamic SQL

The only difference from the previous method is that the query's WHERE clause changed to "WHERE [t0].[OrderID] = @OrderID". I also added these lines before calling cmd.ExecuteReader():

cmd.Parameters.Clear();
cmd.Parameters.Add(new SqlParameter("@OrderID", orderId));

As expected, performance improved. After a SQL Server restart, the method completed in 460 ms, and subsequent executions took around 190 ms. It's also important to realize that SQL Server cached only one query plan, not 830 as in the previous example.

Stored Procedure

Ever since Microsoft added query plan caching for dynamic queries in SQL Server 2000, there has been virtually no performance difference between a stored procedure and a parameterized dynamic query. Of course, there are still many good reasons for writing stored procedures (greater security, better code reuse, reduced network traffic).

In my test harness, the results were nearly identical to parameterized dynamic SQL: the initial run (after a SQL Server restart) took 453 ms, and subsequent executions took 187 ms.

LINQ To SQL

I dropped the Customers, Orders, and Order Details tables onto the Visual Studio object relational designer surface to create the LINQ to SQL classes. To ensure that all calls are executed over a single database connection, NorthwindDataContext is initialized with an open SqlConnection object. I also wanted to make sure the T-SQL generated by LINQ was the same as in the previous tests, so I installed the SqlServerQueryVisualizer component and ran a SQL Profiler trace. Indeed, both the parameterized dynamic SQL and LINQ to SQL tests issued exactly the same exec sp_executesql command. Note that I had to enumerate the query variable in a foreach statement to make it run at all (because of deferred execution).

The test results were, frankly, disappointing. After a SQL Server restart, the method took 4250 ms. Subsequent executions took between 3300 and 3400 ms - more than 10 times slower than any of the other tests.

private TimeSpan RunLINQTest()
{
    DateTime dtStart = DateTime.Now;
    using (SqlConnection conn = new SqlConnection(ConfigurationManager.ConnectionStrings["Northwind"].ConnectionString))
    {
        conn.Open();
        NorthwindDataContext db = new NorthwindDataContext(conn);

        for (int orderId = 10248; orderId < 11078; orderId++)
        {
            var query = from o in db.Orders
                        join c in db.Customers on o.CustomerID equals c.CustomerID
                        where o.OrderID == orderId
                        select new
                        {
                            o.OrderID,
                            o.OrderDate,
                            o.CustomerID,
                            c.CompanyName,
                            ProdCount = o.Order_Details.Count
                        };

            // Enumerate the query to force execution (LINQ queries are deferred).
            foreach (var item in query)
            {
                string s = item.CompanyName;
            }
        }

        return DateTime.Now.Subtract(dtStart);
    }
}
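
One plausible contributor to the overhead is that LINQ to SQL translates the query's expression tree into T-SQL on every iteration of the loop. The framework offers CompiledQuery.Compile to perform that translation once up front. Below is a sketch of how the test query could be pre-compiled; the OrderSummary type and Queries class are illustrative names of mine, and I have not benchmarked this variant:

using System;
using System.Data.Linq;
using System.Linq;

// Named result type: a compiled query's delegate signature cannot use
// an anonymous type.
public class OrderSummary
{
    public int OrderID;
    public DateTime? OrderDate;
    public string CustomerID;
    public string CompanyName;
    public int ProdCount;
}

public static class Queries
{
    // Translated to T-SQL once; the delegate is then reused for every order ID.
    public static readonly Func<NorthwindDataContext, int, IQueryable<OrderSummary>> OrderById =
        CompiledQuery.Compile((NorthwindDataContext db, int orderId) =>
            from o in db.Orders
            join c in db.Customers on o.CustomerID equals c.CustomerID
            where o.OrderID == orderId
            select new OrderSummary
            {
                OrderID = o.OrderID,
                OrderDate = o.OrderDate,
                CustomerID = o.CustomerID,
                CompanyName = c.CompanyName,
                ProdCount = o.Order_Details.Count
            });
}

// Usage inside the loop:
// foreach (OrderSummary item in Queries.OrderById(db, orderId)) { ... }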

Test Summary

The table below summarizes the test results. As you can see, once the initial query plans are cached, dynamic SQL and the stored procedure have essentially the same performance, and either approach is roughly 94% faster than LINQ to SQL.

I can't say I'm happy with these results (after all, I do like LINQ), but at this point I don't see what else could be contributing to the delay. Unfortunately, I only have the Professional edition of Visual Studio, which doesn't include a profiler, so I can't investigate further. However, I welcome any comments or corrections. The source code of the test harness is posted here: http://members.cox.net/rmamedov/blog/LINQ2SQLPerformanceTest.zip.

                            Initial (ms)    Subsequent (ms)
Dynamic SQL                 4100            200
Parameterized Dynamic SQL   460             190
Stored Procedure            453             187
LINQ to SQL                 4250            3350