Riyad's Technical Blog: Performance

Showing posts with label Performance. Show all posts

Friday, November 13, 2009

Why I Hate Entity Framework

A few months ago I took over development of an insurance-related system. It was a Windows Forms application created for a small workgroup. Original design utilized ADO.NET Entity Framework and it was my first serious encounter with the technology. Immediately, I sensed that something wasn't right: application was too slow. There were only a dozen or so tables and hardly any data in them, yet forms took whole seconds to load. Database updates were even more problematic - not only they took long time, but sometimes they failed for no apparent reason.

Initially, I tried to write it all off as part of a learning curve. I recreated entity model and ran a SQL Profiler in an effort to better understand the technology. Profiler results were simply shocking: instead of executing a half-dozen or so SELECT statements in response to user opening a record, Entity Framework generated hundreds of them (trace file was over 1Mb in size).

So, I remembered the little performance test I did for LINQ to SQL about two years ago and decided to expand it to include Entity Framework. New method looks similar to the one used to test LINQ to SQL:

private TimeSpan RunEntityTest()
{
var swatch = Stopwatch.StartNew();
NorthwindEntities db = new NorthwindEntities(
ConfigurationManager.ConnectionStrings["NorthwindEntities"].ConnectionString);

for (int orderId = 10248; orderId < 11078; orderId++)
{
var query = from o in db.Orders
    where o.OrderID == orderId
    select new
    {
        o.OrderID,
        o.OrderDate,
        o.Customers.CustomerID,
        o.Customers.CompanyName,
        ProdCount = o.Order_Details.Count
    };

foreach (var item in query)
{
    string s = item.CompanyName;
}
}
return swatch.Elapsed;
}

By the way, that original blog post has been criticized by an anonymous guest, who pointed out that my test routines for stored procedures, dynamic SQL and parameterized SQL were not equivalent to LINQ test, because I never read any values from SqlDataReader object after opening it. So, I changed the code in all three methods as follows:

SqlDataReader dr = cmd.ExecuteReader();
while (dr.Read())
{
string s = Convert.ToString(dr["CompanyName"]);
}
dr.Close();

I'm glad to say that this change didn't make any difference to test results: LINQ to SQL is still 40 times slower than stored procedures or dynamic SQL. But Entity Framework results were much worse: 2.2 times slower than LINQ to SQL. Here is the full table:

METHOD	AVERAGE TIME, ms
Stored Procedure	110
Dynamic SQL	115
Parameterized Dynamic SQL	126
LINQ to SQL	4,699
Entity Framework	10,547

So, in the end, I rewrote the data layer of the application using LINQ to SQL.

Updated test harness code is available here: http://members.cox.net/rmamedov/blog/EntityFrameworkTestHarness.zip

Wednesday, November 12, 2008

Analytical Approach To Solving Programming Problems

In the six months that passed since I updated this blog I've been working on various web application projects, learning a lot about ASP.Net Ajax and Web Client Software Factory. Nevertheless, this posting isn't about any particular technology. In my opinion, software developers already have way too many technologies, frameworks, programming languages, and APIs available to us. It's a challenge just to keep up with all the new stuff that comes out. What I want to discuss instead are the benefits of the analytical approach to programming problems.

Here is a sample problem. Imagine there is a virus spreading through the cells of a very large two-dimensional matrix. We start with a relatively healthy matrix with only 10 random cells infected. The virus is spreading by infecting 4 adjacent cells every minute. For example, if "." represents a healthy cell, this is how the epidemic will progress:

Start


.........
.........
....0....
.........
.........

After first minute


.........
....1....
...101...
....1....
.........

After second minute


.........
....2....
...212...
..21012..
...212...
....2....
.........

Of course, the virus starts spreading from 10 different places on the surface, so depending on where these cells are, the time it takes to infect entire matrix can vary. Our task is to find that time given 10 initial locations.

It may be tempting to rely on a raw processing power of modern computers and concoct a solution that looks like this:

while (!matrix_is_fully_infected)
{
infect_next_set_of_cells();
}

The model above simply recreates the behavior of the virus. The obvious drawback here is the sheer inefficiency of the algorithm: we end up scanning entire matrix an unknown number of times. As matrix size increase, the inefficiency will be more evident. Still, this may be a valid approach in some cases, where there is no easy analytical solution. Fortunately, our virus has a primitive DNA and yields itself to mathematical definition.

For simplicity, let's assume that we begin with a single infected cell with coordinates (a,b). The number of minutes it takes to infect an arbitrary cell (x,y) can be expressed with this simple formula: |a-x|+|b-y|. Now let's assume we had a second infected cell at the beginning: (c,d). We could use a similar formula to find out how many minutes it will need to infect our arbitrary cell (x,y): |c-x|+|d-y|.

Depending on whether (a,b) or (c,d) is located closer to (x,y), one of the above expressions will produce a smaller number of minutes. This will be the answer to the question "how long it takes to infect a single arbitrary cell". As we go from 2 infected cells to the original 10, we can write the answer as a function of (x,y):

min(|a_i - x| + |b_i - y|), where 1 <= i <= 10

Of course, our job is not done yet - the virus doesn't stop until all cells are infected. What we need to find out is how many minutes it will take to infect the last cell. Evidently, this will be the maximum time across the matrix, so our solution will be to take the maximum of the above function:

max( min(|a_i - x| + |b_i - y|) ), where x and y vary across matrix dimensions

As you can easily see, analytical approach provided significant performance improvement - we now only need to scan the matrix once.

Monday, January 28, 2008

LINQ To SQL Performance

In the interest of full disclosure, I must say that I like LINQ. I think the declarative approach to data manipulation is just wonderful! It makes code much cleaner and code maintenance much easier. There is an overhead, of course, but with LINQ to Objects all calls are in-process, so it's still going to be fast.

When it comes to LINQ to SQL, though, the game changes somewhat: not only database access is out-of-process, it is also notoriously tricky. Poorly designed queries can take minutes instead of seconds and drain server resources, so it is very important to know what T-SQL is being generated for a given LINQ query. But that is a topic for another post...

What I really wanted to know is - all other things being equal - how much performance overhead does LINQ to SQL add on top of ADO.NET. So, I built my test harness around one simple query against Northwind database:


SELECT [t0].[OrderID], [t0].[OrderDate], [t0].[CustomerID], [t1].[CompanyName], (
 SELECT COUNT(*)
 FROM [dbo].[Order Details] AS [t2]
 WHERE [t0].[OrderID] = [t2].[OrderID]
 ) AS [ProdCount]
FROM [dbo].[Orders] AS [t0]
INNER JOIN [dbo].[Customers] AS [t1] ON [t0].[CustomerID] = [t1].[CustomerID]
WHERE [t0].[OrderID] = @OrderID

Basically, for a given order ID, we are retrieving a single row with information from order, customer, and order details tables:


OrderID     OrderDate               CustomerID CompanyName                              ProdCount
----------- ----------------------- ---------- ---------------------------------------- -----------
10248       1996-07-04 00:00:00.000 VINET      Vins et alcools Chevalier                3

(1 row(s) affected)

In order to get measurable results, I wanted to run this query for every one of 830 orders (independently, to simulate a multi-user application). For benchmarking, I used three alternative approaches: dynamic SQL, dynamic SQL with parameters, and stored procedure. Because SQL Server caches query execution plans, I restarted it before switching to a different approach.

Dynamic SQL

In the method below, I am concatenating order ID value to the end of the WHERE clause. This technique is known to have poor performance, because SQL Server doesn't realize we are using the same query and will have to compile it every time. Indeed, the initial run took approximately 4100 ms on my laptop. SQL Server caches query plans, though, so subsequent executions of the test yielded a much better result of roughly 200 ms.

private TimeSpan RunADOTest()
{
  using (SqlConnection conn = new SqlConnection(ConfigurationManager.ConnectionStrings["Northwind"].ConnectionString))
  {
     conn.Open();
     SqlCommand cmd = conn.CreateCommand();
     cmd.CommandType = CommandType.Text;
     DateTime dtStart = DateTime.Now;
     for (int orderId = 10248; orderId < 11078; orderId++)
     {
        cmd.CommandText = @"SELECT [t0].[OrderID], [t0].[OrderDate], ...
          WHERE [t0].[OrderID] =  " + orderId.ToString();
        SqlDataReader dr = cmd.ExecuteReader();
        dr.Close();
   }
   return DateTime.Now.Subtract(dtStart);
  }
}

Parameterized Dynamic SQL

The only difference from previous method was that query WHERE clause changed to "WHERE [t0].[OrderID] = @OrderID". I also added these lines before calling cmd.ExecuteReader():


   cmd.Parameters.Clear();
   cmd.Parameters.Add(new SqlParameter("@OrderID", orderId));

As expected, performance has improved. After SQL Server restart, the method completed in 460 ms, and subsequent executions were around 190 ms. It's also important to realize that SQL Server has cached only one query plan and not 830 as in the previous example.

Stored Procedure

Ever since Microsoft added query plan caching for dynamic queries in SQL Server 2000, there really is no performance difference between stored procedure and a parameterized dynamic query. Of course, there are many good reasons for writing stored procedures (greater security, better code reuse, smaller network traffic).

In my test harness, results were nearly identical to parameterized dynamic SQL: initial run (after SQL Server restart) took 453 ms, and subsequent executions took 187 ms.

LINQ To SQL

I dropped Customers, Orders, and Order Details tables to the surface of Visual Studio object relational designer to create a LINQ to SQL classes. In order to ensure that all calls are executed using a single database connection, NorthwindDataContext is initialized with an open SqlConnection object. I also wanted to make sure T-SQL generated by LINQ is the same as in previous tests, so I installed SqlServerQueryVisualizer component and ran a SQLProfiler trace. Indeed, both Parameterized DSQL and LINQ to SQL tests issued the same exact exec sp_executesql command. I had to use query variable in a foreach statement to make it run (because of deferred execution).

Test results were, frankly, disappointing. After SQL Server restarted, the method took 4250 ms. Subsequent executions yielded between 3300 and 3400 ms, or more than 10 times slower than all other tests.


private TimeSpan RunLINQTest()
{
  DateTime dtStart = DateTime.Now;
  using (SqlConnection conn = new SqlConnection(ConfigurationManager.ConnectionStrings["Northwind"].ConnectionString))
  {
     conn.Open();
     NorthwindDataContext db = new NorthwindDataContext(conn);

     for (int orderId = 10248; orderId < 11078; orderId++)
     {
        var query = from o in db.Orders
          join c in db.Customers on o.CustomerID equals c.CustomerID
          where o.OrderID == orderId
          select new
          {
            o.OrderID,
            o.OrderDate,
            o.CustomerID,
            c.CompanyName,
            ProdCount = o.Order_Details.Count
          };

        foreach (var item in query)
        {
          string s = item.CompanyName;
        }
     }

     return DateTime.Now.Subtract(dtStart);
  }
}

Test Summary

The table below summarizes test results. As you can see, after initial query plans are cached, dynamic SQL and stored procedure have essentially the same performance. Either approach is 94% faster than LINQ to SQL.

Can't say I'm happy with these results (after all, I do like LINQ), but at this point I don't see what else can be contributing to the delay. Unfortunately, I only have Professional version of Visual Studio which doesn't include a profiler, so I can't research further. However, I welcome any comments or corrections. Source code of the test harness is posted here: http://members.cox.net/rmamedov/blog/LINQ2SQLPerformanceTest.zip.


                   Initial  Subsequent

Dynamic SQL             4100         200

Parameterized
Dynamic SQL              460         190

Stored Procedure         453         187

LINQ to SQL             4250        3350