Deferred Execution with Entity Framework and LINQ
Dave Toland
25/04/2020
What is Entity Framework?
Entity Framework is an Object Relational Mapper, or ORM for short. It's been around for quite some time now, and with the introduction of .NET Core came Entity Framework Core. Whilst I'll continue to refer to it in this article as Entity Framework, or EF as it's commonly known, I'm actually talking about EF Core. Conceptually, it's the same thing, but under the covers it works in a different way, and it's much easier to use now.
In very simple terms, it is a layer that sits in between your project and the database, and handles the translation between code and SQL. Of course, it does much more than that, and between Microsoft and the rest of the internet, its many facets and functions are very well documented. Entity Framework is a vast, complex beast, but essentially it creates an abstraction that helps you interact with your database using code that you're familiar with. I'm predominantly a C# developer and so in this article I'll be focusing on that language, but there are various others that can achieve the same result.
Deferred Execution
Deferred Execution is a very powerful mechanism: in simple terms, it lets you build queries without executing them just yet. It's a subject that's been written about a lot, is often misunderstood, and can be confusing when you first come across it. There are plenty of cases where you might want to use deferred execution, and plenty where you wouldn't, so it's important to understand exactly what it is. To clarify, deferred execution is not specific to Entity Framework; it's actually part of LINQ (Language Integrated Query). The area of LINQ which is pertinent to database interaction used to be known as LINQ-to-SQL (I think they've dropped that term now, though the concept of using LINQ to generate SQL is still very much current), and arguably this is where the benefit of deferred execution is most effective.
So why would you build a query and not execute it? This is possibly the most asked question on the topic, and one which I'll try to answer over the following sections. In a nutshell, it's so that we can build queries that cover a lot of bases and then filter and refine those queries based on the particular use case we have at the time. To properly understand why this is important, you need to understand that executing a query on a database is one of the most expensive operations in the lifecycle of an application. Large and complex database queries can have a massive impact on the performance of your application, and in extreme cases they can be very slow. Querying for large amounts of data, joining multiple tables together, only to select a few items from the result set and discard the rest, is one of the most wasteful and unnecessary ways to slow down your system (not to mention the cost in actual cash if you use a cloud hosted database).
Deferred execution is one way to slim down your query to only the columns and rows you actually require without the overhead of having to write a ton of finely grained complex query logic tailored to each individual use case you have.
IQueryable vs IEnumerable
By definition these are both collections of results, and to look at their outputs in a debugger you'd be forgiven for thinking they are essentially the same thing. They differ in a big way though - an IQueryable has not been executed, whereas an IEnumerable has; often through a call to ToList(), but there are other methods which have the same executing effect, like ToArray(), and to an extent ToDictionary() (which produces an IDictionary rather than an IEnumerable, but you get the idea). Note that AsEnumerable() is a little different: it doesn't execute the query itself, it just means that any further operators run in memory rather than being translated to SQL.
Whilst IQueryable vs IEnumerable is a catchy heading, we're actually talking about queries vs results here, and that includes individual results as well as collections. We can use methods like Single(), First() and Last() (including their OrDefault variations SingleOrDefault(), FirstOrDefault() and LastOrDefault()) to execute a query that follows the same pattern - essentially we're still talking about executing a LINQ (SQL) query on the database.
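To make that distinction concrete, here's a minimal sketch (using the same imaginary Vehicles set as the examples further down):

// No SQL has run yet - this is just a query definition
IQueryable<Vehicle> query = Vehicles.Where(x => x.VehicleType == "Car");

// Each of these executes the query against the database at the point it's called
List<Vehicle> all = query.ToList();          // SELECT ... every matching row
Vehicle first = query.FirstOrDefault();      // SELECT TOP 1 ...
int count = query.Count();                   // SELECT COUNT(*) ...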
So what's the point? You're going to execute it at some point anyway, so why the delay? Consider methods like Skip() and Take(), which allow you to return a subset of the rows in your results. Skip(), as the name suggests, lets you skip over a number of results. Say you're paging results, a common use case, particularly in API or MVC setups. You already have the first 10 results and you want the next 10; Skip() lets you ignore results 0-9 and Take() lets you return exactly the 10 items that follow. Used in combination, that's your paging mechanism. It's probably easier to go into an example here:
First we build our base query, we'll use an imaginary Vehicles table and we're showing vehicles that have the type Car, so we'd do something like:
Vehicles.Where(x => x.VehicleType == "Car");
Now let's assume there are hundreds of vehicles that are cars, and we're creating a paging system where the consumer can choose to only show 10 at a time. They've already had the first 10 results and are now asking for the next 10. We'd apply Skip() and Take() to produce this as follows:
Vehicles.Where(x => x.VehicleType == "Car").Skip(10).Take(10);
The difference between IQueryable and IEnumerable here is this - assuming that Vehicles is a table in the database, if we called ToList() on our initial Where query, we'd execute a SQL command that returned every row in the Vehicles table where the VehicleType is a car. That could be hundreds, thousands, millions... or more. We then call Skip() and Take() to reduce that down to just 10 items, but by that point we're doing it in memory. We've returned a potentially enormous amount of data from the database, which puts load not only on the application, but on the database itself and the network(s) in between... only to drop nearly all of that data once we've retrieved it. Hopefully you can see why this is a hugely inefficient operation. For simplicity's sake let's remove Skip() and just concentrate on Take():
IEnumerable<Vehicle> vehicles = Vehicles.Where(x => x.VehicleType == "Car").ToList();
return vehicles.Take(10);
You'll often see these calls happening in two different places, which makes the problem more subtle to spot. Perhaps there's a VehicleRepository which does the query and returns an IEnumerable<Vehicle>, and then a controller which calls that and then performs the Take(). I see this a lot, and at first, when there's not much data, or in a unit test scenario, the performance hit isn't realised. When you go to production though, and the database has lots of data, it's a different story. If you don't know what you're looking for, it can be difficult to diagnose the slow performance that you're not seeing in isolated tests, and it gets gradually worse over time as the system scales up and more data is added. I've seen teams go to great lengths with load testing to try and identify the problem, fail, and then sidestep the real issue by throwing costly extra CPU and hardware resources at it.
Effectively here you're executing a SELECT * FROM Vehicles WHERE VehicleType = 'Car' query on your database and returning every row of data you have, and if you've got a lot of columns in your Vehicles table this is only going to be worse.
However, by using deferred execution, and working with an IQueryable instead of an IEnumerable, we can tune the SQL statement that's built and make it far more efficient:
IQueryable<Vehicle> vehicles = Vehicles.Where(x => x.VehicleType == "Car");
return vehicles.Take(10).ToList();
It's a subtle change, to the eye it looks almost identical, but the effect is that we're now executing a SELECT TOP 10 * FROM Vehicles WHERE VehicleType = 'Car' query and returning just the 10 rows we care about, and you don't need to be a SQL expert to understand the difference that will make. The point is, using an IQueryable to filter the data before the SQL is executed is always going to be more efficient, because you're making the database do the work and reducing the result set it has to produce. This is just one example, and the same theory could be applied to things like Distinct(), Intersect(), Union() and even sorting with OrderBy(), which are all better off being done in the database, rather than in memory with code.
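As a quick sketch of composing several of these operators before execution (the ordering and page numbers here are just assumptions for illustration), everything chained onto the IQueryable ends up in the single SQL statement that runs at ToList():

IQueryable<Vehicle> cars = Vehicles.Where(x => x.VehicleType == "Car");

List<Vehicle> pageTwo = cars
    .OrderBy(x => x.Price)   // translated to ORDER BY, not sorted in memory
    .Skip(10)                // translated to OFFSET (or equivalent)
    .Take(10)                // translated to FETCH NEXT / TOP
    .ToList();               // the query runs here, returning only 10 rows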
Select and Projections
So far we've covered some ways to reduce the number of rows returned, and this alone can make a big difference, but what about reducing the columns? What we've seen above still executes a SELECT * (even if that's reduced to something like SELECT TOP N *), but if you've spent any time writing SQL you'll know that you generally don't want to do that. More often than not, database tables contain a whole load of columns - metadata, last updated dates, etc - that we don't need to return for many of our use cases. However, we also don't want to be writing a ton of different query methods with only microscopic differences between them to cover all the permutations we need.
This is where the almighty Select() method comes in. It covers a far wider topic area than just Entity Framework, so if you don't know what it is and does, go and learn it. This is a good place to start. Anyway, essentially, by using Select() we can apply column filtering to our query. Let's jump into another example..
Say we have a grid view in our UI that shows our paged Vehicle results. We've covered how to show 10 at a time, but let's imagine our Vehicles table has 20 columns, or 100, or more. In our grid view, we're only displaying a few bits of data; for argument's sake these are: Manufacturer, Model, Engine Size, Year, and Price. If we were writing the raw SQL, we'd go with something like SELECT Manufacturer, Model, EngineSize, Year, Price FROM Vehicles WHERE VehicleType = 'Car'.
How can we do that in code? We use a projection... What's a projection? It's basically the act of transforming the data from one shape into another, using a call to Select(). It's actually a really powerful call that can be used for a number of different tasks, but for this example we're going to use it to project our big data set into a smaller one:
IQueryable<Vehicle> vehicles = Vehicles.Where(x => x.VehicleType == "Car");
List<CarDTO> dtos = vehicles.Select(x =>
    new CarDTO
    {
        Manufacturer = x.Manufacturer,
        Model = x.Model,
        EngineSize = x.EngineSize,
        Year = x.Year,
        Price = x.Price
    })
    .ToList();
There's much more you can do here, like combining values (i.e. ManufacturerModel = $"{x.Manufacturer}/{x.Model}"), totalling, counting, or even doing subqueries, but we won't over-complicate the example with those. Just know that Select() can be used to reduce your query to a very fine-grained form that does what we've been trying to do all along - reduce the result set to just the data we need for this particular use case. It directly translates into a smaller SQL query, through the power of LINQ and deferred execution. What you're actually doing here under the covers is turning your SELECT * into SELECT Manufacturer, Model, EngineSize, Year, Price - hugely powerful.
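As a rough sketch of the totalling and counting side (the grouping and property names are just assumptions for illustration), a projection can also produce aggregates that EF translates into GROUP BY, COUNT and AVG rather than pulling every row back to do the maths in memory:

var summary = Vehicles
    .Where(x => x.VehicleType == "Car")
    .GroupBy(x => x.Manufacturer)
    .Select(g => new
    {
        Manufacturer = g.Key,
        CarCount = g.Count(),                  // COUNT(*) per manufacturer
        AveragePrice = g.Average(x => x.Price) // AVG(Price) per manufacturer
    })
    .ToList();                                 // one SQL statement with GROUP BY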
The added beauty of this is that underpinning it all is our base query that returns the IQueryable. That hasn't changed, and we can use it for both the examples above, and many more. Generally, you'd put that into some sort of data layer, and as you'll probably have all the other Entity Framework related bits in there as well - your models, your DbContext, any logic around model building, perhaps some partial classes to tag on properties, etc - it's nice to be able to keep the query logic as thin and generic (and readable) as possible.
By using projections you're offloading that extra query-building logic onto the framework itself. It all happens by convention, and means that you don't need to go to the extra lengths of creating a ton of stored procedures and so on. It's neat, it's very readable, and you really don't need to know much SQL to make it work effectively.
The Repository Pattern
I'm not going to go into great detail on what this pattern is; there are lots of very good articles on the internet about what it is, how to put it together and how to build projects around it. Very simply put, it's a way of separating out all of this logic into layers, and it does a very good job of enforcing several well documented design principles, including SOLID and SoC - Separation of Concerns. If you don't know about these, you should; I'm not into lecturing people on design theory so that's all I'll say on it, except that they'll make both your life, and the lives of those who have to read and work with your code, a lot easier in the long run.
The way I like to think about things in this regard is that a repository is the thing that deals with the data: it performs queries against the data source and it uses objects to represent the results. Theoretically speaking, the repository is the data source, but when we're working with a database then as far as I'm concerned that is the data source and the repository represents it. Anyway, semantics... Filtering, ordering and distinction of data, joining related tables or entities, etc are all jobs for the repository. Mostly, I'd say that if you're writing complex (or even not so complex) Where clauses, then they probably belong in a repository. You want to be able to call a repository with methods like GetAllCars(), GetCarsByManufacturer(), GetCarsUnder1000() and so on. Offload all that type of stuff to the repository so that you can make simple one-line calls to them. Most importantly - and this for some can be a point of contention - in my book these methods should nearly always return IQueryable<T>. I'll go into the theory of why some schools of thought say to avoid this a bit further down, but the benefit of doing it this way is that you can then leverage the power of deferred execution in your business logic layer. As long as you understand what you're doing, and you don't expose them to the consumer layers, there won't be any problems.
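As a minimal sketch of what such a repository might look like (the class and context names are assumptions for this example, not taken from a specific codebase):

// A thin repository that composes queries but never executes them -
// callers receive an IQueryable<Vehicle> and decide when to enumerate it.
public class VehicleRepository
{
    private readonly VehicleDbContext _context;

    public VehicleRepository(VehicleDbContext context) => _context = context;

    public IQueryable<Vehicle> GetAllCars() =>
        _context.Vehicles.Where(x => x.VehicleType == "Car");

    public IQueryable<Vehicle> GetCarsByManufacturer(string manufacturer) =>
        GetAllCars().Where(x => x.Manufacturer == manufacturer);

    public IQueryable<Vehicle> GetCarsUnder1000() =>
        GetAllCars().Where(x => x.Price < 1000);
}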
As I've just mentioned it, let's move up through the stack to the business logic layer. This is where we'll put logic that refines our data sets down to suit particular use cases. It's not a hard and fast rule, but this is how I like to do it. I don't like to do projections in the repositories; to me, that's not a data task. Domain objects are not part of the data layer, they're part of the domain, and therefore should be created, transformed and processed within the business logic. Use cases are your business requirements; they're not things that affect your data per se. In our above example, we ultimately have a grid of car data that's presumably being displayed to the user. Don't get hung up on presentation at the moment: whether it's being displayed on a form, a web page, a mobile device or something else is irrelevant at this point. We just know that we need to refine our vehicle data down into a subset of fields (manufacturer, model, engine size, etc) for the consumer. This is a use case for a particular business requirement, and therefore it goes in the business logic layer.
The CarDTO (call it what you like) is part of the domain, and the Vehicle entity is part of the data repository. One of the jobs of our business logic layer is to translate our data into a shape that we'll pass up to the next layer - be that an API to be converted into JSON, or an MVC view to be formatted for presentation; that's for the layer above to decide and it's a separate concern. By translating the entity into a DTO, or domain object, we're not only refining the underlying query to make it as efficient as possible, but we're also ensuring that no private data is exposed to layers that shouldn't have access to it. In an extreme example, if you were building a public User result here (perhaps for a Who's Online webpage), you wouldn't want to include sensitive information about addresses or, worse still, expose any password hashes. So, by using the business layer to project into domain objects which don't contain sensitive data at all, and encapsulating the entity data and logic in the repository layer, we're ensuring that data is never exposed to the upper layers that consumers have access to, whilst at the same time keeping the system lean and clean.
The way I like to put these layers together is as follows:
| Vehicle Controller | ⇄ | Vehicle Service | ⇄ | Vehicle Repository | ⇄ | Vehicles Table |
|---|---|---|---|---|---|---|
| (API Layer) | [DTO] | (Business Logic) | [IQueryable] | (Repository Layer) | [EF] | (Database) |
Think of it as a pipeline, from left to right, and back again. The controller handles the request and deals with things like HTTP and validation; the service first calls the repository, which in turn generates SQL via Entity Framework and calls the database, returning the results as an IQueryable<T>; the service then refines this through projection into a domain object and passes it back up for delivery. That's a very basic and high-level view of things, but hopefully you get the general approach, and that this gives you some insight into how the Repository Pattern can help to keep everything separated to avoid accidentally exposing sensitive information (amongst many other benefits). It certainly comes into its own when you're dealing with a big, complex solution.
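To make that pipeline a little more concrete, here's a minimal sketch of the service step (the class and method names are assumptions for this example, building on the repository sketch above):

// The service refines the repository's IQueryable and is the last place
// the query exists in deferred form - it returns fully executed DTOs.
public class VehicleService
{
    private readonly VehicleRepository _repository;

    public VehicleService(VehicleRepository repository) => _repository = repository;

    public List<CarDTO> GetCarPage(int pageNumber, int pageSize)
    {
        return _repository.GetAllCars()
            .OrderBy(x => x.Manufacturer)   // ordering done in SQL
            .Skip(pageNumber * pageSize)    // paging done in SQL
            .Take(pageSize)
            .Select(x => new CarDTO         // column filtering done in SQL
            {
                Manufacturer = x.Manufacturer,
                Model = x.Model,
                EngineSize = x.EngineSize,
                Year = x.Year,
                Price = x.Price
            })
            .ToList();                      // the single query executes here
    }
}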
AutoMapper and Entity Framework
AutoMapper is a fantastic toolset written by Jimmy Bogard and others. It does many things, but central to it all is the concept of enabling us to convert one thing to another with as little boilerplate code as possible. If you don't know about AutoMapper then check it out; it's extremely well documented, and Jimmy et al are active on Stack Overflow and other channels to help if need be. I use it in almost every project I create, and it's well worth investigating.
One of the most powerful sections of the library is Queryable Extensions, which gives us the ProjectTo<T> extension method. This can be applied to any IQueryable and effectively removes the need to explicitly write out the Select() call. Using our example above:
List<CarDTO> dtos = vehicles.Select(x =>
    new CarDTO
    {
        Manufacturer = x.Manufacturer,
        Model = x.Model,
        EngineSize = x.EngineSize,
        Year = x.Year,
        Price = x.Price
    })
    .ToList();
...becomes:
List<CarDTO> dtos = vehicles.ProjectTo<CarDTO>().ToList();
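For context, a minimal sketch of the mapping setup this relies on (the profile name is an assumption; depending on your AutoMapper version you may also need to pass the configuration provider into ProjectTo<T>):

// Requires: using AutoMapper;
// With matching property names the convention-based mapping is enough,
// and ProjectTo<CarDTO>() uses it to build the Select for you.
public class VehicleProfile : Profile
{
    public VehicleProfile()
    {
        CreateMap<Vehicle, CarDTO>();
    }
}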
Be careful of exposing IQueryable
I touched on this point earlier, and it definitely deserves thought. It's for this reason that some people don't like to expose IQueryable results from their repository classes, and I'll explain why. One of the things you can do with an IQueryable is call the Include() method, which essentially tells your database to join other tables into your query. I'm sure you can understand why that could be a very bad thing: you return an IQueryable<Customer> and you could be inadvertently giving your consumers the green light to include the User table, maybe the Passwords table, in fact pretty much any related table. Not only can this result in some disastrously bad SQL queries being executed on your database, you can also expose highly sensitive information and hand over far more control of your data access than you ever intended. Be very careful about how and to what you give access to IQueryable instances. They should only ever be dealt with on internal layers: never pass them back to consumer access layers, and ensure that anyone with access to the codebase understands what they're doing when they're writing code against them. You are essentially handing over deep-level database control to whoever has access to them. As a general rule, they should not pass the boundary of the business logic layer; always project them into domain objects / DTOs and remember to have the query executed using something like ToList(), Single(), First() or Last() before they're returned.
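As a hedged illustration of the risk (the Customer, User and Credentials types and the repository method here are purely hypothetical), an IQueryable that escapes the repository lets any caller pull in related data you never intended to ship:

// Requires: using Microsoft.EntityFrameworkCore;
// If the repository leaks IQueryable<Customer>, nothing stops a consumer doing this:
IQueryable<Customer> customers = customerRepository.GetActiveCustomers();

var leaked = customers
    .Include(c => c.User)                 // joins the User table
        .ThenInclude(u => u.Credentials)  // ...and drags password hashes along with it
    .ToList();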
Different execution = (potentially) Different results
Last but not least, you need to be aware that with deferred execution, the query runs each time the IQueryable is enumerated, and that can mean you get different results. If the data in your database changes between executions, or you're using a dynamic date parameter, or some other moving target, you may get a different result set. In Rider, and in Visual Studio with ReSharper installed, you'll be warned about possible multiple enumerations, and this is a potential side effect of that.
Now you may want that - if you want to refresh your data, for example - but in many cases you won't. There's absolutely no rule here, other than to be aware of the consequences. At the very least, you need to be aware of making multiple calls to the database when you don't need or intend to; remember, database calls are one of the most expensive operations in the application lifecycle, and you're defeating the object of making that process as lean as possible if you start doubling up on it.
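A quick sketch of the pitfall, using the same imaginary Vehicles query as above:

IQueryable<Vehicle> cars = Vehicles.Where(x => x.VehicleType == "Car");

int total = cars.Count();   // first database call: SELECT COUNT(*) ...
List<Vehicle> page = cars
    .Take(10)
    .ToList();              // second, separate database call

// If rows were added or removed between the two calls, 'total' and 'page'
// describe different snapshots of the data. If you need a consistent view,
// materialise once (e.g. with ToList()) and work with the result in memory.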