Pure SQL
[Printable PDF, 5 pages, 220 kb]
Three steps for better performing and more scalable database access
The late 1990s and the beginning of the 21st century is
the era of in-memory representation of database information. Dozens of toolkits
have been made available to provide code-centric abstractions for your
relational data. The object-oriented view of data was the highest goal.
Databases were nothing more than a cumbersome means for persistence - just dumb
data stores. But are these tools - and the associated thought patterns
- the
best choice for highly scalable and high-performing applications?
Essentially all databases in wide use today have been optimized to work with
tabular data and its relations. SQL itself is designed around set-based
operations and not for the manipulation of single records. This view is in
direct opposition to the object-oriented world, which is based on working with
distinct objects or single entities. Even though you can use a diverse set of
collections to represent relationships and sets in your object-oriented
applications, one question remains: Does it really make sense to keep and modify
in-memory copies of your data in high-transaction environments?
Highly Scalable
UPDATEs and the End of Automated Tools
In fact, it is quite likely that you will
reach some limits of any automated tool - DataSets, DataAdapters or custom
object/relational mapping code - as soon as your application requires highly
scalable transactional database access. But dont take my word for it - let me
instead work with you on a sample application in a number of iterations.
As the
basis for the suggestions in this article, Ill look at a small fragment of an
application for maintaining article information, inventory levels and orders.
This application has to track an article ID, its name and the available
inventory for each article. When an order is placed, your program has to verify
the articles inventory level, because your business analysts identified a
business rule that states that orders must only be accepted for articles which
are in stock. In addition, this available amount has to be decremented if the
order is accepted.
Step One: Separate Your Data
In the first iteration, some
developers may consider creating a table named Article with fields ArticleID,
Name and UnitsInStock. In fact, even sample databases like Northwind - which
is shipped with every instance of SQL Server - are designed in the same way.
I believe that this is wrong. To create more scalable systems, you should design
your database with more than just strict normalization in mind. You should also
identify the transaction volume for each piece of data and group them
accordingly. In the application above, this means that you should actually
create two tables: Article, with fields ArticleID, Name, and other static
information. And a second table Inventory, with fields ArticleID and UnitsInStock.
This allows your application always to work with article information without
being affected by transaction locks on the Inventory table.
After identifying
the best table structure, you now have to choose the right database-access
strategy for your code. Should you map your data to .NET objects? DataSets? Or
just use DataCommands and their Execute* methods? I tend to recommend a very
pragmatic solution for most applications: For all pieces of data that are mostly
read - and that seldom participate in high-transactional UPDATEs
- simply choose
the most comfortable approach. According to your personal preferences and
project focus, this could include the use of an object/relational mapper or
DataSet.
If you decide to use an object/relational mapper, Id recommend that
you do not implement this functionality on your own. There are a number of
excellent products available, with costs ranging from zero to a few hundred USD
or EUR per developer. I know that the temptation is high to build your own O/R
mapper, but this is a complex task which does not usually bring you closer to
solving the business requirements for your application.
For all kinds of data
that are changed with high transaction counts, however, you should consider
either manually creating all UPDATE and DELETE statements or using appropriately
optimized stored procedures. Please bear in mind that automatically generated
SQLs or stored procedures usually will not allow you to achieve the highest
possible levels of scalability and performance. The reason for this is that the
underlying tools have to generate very generic SQL, making application-specific
optimizations impossible. After all, only the developer of the application - you
- knows
about its detailed database access requirements.
A Short Trip to Optimistic
Concurrency
One of the core problems of automatically generated database-access
methods is that they usually offer a generic, optimistic-concurrency model. The
main assumption behind these optimistic models is that most of the time only one
user will change a given piece of data. Corrective action will only be taken in
the negative case - if a previous change by another user is detected during the
update.
This assumption is perfectly valid for most near-static data, such as
articles and customers. This optimistic approach, however, will have very
negative effects if you are working with data that are changed in
high-transaction environments. It can be the cause of excessively long database
locks, or even deadlocks.
But what exactly is an UPDATE for optimistic
concurrency, and why is it bad for transactions?
When changing data in this way,
the first step is to load the current value of a record into some in-memory
representation, such as DataSet. After this, your application can modify the
data to represent the desired final state that should be stored in the
database after your transaction. As soon as your application decides to keep the
change, it generates an UPDATE statement, in which the WHERE clause contains not
only the records primary key but also a timestamp, or the previous values of
the changed fields.
Before showing you some SQL examples, let me say one more
thing: I am going to show you literal SQL without the recommended definition of
query parameters. I do this solely for illustrative purposes, as doing so makes
it easier to talk about these statements. In your actual application, you should
always use query parameters, stored procedures, or at least check all your input
data before passing it to your database server. You need to do this at least to
protect yourself against SQL-injection security attacks. The principles I
discuss below apply equally to the creation of these prepared queries or stored
procedures.
If your Products table for article 42 shows a current inventory
level of 15 and you would like to store an order for five pieces, SQL similar to
the following would be generated: UPDATE Inventory SET UnitsInStock = 10 WHERE
ArticleID = 42 and UnitsInStock = 15. The second part of the WHERE clause (UnitsInStock
= 15) ensures that the UPDATE fails - or returns zero hits - if some other user
modifies the amount in the meantime. In this case, your application receives an
Exception and has to reload the record, re-apply the changes and try again with
a new UPDATE statement.
This is the first major disadvantage of using
automatically generated SQL with data modified in high transaction counts: If a
large number of users are running the same application - modifying the same data
- these additional re-SELECT/re-UPDATE cycles will happen quite often.
Step Two:
Reduce Deadlocks by Defining a Specific UPDATE Strategy!
Lets further suppose
that two purchase orders are entered at the same time. The first PO is for two
pieces each of articles 42, 43 and 44. The second PO is also for two pieces of
each article, but in reverse order: 44, 43, and 42. Both purchase orders are
stored at the same time in two transactions.
Processing starts with the first
order and the UPDATE of article 42, which is successful. After this - depending
on the time delay - the second order will change the inventory level of article
44. Subsequently, article 43 will be changed according to the first PO. As a
next step, your application will try to change article 43 again, this time to
reflect the changes of the second PO. This UPDATE will not succeed, because the
corresponding record is currently locked by the first transaction. Consequently,
the processing of the first order will continue with an attempt made to update
its last article. This change, however, will also fail, as article 44 is still
locked by the second transaction. This is a classic deadlock scenario: The first
transaction waits for article 44, the second for article 43. Each record is
locked by the other transaction. In most modern databases, this situation is
automatically detected; one transaction is designated the deadlock victim and
aborted. Your application receives a corresponding Exception and has to restart
the transaction from the beginning.
The interesting question is whether these
deadlocks can be avoided, and if not, whether the likelihood of their occurring
can be reduced.
A general way to avoid similar situations is to define and
follow an update strategy. In the application above, it would be enough simply
to execute the UPDATEs sorted by the affected articles IDs. In that case, you
would always UPDATE the articles in the sequence 42, 43, 44 - no matter what
sequence users followed when entering the elements of the two purchase orders.
It therefore does not cause a deadlock if the processing of the second PO hits
the lock of article 42. The first PO will be completed successfully, because
neither article 43 nor article 44 is locked. The lock on article 42 will be
removed after completion of the first transaction, allowing the second PO also
to be stored successfully without any need for a retry.
This guideline can be
slightly generalized to say that resources should always be accessed in the same
sequence. This includes not just single records, but also access to database
tables in general.
In most cases, you can achieve this kind of UPDATE strategy
only with an explicit, application-specific database-access code. Automated
tools generally will process all rows in the exact sequence in which they have
been entered, leading inevitably to a database code thats prone to deadlock.
But still: If you are following an in-memory, optimistic concurrency approach as
described above, there is yet another drawback. When the second order is
processed, your application has to send additional SELECTs and UPDATEs to
compensate for the changed inventory level. The total database interaction for
these two purchase orders consists, therefore, of nine SELECTs and nine UPDATEs.
Step Three: Reduce the Number of Database Interactions
The original reason for
the first SELECT was the need to read the current inventory level, to verify the
stock level (according to the business rule), and to decrement the in-memory
representation reflecting the new stock level after the transaction.
Interestingly enough, this entire operation can be handed over to the database.
If, for example, you had 15 pieces of article 42 in stock and received an order
for four pieces, you could (in the worst case) work as follows: You send a
SELECT to the database, transform the result into a DataSet using a DataAdapter,
change the inventory level in the DataSet to the new value of 11, and finally
use a DataAdapter to send a corresponding UPDATE statement to the database. This
requires two database round-trips (in the positive case, two more if someone has
changed the value in the meantime), and two (or again, four) transformations to/from
in-memory XML in the DataSet.
Lets think the unthinkable: How about not using
any in-memory representation of the inventory level? Instead of using a DataSet,
you would hypothetically only call the necessary SQL statement or stored
procedure directly via an IDbCommand object. Instead of pushing your data
through SELECT->DataAdapter->DataSet->DataAdapter->UPDATE, you would send one
simple SQL statement to your database.
Sound too good to be true?
Actually I
would suspect that this is the way most people designed their software before
Resultsets, Recordsets, DataSets, and similar constructs became available. In
fact, you can bring about the same behavior simply by sending this SQL statement
to your database: UPDATE Inventory SET UnitsInStock = UnitsInStock - 4 WHERE
ArticleID = 42 AND UnitsInStock >= 4. This statement decrements the inventory
level (UnitsInStock = UnitsInStock - 4) after verifying the current stock level
(UnitsInStock >= 4). It is also highly scalable, as it does not introduce
further SELECT/UPDATE cycles, which would prolong the duration of the
transaction.
After executing this statement, its return value will contain the
number of affected rows. If this number equals 1, you immediately know that the
business rule has been satisfied, and that the inventory level has been
successfully decremented. You only have to send an additional SELECT statement
if you received 0 as the return value, which means that there were not enough
units in stock when the original statement was executed. In fact, if you use SQL
Server, you could even send a batched statement consisting of the UPDATE and the
SELECT to acquire the necessary data in just one round trip. It is important to
note that a change in the inventory level initiated by a different user in the
meantime will not require additional SELECT/UPDATE statements. Different
processing is necessary only if the business rule has not been satisfied.
When
processing two orders at the same time - as in the example above
- this allows
you to reduce the number of database interactions from 18 SQL statements (nine
SELECTs plus nine UPDATEs) to just six UPDATEs.
This leads to shorter
transaction times, resulting in the quicker release of database locks. As a
consequence, your application will exhibit better performance and scalability.
And all this just by performing three easy steps: separating your data according
to transaction volume, choosing an UPDATE strategy to eliminate deadlocks, and
reducing the number of database round trips with application-specific, optimized
SQL.
About Abstractions
When creating software today, we have the advantage of
being able to rely on a number of abstractions. All these abstractions hide
complexity; without them, software would not be as easily created as it is
today. It would simply not be possible or feasible to implement business
software based on x86 assembly language.
Every abstraction shields us from the
complexity of the underlying environment. What we tend to forget from time to
time, however, is that there is usually a number of reasons for the underlying
complexity. If you look at databases in particular: as soon as you demand
performance and high scalability, you will need to know about SQL optimizations,
and indexing and locking strategies. Most of these things are hidden when you
use certain abstractions. If you - as an architect - choose the wrong abstraction, a
developer might not be able to use these optimizations.
But if this is true,
theres one thing which really worries me: If an architect has to select a
matching level of abstraction, doesnt this also mean that s/he also has to know
about all the arcane details of SQL Server locking? Or, as a client of mine put
it after a long discussion of deadlocking and blocking differences for Oracle
and SQL Server: Do we really have to be database experts?
How do you make
these decisions in your projects? Who determines which level of abstraction to
use? Who has to know about SQL Server details: the architect or the developer?
Im looking forward to your ideas and opinions at
ingo.rammer@thinktecture.com.
My company, thinktecture, helps software architects and developers who need
to implement .NET and Web services solutions. We provide project-specific
training, consulting, coaching, and emergency troubleshooting services to help
you create architectures and applications which live up to the demands. If
matters are critical, one of us can usually be on site within 48 hours
throughout Europe. Drop me an email at
ingo.rammer@thinktecture.com
to talk about the details!
|