Monthly Archives: April 2013

History tracking

Chances are that if you design a system you also have the requirement of tracking data changes, also known as history tracking or audit trailing. The mechanism of capturing data changes can be achieved in a number of ways, each one of them having its own advantages and disadvantages. This article is a summary of my research on this topic.

For the purpose of this article, I will be using a simplified BlogEntry table, with the following columns:

h1

Only tracking last changes

Sometimes it’s enough to just track the creation and modification information of data, instead of the full historical information. In that case, you can simply add columns LastModifiedBy, LastModificationDate, CreatedBy and CreationDate to all of your your tables and make sure they are filled in correctly when a new row is added or modified. For deletions, in most cases you don’t want to actually delete the row in your database, but change its state to inactive (soft delete).

Such table could look like:

h2

Full tracking using history tables

Of course, if you want to track the actual data changes and keep a full history of revisions that happened, we need to do more. In this case we want to track:

  • The information that was changed.
  • The date on which the information was changed.
  • The user that changed the information.

We have two approaches here:

  • Shadow history approach: the history tracking information is stored in a referenced table, which can be:
    • A separate history table for each table being audited.
    • A consolidated history table for all tables being audited.
  • Live history approach: the history information is stored in the same data table as the business data.

Shadow history

A separate history table for each table being audited

A history table can be a clone of the original table with additional columns to track what user did the change and when he did it:

h3

With every change that happens, a new entry that describes the change (insert, update, delete) is added to the history table; and the actual row in the original table is modified. There are no constraints in the history table.

Advantages:

  • Good performance: you can query fast live and historical data.
  • Cleaning up historical data is easy – just delete rows older than a specific date.

Disadvantages:

  • You end up with a lot if duplicate data without registering what actually did change; so you would have to compare data to know what fields exactly changed.
  • It potentially could double the amount of tables (bad for maintenance).
  • The structure of this table is tied to the original table, so every change in the original table structure results in a change in the history table structure.
  • Any change in a table results in a change in its history table.
  • Entities and their historical counter parts are different, so different logic has to be implemented.

This would be a good solution in case you want to be able to quickly access the real data, but not if working with historical data is an important part of your application.

A consolidated history table for all tables being audited

You can add one generic history table using the Amorphous Table technique as follows:

h4

With every change that happens, a new entry that describes the change is added to the generic history table; and the actual row in the original table is modified. There are no constraints in the history table.

Advantages:

  • You only have one extra table that holds history information of all of your tables, so less work.
  • Also, it separates concerns better: it captures what changed, nothing more and nothing less.
  • It takes less space due to the fact that it stored partial history instead of the entire structure.
  • Cleaning up historical data is easy – just delete rows older than a specific date.

Disadvantages:

  • It’s harder to reconstitute a record to a given date (although once you have it, the same logic can be used for all tables).
  • The values stored must be of a generic type, so there is casting involved.
  • Entities and their historical counter parts are different, so different logic should be implemented.

This would be a good solution in case you want to be able to access the real data quickly and easy, but not if working with historical data is an important part of your application.

Live history

In this approach the history information is stored in the same data table as the data itself. If data is modified or deleted, the original record is not updated; instead a new record is created that holds the latest information. The new (most recent) record becomes the live record (active), while all previous records form the history.

h5

Advantages:

  • No new database tables have to be created.
  • Entities and their historical counter parts are the same, so the same logic can be used.

Disadvantages:

  • You mix historical data with live data, which can make queries more complex.
  • Tables may become very big, so without the right optimization strategy (indexes) performance can be poor.
  • Cleanup of historical data might become complex (especially if they have references to live data).

This would be a good candidate if you are dealing with low data volumes, historical data has a big meaning in the application and is heavily used.

Some principles

Where to put tracking logic

The places we could implement the tracking logic are:

  • In the database itself, using triggers.
  • In the application logic, using code.

Doing this in application logic has the advantage that you have the full business context and you can also audit non-database operations, which may be important as not all data might be database centric. But you have to be sure that your database is not altered directly, you cannot bypass the application layer otherwise there will be no auditing!

Doing this as part of a trigger requires you to also put needed business context in the database, but the advantage is that auditing will always work, even if data is changed directly in the database without using the application layer, like from a different application, from a SQL command line, some script or directly from within SQL Server Management Studio.

So what to choose? As always, it depends… in my opinion, if you are concerned with people bypassing the audit trail by updating the data in the database directly, you may have other (operational) problems that have to be solved first. Auditing is a business requirement, so it belongs to your business part of the application, and therefore not in the database. If you have a business layer, it should not be bypassed, otherwise it makes no sense to have this layer after all.

Rolling back transactions

What about rolling back transactions? In the case of auditing in application logic, if the transaction is rolled back then every historical change is rolled back too. In the case of triggers, every trigger that got executed as part of the transaction is also rolled back. So both mechanisms work correctly when it comes to rolling back transactions.

Rules

Some rules that you may want to apply:

  • Tracking information cannot be modified. From the moment historical data is in the database, it cannot be altered by any component.
  • Aim for maximum performance: use indexes to improve the speed of data retrieval operations.
  • Only use historical data on tables that actually require tracking. Doing this on all tables may be overkill.
  • By default, prefer soft deletes, otherwise you loose historical data.

How to choose?

Before making a decision on what strategy to use in your case, you must analyze your requirements and constraints:

  • Will you be using historical data on a regular basis or will it be used rarely?
  • Is historical data used for reporting purposes, or will it be used to revert data to a previous state?
  • How much data is involved?
  • Should historical data be cleaned regularly?

It’s an exercise that should be done accurately, as it is a fundamental design decision that cannot be changed easily later on.

Advertisements