Wednesday, May 21, 2014

The object-relational impedance mismatch

In order to describe the object-relational impedance mismatch we need to define the architectural assumptions that underlie the problem:
  • the application interacts with data managed by a relational database;
  • the application is written in an object-oriented language (Java, C#, Ruby, Python, et cetera);
  • the application data model is implemented as a set of classes written in the programming language of choice.
This architecture has a big appeal in theory. The promise is that you can interact with data using your preferred programming language instead of being forced to juggle two different languages: SQL to talk to the database and your OO language for everything else.
Since relational databases only speak SQL, you are forced to create a layer of indirection between your class-based model and the corresponding relational design managed by the RDBMS. Normally this is achieved through a tool called an ORM (Object-Relational Mapper). There are many ORMs on the market with various degrees of sophistication (e.g. Hibernate, Microsoft Entity Framework, Ruby's ActiveRecord, ...). These tools mainly automate the repetitive task of implementing these functions (a concrete sketch follows below the list):
  • create a class for each table in the relational schema, or create a table for each class in the OO model (I'm simplifying here, the mapping may not be one to one);
  • supply you with suitable factory methods to create instances;
  • supply getter methods to access data coming from the relational table columns;
  • supply methods to perform queries that return single instances of classes materialized from data in the database, or collections of them;
  • create methods that allow you to navigate the data model following the inter-table foreign keys, retrieving instances or collections of related objects;
  • create setter methods to update the in-memory representation of a model object;
  • manage sessions (isolation of multiple users making concurrent changes);
  • update the database from in-memory data and manage transactions.
Armed with this functionality you can write any business logic you can imagine using only your OO programming language, largely ignoring the RDBMS. Here I'm not considering fundamental aspects like data model evolution, which significantly complicate the matter.
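To make the list above concrete, here is a minimal sketch of what such a mapping typically looks like, written against JPA-style annotations (the kind of API Hibernate implements). The Order and OrderLine classes and the table and column names are hypothetical, not taken from any real schema:

    import java.math.BigDecimal;
    import java.util.List;
    import javax.persistence.*;

    // Hypothetical ORDERS table mapped to a class (one class per table).
    @Entity
    @Table(name = "ORDERS")
    class Order {
        @Id @GeneratedValue
        private Long id;

        @Column(name = "CUSTOMER_NAME")
        private String customerName;

        // Navigation along the foreign key ORDER_LINES.ORDER_ID.
        @OneToMany(mappedBy = "order")
        private List<OrderLine> lines;

        public Long getId() { return id; }
        public String getCustomerName() { return customerName; }
        public void setCustomerName(String name) { this.customerName = name; }
        public List<OrderLine> getLines() { return lines; }
    }

    // Hypothetical ORDER_LINES table, one row per line of an order.
    @Entity
    @Table(name = "ORDER_LINES")
    class OrderLine {
        @Id @GeneratedValue
        private Long id;

        @ManyToOne
        @JoinColumn(name = "ORDER_ID")
        private Order order;

        @Column(name = "AMOUNT")
        private BigDecimal amount;

        public Order getOrder() { return order; }
        public BigDecimal getAmount() { return amount; }
        public void setAmount(BigDecimal amount) { this.amount = amount; }
    }

With these classes in place, something like em.find(Order.class, 42L) materializes one row as an object, and order.getLines() silently issues the query that follows the foreign key. That convenience is exactly where the trouble described below begins.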

What's wrong with this approach?
Let's first explore a symptom of the problem. 
ORMs are like drugs. They start by charming you with their ease of use. As your confidence grows, you keep writing more code. Then, when you are addicted to them, they betray you. Everything starts with you writing the basic CRUD functionality of your application. For simple CRUD transactions running on a developer workstation, response time is on the order of milliseconds. The developer normally does not perceive the latent danger, since a few milliseconds seem like a very fast and reasonable response time, and other application components are normally the real response-time bottleneck at this scale. But when you start implementing more complex transactions that read and modify a significant amount of data, all of a sudden you experience seriously degraded response times. And when your application goes into production, things get nastier. Latencies increase because the database is on another machine, probably with a firewall in between. This exacerbates the problem, causing additional delays and unsatisfied users.

The fundamental reason why this architecture is weak is that it forces the programmer to re-implement database processing tasks in the application server tier. Since classes are mapped to the normalized relational model, the application, in order to reconstruct the user view of a document, is forced to act as a query processor. Classes are rigid containers: reading an object means reading all of the table columns for a single row (I know some ORMs allow some optimizations, but those are weak solutions). You may only need one column value from the object, but you have to materialize all of it, since your unit of retrieval is the class. Normally a well-designed relational model spreads information across many tables in order to correctly implement the application business rules (which generally also means minimizing the cost of updates). So in practice your application, while it may be easy to write with a handy navigational API, will be fetching lots of rows by primary key over the network. With this approach the database optimizer is out of service, since it only sees a very small part of what you are trying to accomplish in your transaction as a whole.
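To illustrate, consider a hypothetical report that sums the line amounts of a customer's orders, written against the navigational API of the sketch above. It typically degenerates into one query per order (the well-known N+1 pattern), while the equivalent SQL is a single statement the optimizer can plan as a whole:

    import java.math.BigDecimal;
    import java.util.List;
    import javax.persistence.EntityManager;

    public class OrderTotals {

        // Navigational style: each call to getLines() can trigger another
        // "SELECT ... WHERE ORDER_ID = ?" round trip over the network.
        static BigDecimal totalForCustomer(EntityManager em, String customer) {
            List<Order> orders = em
                .createQuery("select o from Order o where o.customerName = :c", Order.class)
                .setParameter("c", customer)
                .getResultList();                     // 1 query for the orders
            BigDecimal total = BigDecimal.ZERO;
            for (Order o : orders) {                  // + N queries, one per order
                for (OrderLine line : o.getLines()) {
                    total = total.add(line.getAmount());
                }
            }
            return total;
        }

        // The same result as one statement the RDBMS can plan as a whole:
        //   select sum(l.amount)
        //   from orders o join order_lines l on l.order_id = o.id
        //   where o.customer_name = ?
    }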
You are reinventing the database in your application program:
  • every time you sort data you fetched directly or indirectly from the RDBMS;
  • every time you use a Map to aggregate detail information;
  • every time you use a Set to eliminate duplicates;
  • every time you fetch related data using accessors (application level join);
  • every time you create a new collection by filtering an existing collection;
  • every time you construct a non-persistent object that is the projection of data coming from one or more persistent classes;
  • every time you create a new collection by concatenating multiple collections.
Every time you do these things, and a few others, with data fetched from the RDBMS, you are acting as a human query optimizer, hand-coding the data processing your own way (a sketch of this pattern follows below). The problem is that your code does not have access to the data at the same speed as the RDBMS: accessing one row from the application server is at least two orders of magnitude slower than doing it from inside the RDBMS kernel.
It is like having a shopping list and going to the supermarket and back home for each item on it, instead of picking up the whole list in a single round trip (this analogy is not mine).
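Here is a sketch of what this reinventing looks like in practice, again using the hypothetical classes from above: a GROUP BY rebuilt with a HashMap, an application-level join rebuilt with nested loops, and an ORDER BY rebuilt with an explicit sort, all running far from the data:

    import java.math.BigDecimal;
    import java.util.*;

    public class ReinventedGroupBy {

        // Application-side aggregation: a HashMap plays the role of GROUP BY,
        // nested loops play the join, and the explicit sort replays ORDER BY,
        // all on rows already dragged over the network.
        static List<Map.Entry<String, BigDecimal>> totalsByCustomer(List<Order> orders) {
            Map<String, BigDecimal> totals = new HashMap<String, BigDecimal>();
            for (Order o : orders) {
                BigDecimal sum = totals.get(o.getCustomerName());
                if (sum == null) {
                    sum = BigDecimal.ZERO;            // hand-rolled null handling
                }
                for (OrderLine line : o.getLines()) { // application-level join
                    sum = sum.add(line.getAmount());
                }
                totals.put(o.getCustomerName(), sum);
            }
            List<Map.Entry<String, BigDecimal>> result =
                new ArrayList<Map.Entry<String, BigDecimal>>(totals.entrySet());
            Collections.sort(result, new Comparator<Map.Entry<String, BigDecimal>>() {
                public int compare(Map.Entry<String, BigDecimal> a,
                                   Map.Entry<String, BigDecimal> b) {
                    return b.getValue().compareTo(a.getValue());
                }
            });
            return result;
        }

        // What the RDBMS would have done in place, with statistics and indexes:
        //   select o.customer_name, sum(l.amount) as total
        //   from orders o join order_lines l on l.order_id = o.id
        //   group by o.customer_name
        //   order by total desc
    }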

Also, RDBMS data processing algorithms have been fine-tuned over years and years of development. For example, sorting data in the application program is not even in the same league as sorting the same data in the RDBMS kernel, at least in my experience.
The RDBMS also knows about your data, since it collects lots of statistics about your tables and columns. It can evaluate thousands of different approaches to answering your query or performing your update and choose the one it estimates to be the cheapest. Chances are that on average it will outperform you as a programmer. And optimizers improve with each new RDBMS release.
If you have a multi-tenant production system, chances are your data distribution will differ significantly from one tenant to another. Given the same request, the RDBMS optimizer will be able to change the algorithm used to process it from tenant to tenant. Your hard-coded program, instead, won't be able to adapt.
Another big problem with this architecture is that on the application server you normally have limited resources at your disposal, in particular when running inside a virtual machine like a JVM. Processing large amounts of data inside a JVM causes garbage collection pressure that further degrades the response time perceived by end users.

The impedance mismatch is not only a performance problem. It is clearly also a software maintenance problem. You write in an imperative language what you could instead declare in SQL. SQL is a language that needs practice and has many flaws, but it also has many good qualities. I will talk about this in detail in a future post, but it is clear that if you can avoid coding imperative loops and if-then-else branches, and avoid handling mutable variables, object state, and NullPointerExceptions, that is a big win for software quality.
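As a small illustration of the maintenance point, using the same hypothetical OrderLine class as before, compare an imperative discount routine with the declarative statement it replaces:

    import java.math.BigDecimal;
    import java.util.List;

    public class DiscountExample {

        // Imperative version: a loop, a branch, mutable state, and a latent
        // NullPointerException if getAmount() can ever return null.
        static void applyDiscount(List<OrderLine> lines, BigDecimal threshold) {
            for (OrderLine line : lines) {
                BigDecimal amount = line.getAmount();
                if (amount != null && amount.compareTo(threshold) > 0) {
                    line.setAmount(amount.multiply(new BigDecimal("0.90")));
                }
            }
        }

        // Declarative version: one statement, no loop, no branches, no nulls
        // to babysit; the RDBMS decides how to execute it.
        //   update order_lines
        //   set amount = amount * 0.90
        //   where amount > ?
    }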

A dual data model representation is difficult to manage. If your data management code is embedded in the application, you can't take advantage of it when you have to interact directly with your data in the RDBMS. It is also more difficult to expose data to other applications that need to interact with the database, for example for reporting and analysis purposes, because derived data needs to be stored, or you won't have the derivation code available. Debugging data problems becomes more difficult. Data and schema migration becomes more difficult. You need to do everything through application code.

Alternative approaches to this application architecture have their own challenges and merits. I will explore them in future posts.
