Tuesday, June 17, 2014

The benefits of PostgreSQL side JSON parsing

In my previous post I talked about the benefits of directly rendering a JSON response from a PostgreSQL stored function or SQL statement in general.
In this post I want to deal with the opposite requirement: parsing a JSON string directly inside the database engine, mainly for the purpose of performing a write transaction. I want to show the benefits of doing so with an admittedly simplified example.
There are two main advantages in parsing JSON inside the database in order to update it:
  • you can save several round trips from your application to the database if all the data needed for the transaction is sent to the database in a single shot. The savings depend directly on the complexity of the JSON you need to process: the more complex the JSON, the more we gain from this technique, mainly in terms of reduced latency;
  • you can expose JSON collections as relations inside your transaction and process the transaction in a set-oriented manner, the way relational databases prefer to process data. This means you can get much better performance.
So, enough talking, let's describe our test setting. The example is customer order acquisition. Orders are assumed to come from a web application somewhere; clients send them back in JSON format. An example JSON order is:

{
  "cust_id": 223, 
  "ord_date":"2014-06-02", 
  "payc_id": 22, 
  "rows": [{
     "rowno": 1, 
     "prod_id": 24, 
     "qty": 2, 
     "price": 5.50
    },{
     "rowno": 2, 
     "prod_id": 27, 
     "qty": 1, 
     "price": 12.50}]
}

Given this, the application server logic is trivial. The server takes the JSON coming with the web request and sends it as a parameter in a SQL call to PostgreSQL, presumably after performing some sort of authentication and validation. Normally the application server does not even need to parse the JSON string it receives as a request parameter. This setting is also completely stateless: once the database request has been processed, no state needs to be kept on the server in order to respond to the next request from the same client. The same is true when we render JSON from inside the database, as described in the previous post.
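In SQL terms the whole interaction boils down to a single parameterized statement. This is only a sketch of what the driver would send, assuming a positional bind parameter; json_test is the function defined later in this post:

SELECT json_test($1::json);  -- $1 is the raw JSON string received from the client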
The database tables used in the example represent a simple design in order to keep things tractable.
Here is the script that creates the tables and populates them with sample data:


CREATE TABLE paycond(
    paym_id serial NOT NULL PRIMARY KEY,
    paym_desc text NOT NULL );
INSERT INTO paycond(
    paym_desc)
SELECT
    'Payment method #' || s
FROM
    generate_series(1, 50) AS g (s);
CREATE TABLE cust(
    cust_id serial NOT NULL PRIMARY KEY,
    cust_name text NOT NULL,
    paym_id INTEGER NOT NULL REFERENCES paycond);
INSERT INTO cust(cust_name, paym_id)
SELECT
    'Cust #' || s, ceil(random() * 50)
FROM
    generate_series(1,1000) AS g(s);

CREATE TABLE ord_hdr(
    ord_id serial NOT NULL PRIMARY KEY,
    ord_date date NOT NULL DEFAULT CURRENT_DATE,
    cust_id INTEGER NOT NULL REFERENCES cust,
    paym_id INTEGER NOT NULL REFERENCES paycond,
    ord_amt numeric(12, 2) NOT NULL DEFAULT 0);

CREATE TABLE prod(
    prod_id serial NOT NULL PRIMARY KEY,
    prod_name text NOT NULL,
    prod_price numeric(8, 2) NOT NULL);
INSERT INTO prod(
    prod_name,
    prod_price)
SELECT
    'Prod #' || s, round((random()* 100)::numeric, 2)
FROM
    generate_series(1, 1000) AS g(s);
CREATE TABLE ord_details(
    ord_id INTEGER NOT NULL REFERENCES ord_hdr ON DELETE CASCADE,
    rowno SMALLINT NOT NULL,
    prod_id INTEGER NOT NULL REFERENCES prod,
    qty numeric(8, 2) NOT NULL,
    price numeric(8, 2) NOT NULL,
    amt numeric(12, 2) NOT NULL,
    CONSTRAINT ord_details_pk PRIMARY KEY (ord_id, rowno));

The code for the (PL/PgSQL) function that processes the order acquisition is:

CREATE FUNCTION json_test(val json) RETURNS INTEGER AS $$
BEGIN
    WITH hdr_data AS(
        SELECT (val ->> 'cust_id')::INTEGER AS cust_id,
            (val ->> 'ord_date')::date AS ord_date,
            (val ->> 'payc_id')::INTEGER AS paym_id,
            val -> 'rows' AS details),
    rows_coll AS (
        SELECT json_array_elements(details) AS r
        FROM hdr_data),
    details AS (
        SELECT (r ->> 'rowno')::SMALLINT rowno,
            (r ->> 'prod_id')::INTEGER prod_id,
            (r ->> 'qty')::DECIMAL qty,
            (r ->> 'price')::DECIMAL price
        FROM rows_coll AS r),
    hdr AS (
        INSERT INTO ord_hdr(
            cust_id,
            ord_date,
            paym_id )
    SELECT
        cust_id,
        ord_date,
        COALESCE(paym_id, (SELECT paym_id FROM cust WHERE cust_id = hdr_data.cust_id))
    FROM hdr_data RETURNING ord_id)
    INSERT INTO ord_details(
        ord_id,
        rowno,
        prod_id,
        qty,
        price,
        amt )
    SELECT
      ord_id,
      rowno,
      prod_id,
      qty,
      COALESCE(price, prod.prod_price),
      COALESCE(price, prod.prod_price) * qty
    FROM
      hdr,
      details
      JOIN prod USING(prod_id);

    UPDATE ord_hdr
    SET ord_amt = (
      SELECT sum(amt)
      FROM ord_details
      WHERE ord_id = currval('ord_hdr_ord_id_seq'::regclass))
    WHERE ord_id = currval('ord_hdr_ord_id_seq'::regclass);
    
    RETURN currval('ord_hdr_ord_id_seq'::regclass);
END;
$$ LANGUAGE plpgsql;

This code needs some explanation. It takes advantage of several PostgreSQL features, some of which are also available in other databases. First, the parameter is a JSON value. The caller can send a string cast to json; the string is parsed and passed to the function as a JSON value. An example call is:


SELECT json_test(
        '{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{
        "rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},{
        "rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json);

Then we use a chain of CTEs (Common Table Expressions) to decompose our JSON message into normalized terms that can be used as relations. To generalize the approach, let's consider our input JSON to be a master-detail structure, potentially with many detail sections containing further detail sections: a tree of sub-relations that starts from a root node (the approach also works if the root node is itself a collection, but the code is a bit different in that case).
The first thing we want to do is take the root and expose the various data items at this level as the expected data types, in order to use them as input values to an INSERT statement. PostgreSQL offers the ->> operator, which accesses a single member of a JSON value and returns it as text. So, for example, val ->> 'cust_id' returns the cust_id value in the JSON as text. We can then cast that text to the appropriate data type, as in (val ->> 'cust_id')::integer, to get the final input value.
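To make the operator concrete, here is a minimal, self-contained illustration of ->> followed by a cast (the literal reuses a fragment of the sample order above):

SELECT ('{"cust_id": 223}'::json ->> 'cust_id')::integer AS cust_id;
-- returns 223 as an integer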
For each collection embedded in the root node we want to expose it as a JSON array value. For this we use the -> operator, which returns the JSON value corresponding to the referenced member. So in our case val -> 'rows' returns a JSON array of the order rows.
This covers the first CTE. Once we have the JSON for a collection, in order to be able to use it we need to transform it into a relation of JSON objects. This is the task accomplished by the second CTE, rows_coll. Here we use the json_array_elements function to perform the transformation. Now that we have a relation of JSON objects we can apply the algorithm recursively to expose these objects as a flattened relation using the ->> operator. This is done in the third CTE, details, where we apply the same logic we used for the root to the order rows. The pattern can be repeated to any depth.
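Here is a minimal illustration of -> combined with json_array_elements, again on a fragment of the sample order; it yields one row per order row:

SELECT json_array_elements(
    '{"rows": [{"rowno": 1}, {"rowno": 2}]}'::json -> 'rows') AS r;
-- two rows: {"rowno": 1} and {"rowno": 2}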
Now that we have proper master and detail relations, we can use them to perform an INSERT into the ord_hdr and ord_details tables. Here we use another PostgreSQL feature: the ability to expose DML statements as CTEs and to pipeline the output of one statement into the statements that follow it. We INSERT into the ord_hdr relation from the hdr_data relation we obtained in the first CTE. We can make the INSERT business logic as elaborate as we want by joining to other tables; for example, here we fill in the payment method recorded in the customer row when it is missing from the JSON order received from the client. We use the RETURNING clause of the INSERT statement to return the ord_id value assigned from the sequence associated with the serial primary key; we need it to fill the foreign key in the ord_details table. The next INSERT statement is the last member of this multi-operation SQL statement and inserts the rows as a set into the ord_details table, joining to the hdr CTE result, which in this case is a single-row, single-column relation containing the assigned order id. Incidentally, we also join to the prod table to default the price in case it is missing from the input JSON.
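Stripped of the order-specific logic, the pattern looks like this; parent and child are hypothetical tables used only to show the mechanism of piping an INSERT ... RETURNING into a following INSERT:

WITH new_parent AS (
    INSERT INTO parent(name)
    VALUES ('example')
    RETURNING parent_id)
INSERT INTO child(parent_id, note)
SELECT parent_id, 'first child'
FROM new_parent;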
Now, in order to complete our order we need to perform a last step. We want to store on the ord_hdr row the order total amount, which is the sum of the amt column over all the ord_details rows belonging to the same ord_hdr. We might think we could add an UPDATE operation to compute the sum right inside the current statement, but unfortunately this does not work: if we do, the ord_hdr.ord_amt column remains zero.
The reason is that Postgres processes the set of CTEs and the corresponding final INSERT operation as a single statement: the planner creates a single plan for the whole computation. While this is very good in performance terms, it has a necessary drawback. All the operations in the statement see the state of the database as it was when the statement started, so any change made by DML inside a CTE is not visible to the following CTEs or to the final operation. This means that, while we could add a RETURNING clause to the last INSERT (turning it into a CTE) to return the amt of the newly inserted rows, there would be no way to UPDATE the ord_hdr row we inserted before, since that row is not visible until the statement ends.
While we could in principle complicate things and compute ord_amt from the JSON details, my objective is to keep things manageable and readable, and to be able to automate the generation of this SQL from a representation of the underlying business rules. So I preferred to let the statement end and add a classical UPDATE statement that performs the required logic. Since this statement is independent, it has full visibility of the newly inserted rows and can perform the computation. There is only one small problem: outside the previous statement, the newly assigned ord_id is no longer directly available. Postgres, however, allows you to access the last value assigned from a sequence as currval('sequence_name'::regclass), where sequence_name is the name of the sequence associated with the ord_id serial column. This returns the last value assigned in the current session for the given sequence and is fully safe to use in a concurrent environment. So we can use this expression to access the last assigned ord_id, update our order and complete our transaction logic.
We return this ord_id to the caller to allow the client to fetch the processed order from the database.

Now you may want to know how fast this order processing function is. Performing this kind of evaluation is in general a tricky thing. On my development system, which sports an Intel i7 processor, 12 GB of RAM and a Patriot Inferno 128 GB SSD, I got these results:

First execution (no data in cache, function not compiled):

"Result  (cost=0.00..0.26 rows=1 width=0) (actual time=14.324..14.325 rows=1 loops=1)"
"  Output: json_test('{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{"
""rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},"
"{"
""rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json)"
"Total runtime: 14.348 ms"

Subsequent executions:

"Result  (cost=0.00..0.26 rows=1 width=0) (actual time=1.225..1.225 rows=1 loops=1)"
"  Output: json_test('{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{"
""rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},"
"{"
""rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json)"
"Total runtime: 1.236 ms"

My tests show a consistent total runtime of around 1 ms for repeated executions with this schema and content (all data is in memory). This is generally hard to beat in an ORM setting, where all this logic is scattered across many methods performing many SQL statements to get the same result. Here you pay a single round trip to the database (not accounted for in the timings above), while in an ORM setting you would probably need several, depending on the level of optimization you can afford to hack in. A single round trip normally costs much more than 1 ms in a production server setting.
I also wanted to measure the impact of JSON processing on the whole statement execution time. I used this SQL for the test:


EXPLAIN ANALYZE VERBOSE 
WITH t AS (
    SELECT '{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{
        "rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},{
        "rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json val),
testata AS (
    SELECT (val->> 'cust_id')::INTEGER AS cust_id,
        (val ->> 'ord_date')::date AS ord_date,
        (val ->> 'payc_id')::INTEGER AS paym_id,
        val -> 'rows' AS details
    FROM t),
rows_coll AS (
    SELECT json_array_elements(details) AS r
    FROM testata),
details AS (
    SELECT (r ->> 'rowno')::SMALLINT rowno,
        (r ->> 'prod_id')::INTEGER prod_id,
        (r ->> 'qty')::DECIMAL qty,
        (r ->> 'price')::DECIMAL price
    FROM rows_coll AS r)
SELECT *
FROM details;



All the CTEs are evaluated, and I measure an average execution time of 0.1 ms, which corresponds to a maximum of 10% JSON processing overhead.
Considering the savings in latency I expect this approach to be significantly faster in terms of response time and more scalable for many real world workloads.
In my opinion this approach is also simpler than an imperative, ORM-based solution for the same functionality. Still, this representation is not easy to manage at a larger scale, in terms of code volume and repetition, so my objective is to make this code a compilation target for a business-rules-based compiler I'm working on.
I will talk about this in a future post.

Saturday, May 31, 2014

The benefits of PostgreSQL side JSON rendering

Nowadays JSON has become the standard Web data interchange format. No matter whether your client runs in a browser or as a native mobile app, or you are accessing a web service from a server-side component, you are probably going to model and exchange data in the JSON format. JSON parsing and rendering is available in every modern programming language, and JSON has been the format of choice for NoSQL databases. JSON is also supported as a native datatype in several relational databases. In PostgreSQL there is a JSON datatype and a set of operators for constructing, accessing and processing JSON data from within the RDBMS.

What is compelling for my purposes is the ability to return a JSON document to the client directly from a SQL query that accesses relational data. This means that a complex JSON document that would otherwise require multiple round trips to the database can now be built in a single SQL statement. This can be 10 to 100 times faster than an equivalent ORM-based solution, depending on various factors, and in my opinion it is easier to code and maintain (you need to be proficient in SQL to appreciate this).
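As a tiny taste of what database-side rendering looks like, here is a self-contained one-liner (a minimal sketch; the realistic example follows below):

SELECT row_to_json(t)
FROM (SELECT 1 AS id, 'hello' AS title) AS t;
-- {"id":1,"title":"hello"}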

I wanted to create a fictitious example and, inspired by this other post by Dan McClain, I tried a similar schema with a radically simplified SQL query, to see how it performs.
I did not understand why Dan designed tags to be unique by note. In that case the JSON he wants to produce is somewhat redundant, since if tags are unique by note they can be embedded inside notes. If they aren't, why do tags have a note_id attribute?
In my design notes have tags but different notes can be tagged with the same tag. So there is an intersection table to manage the many-to-many relationship between notes and tags:


CREATE TABLE notes (
    note_id serial NOT NULL PRIMARY KEY,
    note_title text NOT NULL,
    note_content text NOT NULL );
INSERT INTO notes (
    note_title,
    note_content )
SELECT 'Note #' || s,
    'Lorem ipsum...'
FROM
    generate_series (1, 1000) AS g (s)
;
CREATE TABLE tags (
    tag_id serial NOT NULL PRIMARY KEY,
    tag_name text NOT NULL);

INSERT INTO tags (tag_name)
SELECT 'Tag #' || s
FROM generate_series (1, 100) AS g (s)
;
CREATE TABLE note_tags (
    note_id INTEGER NOT NULL REFERENCES notes,
    tag_id INTEGER NOT NULL REFERENCES tags,
    CONSTRAINT nota_pk PRIMARY KEY (note_id, tag_id));

INSERT INTO note_tags
SELECT
    note_id,
    tag_id
FROM
    notes,
    tags
WHERE random ( ) <= 0.1;

I'm going to fetch a page of 40 notes. I want to show only the tags that are referenced by the notes in the message. This JSON corresponds to a dataset that could not be extracted by a single SQL statement executed in the traditional manner. Normally it would take multiple queries, and a somewhat convoluted procedure, to materialize this JSON in an application program. With a vanilla ORM it may take up to 141 queries to materialize the data needed to produce this JSON. By preloading all the tags in a single query we can reduce the number of queries to 42. This works because we only have 100 tags, but it may not work with a larger data set, and the logic for filtering out only the referenced tags adds complexity to the solution. We could reduce the number of queries to 3 by fetching all the note_tags for the fetched notes in a single query, further complicating the code. Eagerly fetching note_tags together with notes may be simpler, but we get redundant data over the wire: each note is fetched about 10 times instead of once. This may cancel out the benefit of avoiding a nested query (not in this case, since we have few columns to read, but in a more realistic situation, where notes have many columns, it may actually make things worse).
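For reference, this is the kind of statement stream the vanilla ORM produces. The 141 figure is one plausible breakdown (1 + 40 + 100); the exact SQL an ORM emits will differ:

SELECT * FROM notes ORDER BY note_id LIMIT 40;   -- 1 query for the page of notes
SELECT * FROM note_tags WHERE note_id = $1;      -- repeated once per note (40 queries)
SELECT * FROM tags WHERE tag_id = $1;            -- repeated once per referenced tag (up to 100 queries)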
So, we can see that, on the application side, generating the required JSON is not trivial and may involve a lot of optimization decisions and hacks in the code to be efficient.
Now I present a PostgreSQL 9.3 query that returns the required result. It is far simpler, IMHO, than the solution presented by Dan:


WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    row_to_json (
        ROW (
            (
                SELECT
                    json_agg(t)
                FROM (
                        SELECT
                            DISTINCT tags.tag_id,
                            tags.tag_name
                        FROM
                            note_tags nt
                            JOIN tags USING (tag_id)
                            JOIN n USING (note_id)) AS t ),
            (
                SELECT
                    json_agg (n)
                FROM
                    n)))

On my desktop computer (Intel i7, 12 GB RAM), this query takes 5.1 milliseconds to execute. Interestingly, the cost of the JSON rendering itself is small, around 0.6 ms. This can be computed from the EXPLAIN ANALYZE VERBOSE output of the statement execution plan. To get a realistic response time you have to add the call latency between a hypothetical application server and the database, which in a well engineered network may be of the same order of magnitude (5 ms).
The most optimized application-level solution will be at least 6 times slower than this, due to latencies, redundant application-level processing and data format translations. But I presume most programmers will prefer to write more readable and maintainable code, so they may opt for the vanilla ORM solution, which may be 100 times slower. Maybe this goes into production unnoticed, and if the user population is small and tolerant it may actually be acceptable (users start perceiving latency when response time reaches about 0.7 seconds).
But the SQL solution is quite simple and enjoyable. It is 17 lines of code, with no mutable variables, loops, assignments or conditionals: completely declarative code that is quite easy to understand. The CTE pulls the notes, each one with its own array of tags (the array_agg call). The main SELECT constructs a JSON object from a SQL ROW made of two columns: the first gets all the referenced tags by joining with the CTE result and aggregating the tag rows as a JSON array; the second aggregates the CTE result itself as a JSON array. No messing with performance-related problems; let the database optimize the execution for us.
This solution is fast, scales well and is easy to program.
There is, however, a problem with this solution as of PostgreSQL 9.3. The ROW construct causes the output columns to be named f1 and f2 instead of tags and notes. I have not found a workaround, but 9.4 brings a plethora of new JSON functions. Among them are new constructor functions like json_build_object(VARIADIC any), which is exactly what we need in our case:

WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    json_build_object (
        "tags",
        (
            SELECT
                json_agg(t)
            FROM (
                    SELECT
                        DISTINCT tags.tag_id,
                        tags.tag_name
                    FROM
                        note_tags nt
                        JOIN tags USING (tag_id)
                        JOIN n USING (note_id)) AS t ),
        "notes",
        (
            SELECT
                json_agg (n)
            FROM
                n))

I've not tested this, anyway.
To conclude, JSON-powered relational databases open up interesting possibilities:
- significant performance improvements, by moving the code into the data processing layer;
- code simplification: just express the result you want and let the database do its magic.

In a future post I will explore the possibility of using JSON as a means to efficiently perform database updates. Stay tuned.


Wednesday, May 21, 2014

The object-relational impedance mismatch

In order to describe the object-relational impedance mismatch we need to define the architectural assumptions that underlie the problem:
  • the application interacts with data managed by a relational database;
  • the application is written in an object oriented language (Java, C#, Ruby, Python, et cetera);
  • the application data model is implemented as a set of classes written in the programming language of choice.
This architecture has a big appeal in theory. Your desire is to be able to interact with data using your preferred programming language instead of being forced to use two different languages: SQL to talk to the database and your OO language to do anything else.
Since relational databases only speak SQL, you are forced to create a layer of indirection between your class-based model and the corresponding relational design managed by the RDBMS. Normally this is achieved through a tool called an ORM (Object-Relational Mapper). There are many ORMs on the market, with various degrees of sophistication (e.g. Hibernate, Microsoft Entity Framework, Ruby's ActiveRecord, ...). These tools mainly automate the repetitive task of implementing these functions:
  • create a class for each table in the relational schema, or create a table for each class on the OO model (I'm simplifying here, the mapping may not be 1 to 1);
  • supply you with suitable factory methods to create instances;
  • supply getter methods to access data coming from the relational table columns;
  • supply methods to perform queries that return single instances of classes materialized from data in the database, or collections of them;
  • create methods that allow you to navigate the data model following the inter-table foreign keys, retrieving instances or collections of related objects;
  • create setter methods to update the in-memory representation of a model object;
  • manage sessions (isolation of multiple users making concurrent changes);
  • update the database from in-memory data and manage transactions.
Armed with this functionality you can write any business logic you can imagine just using your OO programming language and being able to ignore the RDBMS. Here I'm not considering fundamental aspects like data model evolution which significantly complicates the matter.

What's wrong with this approach?
Let's first explore a symptom of the problem. 
ORMs are like drugs. They start by charming you with their ease of use. As your confidence grows, you keep writing more code. Then, when you are addicted to them, they betray you. Everything starts with writing the basic CRUD functionality of your application. For simple CRUD transactions running on a developer workstation, response time is on the order of milliseconds. The developer normally does not perceive the latent danger, since a few milliseconds seem a fast and reasonable response time, and other application components are normally the real response time bottleneck at this scale. But when you start implementing more complex transactions that read and modify a significant amount of data, all of a sudden you experience seriously degraded response times. And when your application goes to production, things get nastier. Latencies increase because the database is on another machine, probably with a firewall in between. This exacerbates the problem, causing additional delays and unsatisfied users.

The fundamental reason this architecture is weak is that it forces the programmer to re-implement database processing tasks in the application server tier. Since classes are mapped to the normalized relational model, the application, in order to reconstruct the user's view of a document, is forced to act as a query processor. Classes are rigid containers: reading an object means reading all of the table columns for a single row (I know some ORMs allow some optimizations, but those are weak solutions). You may only need one column value from the object, but you have to materialize all of it, since your unit of retrieval is class-based. Normally a well designed relational model spreads information across many tables in order to correctly implement the application business rules (which generally also means minimizing the cost of updates). So in practice your application, while it may be easy to write with a handy navigational API, will be fetching lots of rows by primary key over the network. With this approach the database optimizer is out of service, since it only sees a very small picture of what you are trying to accomplish in your transaction as a whole.
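To make the point concrete, compare what a class-per-row fetch typically issues with what the task may actually need (orders, order_id and total_amount are hypothetical names, used only for illustration):

-- the unit of retrieval is the class, so the whole row comes back
SELECT * FROM orders WHERE order_id = $1;
-- what this step of the business logic actually needed
SELECT total_amount FROM orders WHERE order_id = $1;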
You are reinventing the database in your application program:
  • every time you sort data you fetched directly or indirectly from the RDBMS;
  • every time you use a Map to aggregate detail information;
  • every time you use a Set to eliminate duplicates;
  • every time you fetch related data using accessors (application level join);
  • every time you create a new collection by filtering an existing collection;
  • every time you construct a non-persistent object that is the projection of data coming from one or more persistent classes;
  • every time you create a new collection by concatenating multiple collections.
Every time you do these things, and a few others, in the context of data fetched from the RDBMS, you are acting as a human query optimizer, implementing data processing your own way. The problem is that your code does not have access to the data at the same speed as the RDBMS: accessing one row from the application server is at least two orders of magnitude slower than doing it from inside the RDBMS kernel.
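Most of the bullets above have a one-line SQL counterpart that the optimizer can plan as a whole. For example, the aggregation, join, filtering and sorting that would otherwise be re-implemented with Maps, accessors and in-memory sorts collapse into something like this (customers and orders are hypothetical tables, used only to show the shape):

SELECT c.customer_id, c.customer_name, sum(o.total_amount) AS revenue
FROM customers c
JOIN orders o USING (customer_id)            -- application-level join avoided
WHERE o.order_date >= date '2014-01-01'      -- collection filtering avoided
GROUP BY c.customer_id, c.customer_name      -- Map-based aggregation avoided
ORDER BY revenue DESC;                       -- in-memory sort avoided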
It is like having a shopping list and going to the supermarket and back home for each item in it instead of picking the whole list in a single round trip (this analogy is not mine).

Also, RDBMS data processing algorithms have been fine-tuned over years and years of development. For example, sorting data in the application program is not even in the same league as sorting the same data in the RDBMS kernel, at least in my experience.
The RDBMS also knows about your data, since it collects lots of statistics about your tables and columns. It can evaluate thousands of different approaches to answering your query or performing your update and choose the one that turns out to be the cheapest. Chances are that, on average, it will outperform you as a programmer. And optimizers improve with each new RDBMS release.
If you have a multi-tenant production system, chances are your data distribution will differ significantly from one tenant to another. Given the same request, the RDBMS optimizer is able to change the algorithm used to process it from tenant to tenant. Your hard-coded program, instead, cannot adapt.
Another big problem coming from this architecture is that on the application server you normally have limited resources at your disposal, in particular when running inside a virtual machine like a JVM. Processing large amounts of data inside a JVM poses garbage collection problems that further impact end user perceived response time.

The impedance mismatch is not only a performance problem. It is clearly also a software maintenance problem. You write in an imperative language what you could instead declare in SQL. SQL is a language that needs practice and has many flaws, but it also has many good qualities. I will talk about this in detail in a future post, but it is clear that if you can avoid coding imperative loops and if-then-else branches, and avoid handling mutable variables, object state and NullPointerExceptions, that is a big win for software quality.

A dual data model representation is difficult to manage. If your data management code is embedded in the application, you can't take advantage of it when you have to interact directly with your data in the RDBMS. It is also more difficult to expose data to other applications that need to interact with the database, for example for reporting and analysis purposes, because derived data has to be stored, since the derivation code is not available outside the application. Debugging data problems becomes harder. Data and schema migration become harder. You need to do everything through application code.

Alternative approaches to this application architecture have their own challenges and merits. I will explore them in future posts.




Wednesday, May 14, 2014

When you should pick a relational database

In this post I'd like to underline the reasons why you should choose to use a relational database for your next application, and, in contrast, the reasons why you shouldn't.

Not many years ago, applications fundamentally supported businesses in boring things like accounting, warehouse management and so on. In the 80s relational databases were still hot (and expensive) technology, but in the 90s their usage was a given. Any application that needed to process business data stored that data in relational form.
In the 90s we also saw the unstoppable growth of object oriented languages. OO languages allow you to create data models in the application programming language, independently of the database. This started the trend that followed.
Before OO languages, the tools we had to build applications were specifically tailored for relational data processing. Some examples: PowerSoft's PowerBuilder, Oracle Forms, Gupta SQLWindows, Borland Delphi, Microsoft Visual Basic. What these tools had in common was an approach where data processing was delegated to the RDBMS engine, often in the form of stored procedures, while the application simply displayed the data through a graphical user interface: the so-called client/server architecture.

With the advent of OO languages, programmers found it easier to model data in their programming language than to leave data processing to the database.
The reasons for this are still relevant today:
  • by implementing the data model in the application you can write code that is not dependent on the specific RDBMS vendor and your code can be easily ported from one RDBMS to another;
  • application code is far easier to debug than stored procedures;
  • you need not learn the intricacies of SQL...
In order to move data from application memory to the database, ORM tools, like Hibernate, were implemented (actually I wrote one myself). This encouraged the perception of simplicity that is prevalent today: using an ORM makes the database transparent and allows you to concentrate on coding the business logic.
It is now clear that things are not that simple. ORMs come with a lot of complexity, which is hidden when you start coding simple transactions but becomes relevant when your task is to perform non-trivial data processing on significant data volumes, or you need to scale to a large user population.
I'd like to talk about the impedance mismatch between OOP and the relational model in another post since it is a huge and generally misunderstood topic. Instead here I want to suggest a different perspective on when relational databases matter and why OO data models in application programs are often a bad choice.

The fundamental problem I see with the OO approach to data modelling is what I would call the "survival presumption". I have been developing business software for thirty years. I started using relational databases in 1984, when Oracle version 5 came out on an IBM PC AT, which had a 10 MB hard disk and 640 KB of main memory. You had to add an expensive memory expansion board to reach the 1.64 MB of RAM that Oracle required to run.
Since then I have kept using relational databases and, guess what, the things I have been able to carry from one application generation to the next are the data and the database schema. I have developed for dumb terminals, Unix workstations, MS-DOS PCs, Windows, Linux and browser-based applications in both the original and Web 2.0 styles, and I can tell you it is far more likely that in a few years you will replace your programming language than the database your business runs on. SQL is the only know-how I have been able to carry with me since 1984. For this reason, learning to use it well is the best skill you can invest in for your future.
The problem with modelling data in the application programming language is that it is short-sighted. Programmers with little experience are selfish and think databases are needed only to "persist" their beautifully designed objects. So, no wonder they find relational databases slow and inflexible for their needs. This explains in part the success of document-oriented databases: they allow programmers to persist their objects with basically zero overhead and minimal code, and retrieving the same objects is simple and amazingly fast.
If this is what you need, then you should definitely use document databases, or even flat files for that matter. Obviously document databases offer capabilities that file systems don't have.

But relational databases have a far greater scope than persisting objects. 
First, relational databases model data independently of application programs and of the programming languages used. This explains in part why it's harder, from a programmer's perspective, to interact with a relational database. But programmer convenience is not a higher priority than data independence (I think programmer convenience can be obtained in other ways; I will talk about this in the future). It is far more important to be free to interact with the database from any programming language and to be able to migrate to new technology in a seamless manner.
Data and metadata never become irrelevant, because data can be moved and reprocessed differently. Programs, on the other hand, have a well-defined life cycle: once they are technologically obsolete, they have to be replaced.
It is also far more important that multiple applications be able to share access to the same data concurrently and with different access patterns. If data is designed from the perspective of one application, it will not fit well the requirements of another application you design tomorrow that needs the same data in a different "shape". Relational databases are very capable of accommodating new requirements and new access mechanisms to existing data. Other kinds of databases are far less capable of this.
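For example, a new access pattern can often be served by just adding an index or a view, with no change to the existing writers (the names below are hypothetical):

-- a new application needs to look orders up by date
CREATE INDEX orders_by_date ON orders (order_date);
-- another one needs a flattened, read-only shape of the same data
CREATE VIEW order_summary AS
SELECT o.order_id, o.order_date, c.customer_name, o.total_amount
FROM orders o
JOIN customers c USING (customer_id);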
At the same time, it is far more important for users to be able to use general purpose tools to work with their data outside the application boundary. If the data consists of "serialized objects", the risk is that users will not be able to understand or use it outside the application that saved it.
Data must be designed for users, to survive programs and be adaptable.
Another reason relational databases are not just persistent object managers is constraints. A relational database consists of relations and constraints. Bad data is of no use, because when you have bad data you can't trust the results of your queries, and if you base your decisions on bad data you can make the wrong decisions.
Since relational databases are meant to share data access across applications, constraints should be implemented in the database itself, to ensure that, no matter which client is performing a transaction, data will be consistently validated and correctly updated. If we leave this to application programs, inconsistencies can arise from code duplication. There is also another valid reason to enforce constraints in the database: it is very difficult to guarantee the correct transaction isolation semantics in application programs. Database transactions should be written by knowledgeable people, once and for all. It is easier to guarantee the correctness of transactions under high concurrency when constraints are implemented in the database. And it should be clear that this approach also improves performance and scalability dramatically.
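A sketch of what "constraints in the database itself" can look like in practice, beyond the foreign keys shown in the earlier scripts (table and column names are hypothetical):

ALTER TABLE order_rows
    ADD CONSTRAINT order_rows_qty_positive CHECK (qty > 0);
ALTER TABLE orders
    ADD CONSTRAINT orders_amount_not_negative CHECK (total_amount >= 0);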
Yet today this approach is not common. 
I personally think it is a cultural problem coming mainly from the dominant education on OOP and imperative programming.
But it is also a problem that comes from the lack of good relational database development tools.
This is the focus of my current research and development activity and I will talk about this in future posts.