Saturday, May 31, 2014

The benefits of PostgreSQL side JSON rendering

Nowadays JSON has become the standard Web data interchange format. Whether your client runs in a browser or as a native mobile app, or you are accessing a web service from a server-side component, you are probably going to model and exchange data in the JSON format. JSON parsing and rendering are available in every modern programming language, and JSON is the format of choice of many NoSQL databases. JSON is also supported as a native datatype in several relational databases: PostgreSQL has a JSON datatype and a set of operators and functions for constructing, accessing and processing JSON data from within the RDBMS.

What is compelling for my purposes is the ability to return a JSON document to the client directly from a SQL query that accesses relational data. This means that a complex JSON document which would otherwise require multiple round-trips to the database can now be built in a single SQL statement. Depending on various factors, this can be 10 to 100 times faster than an equivalent ORM-based solution, and in my opinion it is also easier to code and maintain (you need to be proficient in SQL to appreciate this).
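
As a minimal sketch of what this looks like (not the example developed below, just an illustration): a single SELECT can already render the result of a query as a JSON document.

-- render the result of a query as JSON, entirely in SQL
SELECT row_to_json(t)
FROM (SELECT 42 AS note_id, 'A note title' AS note_title) AS t;
-- returns: {"note_id":42,"note_title":"A note title"}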

I wanted to create a fictitious example and, inspired by this other post by Dan McClain, I tried a similar schema with a radically simplified SQL query to see how it performs.
I did not understand why Dan designed tags to be unique per note. In that case the JSON he wants to produce is somewhat redundant: if tags are unique per note, they can simply be embedded inside the notes; if they aren't, why do tags have a note_id attribute?
In my design notes have tags, but different notes can be tagged with the same tag, so there is an intersection table to manage the many-to-many relationship between notes and tags:


CREATE TABLE notes (
    note_id serial NOT NULL PRIMARY KEY,
    note_title text NOT NULL,
    note_content text NOT NULL );
INSERT INTO notes (
    note_title,
    note_content )
SELECT 'Note #' || s,
    'Lorem ipsum...'
FROM
    generate_series (1, 1000) AS g (s)
;
CREATE TABLE tags (
    tag_id serial NOT NULL PRIMARY KEY,
    tag_name text NOT NULL);

INSERT INTO tags (tag_name)
SELECT 'Tag #' || s
FROM generate_series (1, 100) AS g (s)
;
CREATE TABLE note_tags (
    note_id INTEGER NOT NULL REFERENCES notes,
    tag_id INTEGER NOT NULL REFERENCES tags,
    CONSTRAINT nota_pk PRIMARY KEY (note_id, tag_id));

INSERT INTO note_tags
SELECT
    note_id,
    tag_id
FROM
    notes,
    tags
WHERE random() <= 0.1;
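
With random() <= 0.1 each of the 1000 notes ends up with roughly 10 of the 100 tags on average. A quick check, assuming the schema just created:

-- rough check of the average number of tags per note (expected to be around 10)
SELECT avg(tag_count)
FROM (
    SELECT count(*) AS tag_count
    FROM note_tags
    GROUP BY note_id ) AS per_note;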

I'm going to fetch a page of 40 notes, and I want to include only the tags that are referenced by the notes in that page. This is a result that could not be returned by a single SQL statement executed in the traditional, purely relational manner: normally it takes multiple queries, and a somewhat convoluted procedure, to materialize this JSON in an application program. With a vanilla ORM it may take up to 141 queries (one for the notes, one per note for its note_tags, plus one per distinct tag). By preloading all the tags in a single query we can reduce the number of queries to 42; this works because we only have 100 tags, but it may not work with a larger data set, and the logic for filtering out only the referenced tags adds complexity to the solution. We could reduce the number of queries to 3 by fetching all the note_tags for the fetched notes in a single query, further complicating the code. Eagerly fetching note_tags together with notes may be simpler, but then we get redundant data over the wire: each note's data is transferred about 10 times instead of once. This may cancel out the benefit of avoiding the extra queries (not in this case, since we have few columns to read, but in a more realistic situation, with many columns in notes, it may actually make things worse).
So we can see that, on the application side, generating the required JSON is not trivial and may involve a lot of optimization decisions and hacks in the code to make it efficient.
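
For reference, the three-query variant mentioned above might look roughly like this (the IN lists are placeholders the application would have to fill in; an ORM would generate something equivalent):

-- query 1: fetch the page of notes
SELECT note_id, note_title, note_content
FROM notes
ORDER BY note_id
LIMIT 40;

-- query 2: fetch the note_tags rows for the 40 notes just read
SELECT note_id, tag_id
FROM note_tags
WHERE note_id IN (1, 2, 3 /* ... the 40 note_ids from query 1 */);

-- query 3: fetch only the referenced tags
SELECT tag_id, tag_name
FROM tags
WHERE tag_id IN (1, 2, 3 /* ... the distinct tag_ids from query 2 */);

-- the application code then has to stitch the three result sets into the JSON document
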
Now I present a PostgreSQL 9.3 query that returns the required result. It is far simpler, IMHO, than the solution presented by Dan:


WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    row_to_json (
        ROW (
            (
                SELECT
                    json_agg(t)
                FROM (
                        SELECT
                            DISTINCT tags.tag_id,
                            tags.tag_name
                        FROM
                            note_tags nt
                            JOIN tags USING (tag_id)
                            JOIN n USING (note_id)) AS t ),
            (
                SELECT
                    json_agg (n)
                FROM
                    n)))

On my desktop computer (Intel i7, 12GB RAM), this query takes 5.1 milliseconds to execute. Interestingly, the cost of the JSON rendering itself is small, around 0.6 ms; this can be derived from the EXPLAIN (ANALYZE, VERBOSE) output of the statement. To get a realistic execution time you have to add the call latency between a hypothetical application server and the database, which in a well engineered network may be in the same order of magnitude (5 ms).
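
For completeness, the measurement can be reproduced by wrapping the statement above in EXPLAIN; comparing the total time with the time reported for the sub-plans gives a rough idea of the rendering overhead (same query as above, just prefixed):

EXPLAIN (ANALYZE, VERBOSE)
WITH n AS (
    SELECT note_id, note_title, note_content,
        (SELECT array_agg(tag_id) FROM note_tags WHERE note_id = notes.note_id) AS tags
    FROM notes ORDER BY note_id LIMIT 40 )
SELECT row_to_json(ROW(
    (SELECT json_agg(t)
     FROM (SELECT DISTINCT tags.tag_id, tags.tag_name
           FROM note_tags nt JOIN tags USING (tag_id) JOIN n USING (note_id)) AS t),
    (SELECT json_agg(n) FROM n)));
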
The most optimized application-level solution will be at least 6 times slower than this, due to latencies, redundant application-level processing and data format translations. But I presume programmers will normally prefer to write more readable and maintainable code, so they may opt for the vanilla ORM solution, which may be 100 times slower. Maybe this goes into production unnoticed, and if the user population is small and tolerant it may actually be acceptable (users start perceiving latency when response time reaches about 0.7 seconds).
But the SQL solution is quite simple and enjoyable: 17 lines of code, with no mutable variables, loops, assignments or conditionals. Completely declarative code that is quite easy to understand. The CTE pulls the notes, each with its own array of tags (the array_agg call). The main SELECT constructs a JSON object from a SQL ROW made of two columns: the first gets all the referenced tags by joining with the CTE result and aggregating the tag rows into a JSON array; the second is the aggregation of the CTE result itself into a JSON array. No messing with performance-related problems: let the database optimize the execution for us.
This solution is fast, scales well and is easy to program.
There is, however, a problem with this solution as of PostgreSQL 9.3: the ROW construct causes the output columns to be named f1 and f2 instead of tags and notes. I have not found a workaround, but 9.4 brings a plethora of new JSON functions, among them new constructor functions like json_build_object(VARIADIC "any"), which is exactly what we need in our case:

WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    json_build_object (
        "tags",
        (
            SELECT
                json_agg(t)
            FROM (
                    SELECT
                        DISTINCT tags.tag_id,
                        tags.tag_name
                    FROM
                        note_tags nt
                        JOIN tags USING (tag_id)
                        JOIN n USING (note_id)) AS t ),
        "notes",
        (
            SELECT
                json_agg (n)
            FROM
                n))

I have not tested this, though.
To conclude, JSON powered relational databases show interesting possibilities:
- significant performance improvements from moving the code into the data processing layer;
- code simplification: just express the result you want and let the database do its magic.

In a future post I will explore the possibility of using JSON as a means to efficiently perform database updates. Stay tuned.


Wednesday, May 21, 2014

The object-relational impedance mismatch

In order to describe the object-relational impedance mismatch we need to define the architectural assumptions that underlie the problem:
  • the application interacts with data managed by a relational database;
  • the application is written in an object oriented language (Java, C#, Ruby, Python, et cetera);
  • the application data model is implemented as a set of classes written in the programming language of choice.
This architecture has a big appeal in theory. Your desire is to be able to interact with data using your preferred programming language instead of being forced to use two different languages: SQL to talk to the database and your OO language to do anything else.
Since relational databases only speak SQL, you are forced to create a layer of indirection between your class-based model and the corresponding relational design managed by the RDBMS. Normally this is achieved through a tool called an ORM (Object Relational Mapper). There are many ORMs on the market, with various degrees of sophistication (e.g. Hibernate, Microsoft Entity Framework, Ruby's ActiveRecord, ...). These tools mainly automate the repetitive task of implementing these functions:
  • create a class for each table in the relational schema, or create a table for each class in the OO model (I'm simplifying here, the mapping may not be 1 to 1);
  • supply you with suitable factory methods to create instances;
  • supply getter methods to access data coming from the relational table columns;
  • supply methods to perform queries that return single instances of classes materialized from data in the database, or collections of them;
  • create methods that allow you to navigate the data model following the inter-table foreign keys, retrieving instances or collections of related objects;
  • create setter methods to update the in-memory representation of a model object;
  • manage sessions (isolation of multiple users making concurrent changes);
  • update the database from in-memory data and manage transactions.
Armed with this functionality you can write any business logic you can imagine using just your OO programming language, largely ignoring the RDBMS. Here I am not considering fundamental aspects like data model evolution, which significantly complicates the matter.

What's wrong with this approach?
Let's first explore a symptom of the problem. 
ORMs are like drugs. They start by charming you with their ease of use; as your confidence grows, you keep writing more code; then, when you are addicted, they betray you. Everything starts with writing the basic CRUD functionality of your application. For simple CRUD transactions running on a developer workstation, response time is in the order of milliseconds. The developer normally does not perceive the latent danger, since a few milliseconds seem a fast and reasonable response time, and other application components are normally the real response time bottleneck at this scale. But when you start implementing more complex transactions that read and modify a significant amount of data, all of a sudden you experience seriously degraded response times. And when your application goes to production, things get nastier: latencies increase because the database is on another machine, probably with a firewall in between. This exacerbates the problem, causing additional delays and unsatisfied users.

The fundamental reason why this architecture is weak is that it forces the programmer to re-implement database processing tasks in the application server tier. Since classes are mapped to the normalized relational model, the application, in order to reconstruct the user's view of a document, is forced to act as a query processor. Classes are rigid containers: reading an object means reading all of the table columns for a single row (I know some ORMs allow optimizations here, but those are weak solutions). You may only need one column value from the object, but you have to materialize all of it, since your unit of retrieval is class-based. A well designed relational model spreads information over many tables in order to correctly implement the application business rules (which generally also means minimizing the cost of updates). So in practice your application, while it may be easy to write with a handy navigational API, will be fetching lots of rows by primary key over the network. With this approach the database optimizer is out of service, since it only sees a very small piece of what you are trying to accomplish in your transaction as a whole.
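
As a purely hypothetical illustration (the table names are made up), the navigational access pattern typically degenerates into a stream of single-row lookups, while a single set-oriented statement would have let the optimizer see the whole task:

-- what the ORM typically sends: one lookup per navigation step
SELECT * FROM orders      WHERE order_id    = 1001;
SELECT * FROM customers   WHERE customer_id = 37;    -- order.customer accessor
SELECT * FROM order_lines WHERE order_id    = 1001;  -- order.lines accessor
SELECT * FROM products    WHERE product_id  = 5;     -- line.product accessor, repeated per line
-- ... and so on, one round trip each

-- what the database could have planned and executed as a whole
SELECT o.order_id, c.customer_name, l.quantity, p.product_name
FROM orders o
JOIN customers c   ON c.customer_id = o.customer_id
JOIN order_lines l ON l.order_id    = o.order_id
JOIN products p    ON p.product_id  = l.product_id
WHERE o.order_id = 1001;
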
You are reinventing the database in your application program:
  • every time you sort data you fetched directly or indirectly from the RDBMS;
  • every time you use a Map to aggregate detail information;
  • every time you use a Set to eliminate duplicates;
  • every time you fetch related data using accessors (application level join);
  • every time you create a new collection by filtering an existing collection;
  • every time you construct a non-persistent object that is the projection of data coming from one or more persistent classes;
  • every time you create a new collection by concatenating multiple collections.
Every time you do these things, and a few others, with data fetched from the RDBMS, you are acting as a human optimizer, writing the data processing your own way. The problem is that your code does not have access to the data at the same speed as the RDBMS: accessing one row from the application server is at least two orders of magnitude slower than doing it from inside the RDBMS kernel.
It is like going to the supermarket and back home for each item on your shopping list, instead of picking up the whole list in a single round trip (this analogy is not mine).
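
To make the list above concrete, here is a hypothetical query (same made-up orders schema as before) that performs in one declarative statement most of what the bullet points describe being re-implemented in application code:

SELECT c.customer_name,                              -- projection of just the columns needed
       count(DISTINCT o.order_id)     AS orders,     -- duplicate elimination plus aggregation
       sum(l.quantity * l.unit_price) AS total       -- the "Map used to aggregate detail"
FROM customers c
JOIN orders o      ON o.customer_id = c.customer_id  -- the "application level join"
JOIN order_lines l ON l.order_id    = o.order_id
WHERE o.order_date >= date '2014-01-01'              -- filtering a collection
GROUP BY c.customer_name
ORDER BY total DESC;                                 -- sorting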

Also, RDBMS data processing algorithms have been fine-tuned over years and years of development. For example, sorting data in the application program is not even in the same league as sorting the same data in the RDBMS kernel, at least in my experience.
The RDBMS also knows your data, since it collects lots of statistics about your tables and columns. It can evaluate thousands of different approaches to answering your query or performing your update and choose the one that turns out to be the cheapest. Chances are that, on average, it will outperform you as a programmer. And optimizers improve with each new RDBMS release.
If you have a multi-tenant production system, chances are your data distribution will differ significantly from one tenant to another. Given the same request, the RDBMS optimizer can change the algorithm used to process it from tenant to tenant; your hard-coded program won't be able to adapt.
Another big problem coming from this architecture is that on the application server you normally have limited resources at your disposal, in particular when running inside a virtual machine like a JVM. Processing large amounts of data inside a JVM poses garbage collection problems that further impact end user perceived response time.

The impedance mismatch is not only a performance problem; it is clearly also a software maintenance problem. You write in an imperative language what you could instead declare in SQL. SQL is a language that requires practice and has many flaws, but it also has many good qualities. I will talk about this in detail in a future post, but it is clear that if you can avoid coding imperative loops, if-then-else branches, mutable variables, object state and NullPointerExceptions, that is a big win for software quality.

A dual data model representation is difficult to manage. If your data management code is embedded in the application, you cannot take advantage of it when you have to interact directly with your data in the RDBMS. It is also more difficult to expose the data to other applications that need to interact with the database, for example for reporting and analysis purposes, because derived data needs to be stored, or the derivation code will not be available. Debugging data problems becomes more difficult. Data and schema migration become more difficult. You need to do everything through application code.

Alternative approaches to this application architecture have their own challenges and merits. I will explore them in future posts.




Wednesday, May 14, 2014

When you should pick a relational database

In this post I'd like to underline the reasons why you should choose to use a relational database for your next application, and, in contrast, the reasons why you shouldn't.

Not many years ago, applications fundamentally supported businesses in boring things like accounting, warehouse management and so on. In the 80's relational databases were still hot (and expensive) technology, but in the 90's their usage was a given: any application that needed to process business data stored it in relational form.
In the 90's we also saw the unstoppable growth of object oriented languages. OO languages allow you to create data models in the application programming language, independently of the database. This started the trend that came next.
Before OO languages, the tools we had to build applications were specifically tailored for relational data processing: the former PowerSoft PowerBuilder, Oracle Forms, Gupta SQLWindows, Borland Delphi, Microsoft Visual Basic, to name a few. What these tools had in common was an approach where data processing was delegated to the RDBMS engine, often in the form of stored procedures, while the application simply displayed the data through a graphical user interface: the so-called client/server architecture.

With the advent of OO languages, programmers found it easier to model data in their programming language than to leave the processing of the data to the database.
The reasons for this are still relevant today:
  • by implementing the data model in the application you can write code that is not dependent on the specific RDBMS vendor and your code can be easily ported from one RDBMS to another;
  • application code is far easier to debug than stored procedures;
  • you need not learn the intricacies of SQL...
To move data between application memory and the database, ORM tools like Hibernate were created (I actually wrote one myself). This encouraged the perception of simplicity that is prevalent today: using an ORM makes the database transparent and lets you concentrate on coding the business logic.
It is now clear that things are not that simple. ORMs come with a lot of complexity which stays hidden while you are coding simple transactions, but becomes relevant when your task is to perform non-trivial data processing on significant data volumes, or you need to scale to a large user population.
I'd like to talk about the impedance mismatch between OOP and the relational model in another post since it is a huge and generally misunderstood topic. Instead here I want to suggest a different perspective on when relational databases matter and why OO data models in application programs are often a bad choice.

The fundamental problem I see with the OO approach to data modelling is what I would call the "survival presumption". I have been developing business software for thirty years. I started using relational databases in 1984, when Oracle version 5 came out for an IBM PC AT with a 10 MB hard disk and 640 KB of main memory; you had to add an expensive memory expansion board to reach the 1.64 MB of RAM that Oracle required to run.
Since then I have kept using relational databases and, guess what I have been able to carry from one application generation to the next: the data and the database schema. I have developed for dumb terminals, Unix workstations, MS-DOS PCs, Windows, Linux and browser-based applications in the original and Web 2.0 styles, and I can tell you it is far more probable that in a few years you will be replacing your programming language than the database your business is running on. SQL is the only know-how that I have been able to carry over since 1984. For this reason, learning to use it well is the best skill you can invest in for your future.
The problem with modelling data in the application programming language is that it is short-sighted. Programmers with little experience are selfish and think databases are needed in order to "persist" their beautifully designed objects, so no wonder they find relational databases slow and inflexible for their needs. This explains in part the success of document oriented databases: they allow programmers to persist their objects with basically zero overhead and minimal code, and retrieving the same objects is simple and amazingly fast.
If this is what you need, then you should definitely use document databases, or even flat files for that matter. Obviously document databases offer capabilities that file systems don't have.

But relational databases have a far greater scope than persisting objects. 
First, relational databases model data independently from application programs and the programming languages used. This explains in part why, from a programmer's perspective, it is harder to interact with a relational database. But programmer convenience is not a higher priority than data independence (I think programmer convenience can be obtained in another manner; I will talk about this in the future). It is far more important to be free to interact with the database from any programming language and to be able to migrate to new technology in a seamless manner.
Data and metadata never become irrelevant, because data can be moved and reprocessed differently. Programs, on the other hand, have a well-defined life cycle: once they are technologically obsolete they have to be replaced.
It is also far more important that multiple applications be able to share access to the same data concurrently and with different access patterns. If data is designed from the perspective of one application, it will not fit well the requirements of another application you design tomorrow that needs the same data in a different "shape". Relational databases are very capable of accommodating new requirements and new access mechanisms to existing data; other kinds of databases are far less capable of doing the same.
At the same time it is far more important for users to be able to use general purpose tools to work with their data outside the application boundary. If the data consists of "serialized objects", the risk is that users will not be able to understand them or use them outside the application that saved them.
Data must be designed for users, to survive programs and be adaptable.
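
A small hypothetical sketch of the kind of adaptability meant here: the same base tables (made-up names again) can be exposed to a new reporting application in a different shape with a simple view, without touching the original schema or the original application:

-- expose existing order data as a monthly summary for a hypothetical reporting tool
CREATE VIEW monthly_sales AS
SELECT date_trunc('month', o.order_date) AS month,
       c.customer_name,
       sum(l.quantity * l.unit_price)    AS revenue
FROM orders o
JOIN customers c   ON c.customer_id = o.customer_id
JOIN order_lines l ON l.order_id    = o.order_id
GROUP BY 1, 2;
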
Another reason relational databases are not mere persistent object managers is constraints. A relational database consists of relations and constraints. Bad data is of no use, because when you have bad data you cannot trust the result of your queries, and if you base your decisions on bad data you can make the wrong decisions.
Since relational databases are meant to share data access across applications, constraints should be implemented in the database itself, to ensure that, no matter which client is performing a transaction, data will be consistently validated and correctly updated. If we leave this to application programs, inconsistencies can arise from code duplication. There is also another valid reason to enforce constraints in the database: it is very difficult to guarantee correct transaction isolation semantics in application programs. Database transactions should be written by knowledgeable people, once and for all. It is easier to guarantee the correctness of transactions in the presence of high concurrency when constraints are implemented in the database. And it should be clear that this approach also improves performance and scalability dramatically.
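
A minimal, hypothetical sketch of what declaring the rules in the database itself looks like, so that every client gets the same validation:

-- business rules declared once, enforced for every client and every transaction
CREATE TABLE accounts (
    account_id   serial PRIMARY KEY,
    account_code text NOT NULL UNIQUE,
    balance      numeric(12, 2) NOT NULL DEFAULT 0,
    CONSTRAINT balance_not_negative CHECK (balance >= 0) );

CREATE TABLE transfers (
    transfer_id  serial PRIMARY KEY,
    from_account integer NOT NULL REFERENCES accounts,
    to_account   integer NOT NULL REFERENCES accounts,
    amount       numeric(12, 2) NOT NULL CHECK (amount > 0),
    CONSTRAINT different_accounts CHECK (from_account <> to_account) );
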
Yet today this approach is not common. 
I personally think it is a cultural problem coming mainly from the dominant education on OOP and imperative programming.
But it is also a problem that comes from the lack of good relational database development tools.
This is the focus of my current research and development activity and I will talk about this in future posts.





Tuesday, May 13, 2014

Renumbering rows in SQL

Today I struggled with a seemingly simple problem that would be quite easy to solve in a traditional imperative programming language, but is not so immediate to solve in SQL, so I want to share my solution.

I have a master table and a master_lines table where lines should be numbered from 1 to N, where N is the count of lines per master.
For some reason that does not matter in this context, a program inserted lines for a given master leaving large gaps between line numbers, and I had to renumber those lines, keeping them in the same sort order, because I was getting numeric overflow errors on the line_number column, which had unfortunately been declared numeric(3, 0). The program kept increasing line_number, starting from the highest value read from the database and adding nearly random offsets from there (yes, an integer column would probably have avoided the overflow, but I could not change the database).
The user was complaining, and I had to apply a quick fix to let him continue entering new lines for that master row.

Since this was a PostgreSQL database I knew I could renumber the lines easily using the row_number window function. 
Window functions are a nice SQL feature that often lets you solve order-dependent problems in SQL. SQL is a set oriented language: its purpose is to process relations, which are sets, and sets have no predefined order, so it is generally difficult to express order-dependent logic in SQL. Window functions can compute values over windows of contiguous rows returned by the relational operations applied in a query. They are not relational algebra operators themselves, but they allow you to post-process the result of relational algebra operations.

Among the window functions, row_number returns an integer from 1 to N, where N is the cardinality of the result set (or, to be precise, the size of the window). This is exactly the functionality we need. Since the result of a window function depends on the sort order of the result set, we need to specify in which order we want to count the rows. This is achieved by adding a window specification to the statement, which can also be written inline with the function call (the OVER clause). This clause has many features, but for our purposes we only need to specify the required ordering. So we can write

row_number() over (order by line_number)

to request a sort by line_number and then return the position of the "current row" in the sorted set, counting from 1. To update the rows for a given master I wish I could write:

update master_lines o
set line_number = row_number() over (order by line_number) 
where master_nbr = '14/08925'
;

but this does not work, since SQL window functions are not supported in UPDATE statements. Semantically it would make sense, but it is not allowed: one example of the lack of orthogonality still present in SQL. To solve this I need to perform the operation in two steps, first computing the row numbers and then using that result to update the master_lines table. I could use an inline view for this, but I find it more elegant and readable to use a CTE (Common Table Expression).
A CTE is like an assignment in a functional programming language. If we consider that SQL operands are relations (Chris Date, forgive me for the imprecision, I know I should say that SQL allows duplicates and so on...), a CTE like

with t as (
  query
)

is nothing more than the assignment, to a final variable t, of an immutable value as defined by the result of executing the query. It is a relation in the relational model sense (an immutable value), so I can legally reference it in other parts of the same SQL statement. In our case we can write:

with t as (
  select line_number, 
    row_number() over (order by line_number) as pos
  from master_lines
  where master_nbr = '14/08925'
)
update master_lines o
set line_number = t.pos
from t
where master_nbr = '14/08925'
and o.line_number = t.line_number
;

This solution can be used whenever you want to renumber a set of rows according to some arbitrary sort order.
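
A quick way to verify the renumbering, assuming the same master_nbr as above: after the update, the number of lines and the maximum line_number should be equal, meaning the numbering now runs from 1 to N with no gaps.

-- sanity check: lines should equal max_line once the renumbering is done
select count(*)         as lines,
       max(line_number) as max_line
from master_lines
where master_nbr = '14/08925';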