Posted on

PHP + PostgreSQL App Performance Example

PHP and PostgreSQLOne of our projects this week was to improve the performance in a payment processing system for a client. The system performed well under limited data sets, but as the data set grew larger the performance response time increased exponentially. In reviewing the application we found several design flaws that were significantly impacting performance. This article outlines some of the issues we found.

Overview

The take-away from this article is that you should know the details of the implementation from ALL angles.  If you are a code junkie, then make sure you review and understand the database angle.  Same thing in reverse if you are a DB guru, take the time to understand the code.    Nobody can do an effective job of application design & implementation without understanding the big picture.

In this case there is way too much emphasis on putting all the business intelligence into the database.   While there is nothing wrong with that, and in fact that is often a preferred architecture, it was not well thought out and thus not well implemented in this case.   One of the bigger mistakes was putting the business intelligence into simple views versus using proper stored procedure designs.

Bottom line, sometimes the better solution is to put SOME intelligence in the data engine and move some of the extraction/looping and other processing logic on the code side.  ESPECIALLY in a case like this one where we know that the client will not, at least any time soon, be accessing the data from any applications outside the server-side PHP application that is implemented for them.  Thus we know we could put all the intelligence in the code, though that makes for a LOT more work if/when they decide to introduce things like mobile apps.

Lesson 1: Don’t Put Multiple Sub-Selects In A View

This is a simplified example from a simple view that was built in Postgres for the client.

SELECT a,b,
to_char(c_date, (select settings.value from settings where settings.id='date_format')) as c_date_str,
to_char(d_date, (select settings.value from settings where settings.id='date_format')) as d_date_str,
to_char(e_date, (select settings.value from settings where settings.id='date_format')) as e_date_str
...

This view is using a settings table which holds the date format.  The client can use the web interface to change the date format, which is stored in the settings table.   That is a good web app design.

Doing multiple queries to retrieve that date format in a single view is a bad design.   In the simple example we have above we end up hitting the database, and thus doing disk I/O, no less than FOUR TIMES for a single query.  That is going to be slow.

There are a myriad of better options here, here are the two options I would consider:

  • Move the sub-select into a stored procedure and turn it into a function.An intelligent design of that procedure will retain the initial data fetch in a global variable that is tested on each call, blocking future data I/O requests.   Data I/O is now 2calls v. 4+ for the view.
  • Return the raw data.Allow the code to format the strings.   The code can easily fetch the date format and apply the equivalent PHP formatting call ONCE and apply it to all raw data data.  This also cuts down the data I/O.Using raw data also increases the chances for the PostgreSQL engine to optimize the query via the internal heuristics engine.
In our application improvement the path taken was to avoid this view whenever possible.  As it turns out, this view is so complex and convoluted that there are often multiple shortcuts that get to just the few data elements that are needed.  Constructing new queries retrieved the necessary data without all the view complexities and data overload.In this case the view is so complex that is hampers performance throughout the application and has limited benefit.    The long term solution will be to break the view into a subset of stored procedures.  For the few cases where the larger complex view is actually viable we will see improved performance via an intelligent series of cascading stored procedures or code-side logic.

 Lesson 2: Use Parameter Binding

Any modern database and their related interfaces will support data binding.  If your database does not support this and you are building enterprise-class applications it is time to select a new data engine.  PostgreSQL has supported this for years.   Nearly all of the data interfaces, including PHP’s MDB2 interface have also supported parameter binding for years. With parameter binding you will get a significant performance boost when iterating over data, especially in a nested loop fashion.
In our example the code was doing something similar to this, simplified for instructional purposes:
$qry1 = 'SELECT v_id,a,b,c,d FROM my_first_view WHERE NOT paid';
$result = $database->db()->query($qry1);
$dataset1 = $result->fetchAll(MDB2_FETCHMODE_ASSOC); 
$datastack = $dataset1;
$qry2 = 'SELECT v_id,e,f,g FROM my_second_view WHERE NOT paid';
$result = $database->db()->query($qry2);
$dataset2 = $result->fetchAll(MDB2_FETCHMODE_ASSOC);
foreach ($dataset1 as $data1) {
    foreach ($dataset2 as $data2) {
        if ($data1['v_id'] == $data2['v_id']) { 
             $datastack['subdata'][] = $data2; 
        }
    }
}

There are several significant performance issues here.   To start with there is significant memory consumption as we need to collect ALL the data from the database into memory.  We then collate the data from two complete sets in memory to create a single data element.    There are much more efficient ways to do this without fetching all data in memory first.

The better option would be to fetch the data from dataset1 on a row-by-row basis and push the data onto the stack one record at a time.  The inner loop for dataset2 should then select a subset of data that is specifically for the matching v_id from the outer dataset1 loop.   This is where parameter binding comes in.

Here is the same loop in untested simplified code format, using parameter binding.  In our real-world example this one simple change increased performance more than 50x because the database can be much more intelligent about how it selects subsets of data from the database & the PHP overhead both in memory and stack management is significantly reduced:

// give us parameter binding in MDB2 please
$database->db()->loadModule('Extended'); 

// setup our queries
$qry1 = 'SELECT v_id,a,b,c,d FROM my_first_view WHERE NOT paid';
$qry2 = 'SELECT v_id,e,f,g FROM my_second_view WHERE v_id = ?';
// get the "outer" data
// since we know we will use all the "outer" data, just put it
// directly on the data stack, cutting this memory consumption in half
$result = $database->db()->query($qry1);
$datastack = $result->fetchAll(MDB2_FETCHMODE_ASSOC);

// still loop through outer data to drive the 2nd query
foreach ($datastack as $data1) {
    // Fetch the data2 subset of matching data as
    // a named array, only getting those records that
    // match on v_id... in essence an SQL JOIN done
    // via code
    $database->db()->extended()->getAll(
        $qry2,
        null,
        array($data1['v_id']),
        array('integer'),
        MDB2_FETCHMOD_ASSOC
        );
    // Now we attach each of the matching elements in
    // the "inner" data set to the data stack, attaching
    // it under the related v_id 
    //
    foreach ($dataset2 as $data2) {
             $datastack['v_id'][$data1['v_id']]['subdata'][] = $data2; 
    }
}

This can be further refined and improved per our discussion above, but you get the idea.  I’ll leave it to you to figure out how to further refine the process.

You may be asking “why didn’t you just do a simple JOIN in the database engine?”  Good question.  The real world example is much more complex than this and some of the data elements and views in play make that solution complex to maintain and causes the database engine to trip-up on the optimization and is actually SLOWER in our real world case.   Here we are simplifying to illustrate the general concept only.

 Summary

A couple of simple real-world examples of improving performance have been illustrated here.    When refactoring a real-world application there are often complex interactions that need to be researched & evaluated.  The solutions are rarely simple and often can be approached in several ways.   The options shown here are not necessarily the best solutions but are the solutions that were the quickest to implement while providing a significant performance increase to the client.

Finding the balance between results and costs is always a challenge from the business perspective.    From the technical perspective a similar balance is often necessary between database and code intelligence.  Knowing your environment as a whole will help guide you down the proper path toward improving application performance.