One of the more common architectural tasks when designing a web-based system that you expect to scale horizontally is deciding how to handle and store sessions. Each front-end server runs its own PHP install, and its session data is stored locally on disk instead of being shared between environments. This creates a split-brain problem between your servers.
Common solutions include forcing users to always be routed to the same web server (“sticky sessions”), so that they end up back on the server that initiated and is storing their session data, and using a common storage area for the servers’ session data. Sticky sessions can be a pain to set up, decrease the fault tolerance of your front-line web servers, can cause a bad user experience if users get routed to the “wrong” server, and don’t take full advantage of any load balancing you’re doing.
There are some load balancers that will store sessions at their layer of the architecture, but we’ll assume you’ve got a fairly small architecture or don’t want to keep bolting on complexity.
Since your application likely already has a common persistence layer (a la database) that you put thought into making fault tolerant, and you obviously chose CouchDB, we’re going to look at storing your session data there. Here’s a quick breakdown of how we’re going to do it:
- We’re going to use PHP’s session_set_save_handler(), which lets us provide callbacks (functions) for opening, closing, reading, writing, destroying, and garbage collecting sessions (see the sketch just after this list).
- Connecting to CouchDB will be done with Sag v0.4. Our session CRUD functions map nicely to HTTP’s verbs, which Sag directly exposes to us.
- Each session will be stored in its own document, using PHP’s session ID as the document’s _id. The document will also store when the session was last written to, for garbage collection purposes. We would be worried about index and database sizes when using large IDs, but garbage collection of expired sessions will constrain the file sizes nicely. And more sessions should mean more users, and therefore cash and fame, so that’s a nifty problem to have.
- The database name will be the session name (from session_name()). That means we can have different applications, or sub-applications, using this methodology more easily by having them use their own session names.
- Because there can be a lot of I/O operations on a session, we don’t want to have to go back to the database every time, especially when CouchDB’s MVCC architecture requires that we retrieve a document from the server before updating it. To protect against this we’re going to use Sag’s MemoryCache, which stores the document’s object in memory during the script’s execution. Sag’s caching uses ETags to cache the docs locally, much like your web browser does with web pages.
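For orientation, here is a minimal sketch of the wiring the first bullet describes. It is not the full class from the repository; the callback method names are assumed from the description above, so check the real code for the details.
<?php
// Minimal sketch only: register one static callback per session operation,
// then start the session so PHP routes its I/O through them.
require_once 'CouchSessionStore.php';

session_set_save_handler(
    array('CouchSessionStore', 'open'),
    array('CouchSessionStore', 'close'),
    array('CouchSessionStore', 'read'),
    array('CouchSessionStore', 'write'),
    array('CouchSessionStore', 'destroy'),
    array('CouchSessionStore', 'gc')
);

session_start();
?>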
The Code: class CouchSessionStore
Refer to the code here: http://gul.ly/zr (GitHub)
CouchSessionStore is set up to have little impact on your application’s code, exposing a series of static functions that act as PHP’s session CRUD callbacks. These callbacks are provided to PHP with session_set_save_handler() at the bottom of the file, after the class definition. One great place for improvement of this class would be to use the factory design pattern to set up CouchSessionStore and call session_set_save_handler(), moving this work out of the global scope.
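As a rough illustration of that suggested improvement (the factory class and method names here are hypothetical, not part of the repository):
<?php
require_once 'CouchSessionStore.php';

class CouchSessionStoreFactory
{
    public static function register($sag = null)
    {
        // Optionally hand over a pre-configured Sag instance (see setSag() below);
        // NULL keeps the default configuration.
        CouchSessionStore::setSag($sag);

        // Register the six static callbacks on demand instead of in the global
        // scope when CouchSessionStore.php is included.
        $callbacks = array();
        foreach (array('open', 'close', 'read', 'write', 'destroy', 'gc') as $op) {
            $callbacks[] = array('CouchSessionStore', $op);
        }
        call_user_func_array('session_set_save_handler', $callbacks);
    }
}

// The application now opts in explicitly:
CouchSessionStoreFactory::register();
session_start();
?>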
There’s an additional hook at CouchSessionStore::setSag($sag) that accepts an initialized Sag object. This means you can specify a different SagCache implementation, use your own server info and credentials, etc., overwriting the default configuration. If you pass CouchSessionStore::setSag($sag) NULL, it will revert back to its default Sag configuration. The only thing that you cannot change through this hook is the database name: CouchSessionStore will always set this to the PHP session name in lower case, decreasing the risk of bugs.
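For example, bootstrapping the store with your own Sag instance might look something like this. The host, port, and credentials are placeholders, the require paths depend on where Sag is installed, and the Sag calls (constructor, login()) are used as its documentation describes them.
<?php
require_once 'CouchSessionStore.php';
require_once 'Sag.php'; // path depends on your Sag install

// Point Sag at your own CouchDB server and credentials (placeholders here).
$sag = new Sag('couch.example.com', '5984');
$sag->login('sessionUser', 'secret');

// A different SagCache implementation could also be plugged in via $sag->setCache(...).

// Hand the configured instance to the session store; passing NULL instead
// reverts to the default configuration.
CouchSessionStore::setSag($sag);
?>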
If you really want to use a different naming scheme you can extend CouchSessionStore and re-implement setSag($sag), like this:
<?php
require_once 'CouchSessionStore.php';

class SuperCouchSessionStore extends CouchSessionStore
{
    public static function setSag($sag)
    {
        // Use CouchSessionStore to set everything up, so the stored Sag instance is $sag
        // (the method is static, so we use self::$sag rather than $this->sag)
        parent::setSag($sag);

        // Overwrite the baked-in database naming, creating the database in Couch if it doesn't exist
        self::$sag->setDatabase('super-database-name', true);

        // Obey our parent class's definitions
        return self::$sag;
    }
}
?>
Design Document Creation
One of the really neat things that we do is check whether our design document, the index that maps creation times to documents for easy garbage collection, exists when we open the session, creating it if it does not. This is a great example of the power a schemaless database gives you: we do not have to worry about deploying new schemas, or think too much about what the data will look like, before developing our application.
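Here is a sketch of that open-time check, with $sag being the store’s configured Sag instance. The design document name, view name, and map function are illustrative rather than the repository’s exact code, and it assumes Sag reports a missing document as a SagCouchException carrying the HTTP status code.
<?php
try {
    // If the design document already exists this is a cheap GET.
    $sag->get('/_design/couchSessionStore');
} catch (SagCouchException $e) {
    if ($e->getCode() != 404) {
        throw $e; // something other than "not found" went wrong
    }

    // First run against this database: build the garbage collection index.
    $ddoc = new stdClass();
    $ddoc->_id = '_design/couchSessionStore';
    $ddoc->views = new stdClass();
    $ddoc->views->byCreateTime = new stdClass();
    $ddoc->views->byCreateTime->map =
        'function(doc) { if(doc.createdAt) { emit(doc.createdAt, null); } }';

    $sag->put($ddoc->_id, $ddoc);
}
?>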
This also allows us to roll out new design document code as we develop our application, baking our “schema” and querying into our application’s versioning. For example, instead of just checking whether the design document exists, we could retrieve it and compare its map/reduce code to ours, sending the new code if it did not match. You could also define your application’s version in the design document and compare against that if you are worried about earlier code versions overwriting your newer versions’ code.
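A version stamp check could look roughly like this; the appVersion field and $APP_VERSION value are assumptions for illustration, and it assumes Sag’s get() hands back the decoded document on its body property.
<?php
$APP_VERSION = 7; // bump this when you ship new design document code
$newMapCode = 'function(doc) { if(doc.createdAt) { emit(doc.createdAt, null); } }';

$ddoc = $sag->get('/_design/couchSessionStore')->body;

if (!isset($ddoc->appVersion) || $ddoc->appVersion < $APP_VERSION) {
    $ddoc->appVersion = $APP_VERSION;
    $ddoc->views->byCreateTime->map = $newMapCode;

    // $ddoc still carries the _rev we just fetched, so CouchDB accepts the update;
    // older deploys fail the version check above and leave newer code alone.
    $sag->put($ddoc->_id, $ddoc);
}
?>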
Comments
Why in the world would you store session information in a database, never mind a production database where your core data is stored?!?!
That’s what memcache is for, and it can handle infinitely more IOPS than CouchDB, or any other database, without having to worry about impacting your database.
Even better, using PHP serialization, which the Memcache class does automatically, you can store objects in memcache for each session, not just a value.
Cheers
Randy
Hi Randy, thanks for your feedback. I’m always glad to see a rebuttal, because it means people are paying attention and there’s a chance to learn a new viewpoint.
There could be many reasons for storing session data in the database to solve the split brain problem. Chief among them is that most web applications already have a database and might not be able to add a new piece of technology to their stack: their environment could be locked, their IT department could have too much red tape, etc. The biggest problem is that adding a new piece of technology to your stack is expensive regardless of price tag because you are increasing the complexity of your system, adding to the list of skills you have to retain on staff, and making it less fault tolerant (more moving pieces).
This is why I said, “Since your application likely already has a common persistence layer (a la database) that you put thought into making fault tolerant, and you obviously chose CouchDB, we’re going to look at storing your session data there.” I was looking at the issue from a very common position – having to solve a problem with only the resources currently at your disposal. This methodology could be done with any database, but CouchDB plays so nicely with session data because both are schemaless and sessions map directly to documents. Historically people have often used MySQL for this, possibly under the impression that if you use one letter in LAMP you have to use them all, but that quickly gets messy.
As for serialization, PHP bakes this in for us: when it passes CouchSessionStore the session data, it’s already serialized. That’s why we can store it as a string in the document. I probably should have called this out more explicitly.
Lastly, Memcached is a great piece of software. However, there have already been articles written about it, storing session data in it, and connecting to it with PHP. One of Bocoup’s missions is to educate, so I figured I should give that a shot and talk about something that hasn’t been done before. I never meant to suggest that everyone should store session data in CouchDB and only CouchDB, but instead that doing so is super simple and a handy wrench for your toolbox.
Cheers,
Sam
I would argue that 99% of CouchDB users (or users of ANY database engine for that matter) probably didn’t put any thought into making it scalable or fault tolerant.
Add to that a number of database hits equal to or greater than the number of page and resource hits, and they will experience a database crash without the depth of understanding to quickly assess WHY it happened.
For example, did they take into consideration that using this example, session information gets loaded not only for every PAGE that gets loaded, but every graphic, every flash file, every RSS feed, every iframe, every javascript file, etc, that the browser subsequently comes back to the server to load?
Now, what should be ONE hit to the session cache per page is conceivably 10x, 20x or 30x that number of database hits, for NO good reason.
Your custom session handler should be smart enough to know when to hit the session cache, and when it’s not needed. Proposing a solution that uses CouchDB (or any other persistence layer) needs to address these issues, or at least raise them as issues that are explicitly not addressed.
Under load, can your database handle 30x the number of queries? Does it need to? What’s the cost to the company when the database starts backing up connections because it’s flooded with cache requests; when customers start leaving; when investors start asking why the site is always down?
As an architect, I would demand, at the very least, that there be a new instance of the database, on a different server, to handle session information and nothing else.
That said, IMHO, it would be sheer insanity to use a database to store session information UNLESS you plan to mine that session information for analysis, and even then, you should only save summary information to the database, not detail.
If you’re going to do that, make sure your database maintains tables in memory and only periodically flushes to disk, and your performance will be significantly improved.
Just my $0.025 (adjusted for inflation)
Cheers,
Randy
Super dumb comment: any developer of modern JavaScript applications uses Ajax to call services, and those services are called only when needed, not to load everything on the page… That was well known even 8 years ago.
Hi Randy,
I’m not sure what you mean by your first statement, because the whole point of databases is to hold data, handle faults, and scale. I would recommend that you take a look at the CAP theorem, because these considerations are taken into account at the beginning of designing any database. Feel free to be more specific and I’ll try to respond.
Not every hit generates I/O: PHP sessions only start when you call session_start(), unless you set its configuration to auto-start them (not a recommended setting). I/O to CouchDB is further reduced in this case by using one of Sag’s caching mechanisms, meaning that we don’t have to read from the database every time we want to write.
I/O is again reduced by the fact that the majority of our queries are to specific resources – we only use map/reduce for garbage collection.
“For example, did they take into consideration that using this example, session information gets loaded not only for every PAGE that gets loaded, but every graphic, every flash file, every RSS feed, every iframe, every javascript file, etc., that the browser subsequently comes back to the server to load?”
This is not the case. PHP sessions only run when PHP runs, which does not happen for images, scripts, etc. You might be getting confused by the fact that the browser will still send the PHP session cookie to the web server, which is why you should be hosting your static assets on an alternative sub-domain: the unnecessary cookies won’t be sent.
As to your other points, I would again point out that this methodology – regardless of the database technology being used – has been useful to myself and others in plenty of situations. Most likely if you’re running into the types of traffic numbers you describe you won’t be storing your session data in the database, but you are also more likely to be able to insert a new piece of technology into your stack. Again, this solution is useful for architectures that can’t do that and need to re-use an existing piece of their stack.
Cheers,
Sam
CAP Theorem? Really? I now see the error of my ways… I presumed we were operating in the real world, where 16 terabyte arrays use a single controller card that fails randomly, and network fault tolerance doesn’t REALLY exist because the three points of failure you put in place to ensure power continuity to the routers didn’t fail in the proper order, taking down a SAN switch in the middle of all those critical writes.
While I can see *why* you make this argument, I contend that databases are NOT designed to scale or be fault tolerant. They are designed to efficiently read and write data to and from a controlled storage medium, assuming that the data is properly indexed and compartmentalized. It’s the additional systems sitting on top of that functionality that provides the scalability and fault tolerance, assuming no limit to bandwidth, and no intolerable faults.
If they were, there would be no need for additional mechanisms that provide redundancy, as the databases would all have them inherently available. Log shipping would never have been developed, or mirroring, or any other “scaling” mechanism, as the database would just magically do it for you. They may be “intended” to be when designed, but then developers get in the way and muck the whole thing up.
You are correct that PHP will not automatically load the session on every asset load, except when you modify it, as we have, to pre-load a script (not an implicit session_start) and perform some preliminary processing to ensure that the session is handled accurately, etc. We do this with the auto_prepend_file directive, which we employ to ensure that the environment is stable and sandboxed, crawlers are dealt with, etc., prior to any pages or resources getting loaded or databases getting hit.
That’s a whole ‘nother talk.
I simply contend that if you’re going to make the argument that it’s easier to add servers and let distributed transactions and sharding handle the additional load of session management, perhaps applying Occam’s Razor would be a more efficient application of one’s time: Use memcache for session management and you never have to worry about what effect storing and retrieving session information to and from your database is having on your ability to manage the *important* data.
For small tasks, sure, store it in the database. But be sure to leave your home number for the next guy to call when you’ve left the company, and traffic has grown to an unmanageable level because of the shortcuts that were used without consideration of their impact on future performance.
Cheers,
Randy
Hi Randy,
It seems that you’re trying to apply the article to your specific setup and have found that it won’t work well for you. That’s fine: no one suggested that there was one configuration to rule them all, or that this article is how everyone builds their systems. Maybe you’ll find this sort of solution useful on a different project, but in the meantime I’m happy for you because it sounds like you’ve got a huge budget, tons of hardware, and free rein over your tech stack.
I don’t know why you’re calling PHP scripts when people request static assets, but then I don’t presume to know your application.
As to your personal philosophies about databases, they sound more like a description of a generic storage mechanism, but in the end they have little to do with this article.
Cheers,
Sam