Shift8 Creative Graphic Design and Website Development

mongodb

MongoDB in the Cloud

Posted by Tom on Sun, Sep 18 2011 12:58:00

So I'm growing tired of configuring servers. I simply don't have the time...Between designing sites/apps and then actually coding them on top of all sorts of project management, dealing with clients, getting paid, etc. There's just no time with all the projects I work on. So I'm looking more and more toward various PaaS solutions (platform as a service). I've been looking at RedHat's solution as well as VMWare's CloudFoundry and Orchestra.io as well. Only Orchestra has support for PHP currently (aside from RedHat's service) as well as support for MongoDB with PHP. I'm sure I'll have a comparison/review/my two cents for those services later.

However, today I'm going to talk about two hosted MongoDB solutions that I've come across. When I was at the MongoSV conference the other year I met some of the vendors there and MongoHQ really stuck in my head. There was also another company there, Mongo Machine. They go about their pricing differently. Mongo Machine is more cost up front so I haven't tried it to be honest. I'm not sure I will either. If I'm at the point of putting in that much money, then I'm going to just host my own database on my own servers. 

A side note. Hosting your own MongoDB (single or cluster) is going to likely yield better performance, especially with your cost/performance ratio. Plus you gain control over what your server setup is like. So there's two reasons why I'd suggest one of these services. First, it's amazingly simple to setup and manage and you don't need to worry about scaling. So convenience is number one. The second reason is more of a scenario. When you don't need super performance because your traffic isn't as high or maybe you're not doing anything as intensive (lots of map/reduces, etc.), it's probably a very good idea to use one of these services. Think about your own personal website. It's likely that you can have a free database solution for your own blog or something. Pretty cool!

Back on point. I've signed up for and have created two databases on MongoHQ and also MongoLab. MongoLab either was not at the MongoSV 2010 conference or I didn't see them...But I like them. So I'm going to compare these two services because they are extremely similar in pricing models, interface, and everything. 

Getting Started
It took all of 5 minutes to setup both services. MongoHQ I had to enter a credit card number, MongoLab I did not.

Configuration
I'm going to talk about Lithium here since that's what I use for a PHP framework (and you should probably too, if I can be an advocate for a few seconds here). Setting up Lithium to use MongoDB on MongoHQ and MongoLab was easy...Once you know what your config array should look like and also once you realize that you have to set the default timeout from 100ms to something higher, like a few seconds. The port number has to be in the host key value. The login key has to be set as well as the password key. This process was identical for both services.

However... Here's an interesting difference. MongoLab has several options when it comes to "where" you're hosting your database(s). You can choose to use Amazon EC2, Rackspace Cloud, or Joyent. This is a major point to MongoLab over MongoHQ. The reason is because one of my gripes is with performance. This is mainly due to hostname lookups and such. If you so happen to be using Rackspace for your hosting (again, I'm going to be a fan for them) you can use the private IP that's within their network! This should (I haven't tried it because I'm on Slicehost and have yet to move my server) help with the timeout setting that you just had to increase in your configuration. I imagine the same goes for Joyent and Amazon EC2 when choosing those locations for your database as well.

MongoHQ uses Amazon EC2 exclusively and you do not get to choose. I'm not sure which region or how it's setup, but I imagine if you also use Amazon EC2, you may get a performance bump when using MongoHQ when it comes to connecting to the database.

Pricing
The pricing models are the same. They both offer a free tier, but, what you get for what you pay is much different between the two. Not incredibly different...Except for when it comes to the free tier. You can likely run an entire blog/personal web site off MongoLab for free because they give you 240MB for free while MongoHQ gives you 16MB for free. Both then have similar plans, but not identical. You're talking about $5/mo differences here and there depending on which tier you fall into between the two services. Nothing to worry about.

It's only when you get into needing replication that things start to get different. MongoLab gives you replication on their plans that cost money. MongoHQ has it available, but you have to pay $300/mo to get it. MongoLab you get it even with their $10/mo plan.

Both have backups, that's cool. Both seem to be monitored, etc. What I'm not sure about is if MongoLab has you on a dedicated instance. MongoHQ lets you know that you are when you hit the mid range to higher end plans. The only thing MongoLab has is their dedicated plan which has a variable cost and I imagine you'd need to get in touch with them to price that.

So I'm not sure what that all means, but it could definitely affect performance. Maybe MongoLab can shed some light on that...Or maybe bcause MongoLab has it all replicated, the dedicated instance per account isn't as important because they are scaling with MongoDB's features.

I have to say neither wins the pricing category. They are comparable, but if I had to choose...I'd say MongoLab because their free tier is better.

Features
I have to say that I think MongoLab is going to have better features here. 

Update: Anyone who previously read this section would have seen info about how MongoLab offers replication. Which it does, but I have been informed it is not replica-sets. Meaning the failover is not automatic. Replica-sets are offered on their higher tier plan. This is more consistent with MongoHQ. Not knowing much about the internals of both services, I can't say if one is better than the other when it comes to scaling.

Here's another important note. MongoHQ seems to be running on an old version of MonogDB depending on which pricing plan you choose. You could be on 1.6.x or you could be on 1.8.x. 1.6.x is a bit old considering 2.0 is now out. MongoLab uses 1.8.x for everything. The "micro" instance on MongoHQ runs a 32-bit instance of MongoDB where the rest are 64-bit. MongoLab appears to use 64-bit for everything, but I could be wrong. They don't explicitly state that anywhere. Do you "need" 64-bit? No, not for 16MB or 240MB of storage.

Both services have a nice interface for browsing and even editing documents in your database. I love both. They have import/export features and it's everything you'd want. I don't think either service is better than the other when it comes to your database browser. MongoLab has the whole dark theme thing going while MongoHQ has a light design...If that matters to you. The important thing to note here is that MongoHQ allows you to hook up any database to their GUI. You can't do that with MongoLab. It's minor, but a cool feature.

Both services also have a REST API. This is neat for mobile apps, etc. Situations where you don't have acccess to a MongoDB driver...Which I think is pretty rare, but you never know. However, what MongoLab's API doesn't seem to do (which I wish it did) is allow you to deploy new databases. Being able to setup new databases via an API might make for a very nice solution when it comes to certain applications. For example, you may wish to create a service where every user who signs up pays you and each of those users gets their own database for security reasons and also so you can track their usage so you can in turn, charge them money to recover your hosting costs. Automation on that, would be great. MongoHQ appears to let you do that with their API.

Both services show you your database stats as well. I think MongoLab presents them with a little more helpful info which is good if you're new to MongoDB. MongoLab allows you to profile things as well whereas MongoHQ doesn't have anything like that built into their GUI. I imagine you could code your own profiling tools within your app though.

Overages
There's "soft" limits or overages that you can run into with both services. They both seem to be fair with this, but you need to be on top of your databases and if you are moving out of one tier, you need to switch your plan to the next tier. However, it's important to note that MongoHQ only has a soft limit on their high end plan. This means your database will not accept any more writes until you upgrade. I can only imagine for (just) the free plan on MongoLab the same is true. Given that you don't need to enter a credit card, I'm not sure how they would allow you to just keep using more and more. You get charged overages with the other tiers until you can switch plans. 

Nothing to really worry about with both services, but I think MongoLab handles things a little bit more nicely in case you're one of those people who don't really pay attention to your database and how it may be growing in size.

Reliability/Performance
This is something I can't speak about. I haven't used either service long enough to know. It can be completely possible that MongoHQ performs better and is more reliable than MongoLab. That would ultimately be the deciding factor for me to use MongoHQ over MongoLab, despite the pricing and features. It's simply more important than a database stays up.

My Choice
Both are very comparable services and are great. The differences between the two are minor and unless you're familiar with MongoDB, you likely won't really know what the differences are. You may not even care.

I personally will continue to play with both services since they have a free tier, but I am leaning toward MongoLab. I think they are definitely a service to follow and use. If you have a personal site that you want to use MongoDB, then I'd suggest trying one of these services. If you're on shared hosting, you likely need to use one of these services in order to use MongoDB. 

I think they also currently have the edge due to four reasons -- listed in order of what's most important to me.
#1 Choice of EC2, Rackspace, or Joyent with private IPs for (hopefully) better performance
#2 Replication
#3 Their free tier gives you more storage space
#4 Version 1.8.x of MongoDB

How About Just Reduce in Lithium?

Posted by Tom on Tue, May 24 2011 23:32:00

So after realizing my big problem with schema design for my little metrics example, I've changed the design a bit. Obviously if you store all this visitor data on one document, you're going to exceed the 4MB limit. So this was not just an example for me, this was to be actual code! Yikes! I'm glad I caught it a few hours later. So now imagine a schema where your "metrics" collection has a bunch of documents with all the same recorded information. Illustrated like so:

/* 0 */
{
  "_id": {
    "$oid": "4ddc78d27f8b9ada44000002"
  },
  "browser": "Firefox",
  "browser_ver": 4,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.2@example"
}

/* 1 */
{
  "_id": {
    "$oid": "4ddc78d97f8b9ade44000004"
  },
  "browser": "Chrome",
  "browser_ver": 11,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.3@example"
}

/* 2 */
{
  "_id": {
    "$oid": "4ddc78dd7f8b9adc44000001"
  },
  "browser": "Chrome",
  "browser_ver": 11,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.1@example"
}

Again this example is specific for some of my needs. I still want unique metrics on a per project basis in the system. So the _key field has a unique index on it. That simply takes care of that, whereas before I was using the $set operator, now I just save without worry. The slight difference is this time there won't be any updating of information if the user came back with a different browser later on. Like before the metrics aren't skewed because each "unique" visitor only has one entry per page/project tracked. Also like before, the downside is this isn't truly unique...It's more of a "per household" or a "per office" situation.

I now just have to make a very simple query to get all documents where the project_url is "example" in this case. This will eventually return a huge number of documents. Far too many to just pull up and loop using PHP each time I want an aggregate. So we don't need a map reduce now. We could use one, but it's overkill. We just want "half" of that...The reduce part. 

So we use the group() command and it kinda all works out to be similar in nature as the map/reduce. Again with Lithium, we want to get the connection using $db = ProjectMetric::connection(); and then we can run $db->connection->command() calls. Also don't forget the MongoCode class. We'll need that again for the reduce and finalize part of this.

Here's the juicy stuff.

$db = ProjectMetric::connection();

// We just need a reduce, not a map reduce in this case, it's actually a group() command
// We're also going to use a finalizer to format the data afterward.
$reduce = new \MongoCode("function(doc, prev) { ".
    // count browsers
    // first if the browser doesn't exist, set a key name for it with a value of 0
    "if(typeof(prev.browsers[doc.browser]) == 'undefined') {".
       "prev.browsers[doc.browser] = 0;".
    "}".
    // then increment
    "prev.browsers[doc.browser] += 1;".
    
    // count operating systems
    "if(typeof(prev.operating_systems[doc.os]) == 'undefined') {".
       "prev.operating_systems[doc.os] = 0;".
    "}".
    "prev.operating_systems[doc.os] += 1;".
    
    // count mobile devices... this is easier it's just one value
    "if(doc.mobile === true) {".
        "prev.mobile += 1;".
    "}".
    
    // count languages
    "if(typeof(prev.languages[doc.lang]) == 'undefined') {".
       "prev.languages[doc.lang] = 0;".
    "}".
    "prev.languages[doc.lang] += 1;".
"}");

// Make the output pretty, we don't want any spaces in key names for example...
// So we're taking those key names and setting them as values of a new "name" key.
// What if extract() was used or something? Those spaces would be an issue.
$finalize = new \MongoCode("function(prev) { ".
    // browsers
    "for (var x in prev.browsers) {".
        "prev.browsers.push({ name: x, count: prev.browsers[x] });".
        "delete prev.browsers[x];".
    "}".
    
    // operating systems
    "for (var x in prev.operating_systems) {".
        "prev.operating_systems.push({ name: x, count: prev.operating_systems[x] });".
        "delete prev.operating_systems[x];".
    "}".
    
    // languages
    "for (var x in prev.languages) {".
        "prev.languages.push({ name: x, count: prev.languages[x] });".
        "delete prev.languages[x];".
    "}".
    
    // mobile is different and is fine as is
    
    // return the data
    "return prev;".
"}");

// Run the command
$metrics = $db->connection->command(array(
    'group' => array(
        'ns' => 'project_metrics',
        'key' => 'project_url',
        'initial' => array('browsers' => array(), 'operating_systems' => array(), 'languages' => array(), 'mobile' => 0),
        '$reduce' => $reduce,
        'condition' => array('project_url' => $url),
        'finalize' => $finalize
    )
));

// Set the results, they'll be formatted how we want them to be
$results = $metrics['retval'][0];

// We can get unique visitors basically by the total number of documents pulled back
// because there is a unique index on the collection, It's on the "_key" field which
// takes the visitor ip and combines it with the "project_url" value. So there can't
// be duplicate entries so unique visitors is simply the total count.
$results['unique_visitors'] = $metrics['count'];

I commented it pretty well (honestly for my own benefit haha), but I'll briefly explain. The $db->connection->command(array('group'... part. We need to define "ns" which is the collection name, we also need to specify the "key" which in this case is the project url. This key is what to group the documents on.  So all documents with a "project_url" key of a given value are going to be grouped together. If we wanted to retrieve aggregate data for multiple URLs then this would come in handy. In this example, the "condition" key is going to kinda ruin the whole grouping thing because we're only going to return documents of the value we're grouping by. So one group always. Not several groups of data per URL.

The magic parts are the $reduce, initial, and finalize keys.  The reduce is actually another reduce function using the MongoCode class. So that's very similar to map/reduce. Same goes for the finalize function. The initial key is new. That essentially sets some default values for us to use. These will persist so as we loop we can easily keep adding on to them. In the most basic situation we have this "mobile" key set to 0. In the reduce function we're just incrementing it by one everytime the mobile value from the document (also named "mobile") is true. Simple. The others are also just counters.

The final results returned have array key names that have spaces in them. Ew. We don't want that in PHP. So I just made a simple finalize function that went through and moved the value (the count) to an array for each item to count that has a "count" key and a "name" key. The value for the "name" key of course can have spaces in it and there won't be any problems for PHP. It is possible to have spaces in the key names, but if we wanted to run extract() or something, there'd be issues.

That's about it. Another example. I'm not sure exactly how practical these examples are, but they do show you how you can make these operations in PHP when using Lithium. Once I wrap my head around things a bit more, I may try to give back a little and take a stab at making methods within the MongoDb adapter for these things.

On a final note. The group() command is designed to be executed in "real time" whereas map/reduce is more of a background task. So, back to practicality. This may still not serve my needs as the collection grows because the query could end up taking a long time. At which point I guess I'll have another map/reduce example to show for.

Map/Reduce in Lithium for Visitor Metrics

Posted by Tom on Tue, May 24 2011 11:01:00

Update: So after I got a little further along with this very example in a real life project I realized that while it makes for a very simplistic illustration of map/reduce (that I personally found helpful when learning how to perform map/reduce), it is not a very good real life example. The reason being... Smile ...The document size limit in MongoDB. Doh! You couldn't store metrics like this. However, ignoring the purpose of this, you can still continue reading about how to perform a map/reduce within Lithium.

Original Blog Entry

I'll start off by saying I love MongoDB and map/reduce after putting it off for some time. I dreaded learning the map reduce functions big time. It turns out, it's not that bad. A friend asked me to explain it in 10 words or less. So I did. It's not really all encompasing of the features, but it's a real good example for what map/reduce can do for you.

Use JavaScript to identify/"map" data to loop it to aggregate/"reduce."

Ok, so that's 12 words technically, I cheated by adding slashes and combinging two words. It's also really poor grammar. Anyway, that's the idea. In this example, I wanted to collect information about visitors on a web app. Obviously I'm not a masochist, I'd use Google Analytics if I could...Sadly, I could not. So what to do? Well, we can use MongoDB to record all this data and then use map/reduce to get some totals.

I may eventually turn this into a Lithium library (especially because there's a good browscap and language detection class that I'm not illustrating here), but for now I'm going over things at a high level and focusing on the actual map/reduce process.

That said, imagine a data set like this:

"metrics": {
    "pageviews": 63,
    "visitors": {
      "192-168-126-1": {
        "ip_address": "192.168.126.1",
        "browser": "Chrome",
        "browser_major_version": 11,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      },
      "192-168-126-2": {
        "ip_address": "192.168.126.2",
        "browser": "Firefox",
        "browser_major_version": 4,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      },
      "192-168-126-3": {
        "ip_address": "192.168.126.3",
        "browser": "Chrome",
        "browser_major_version": 11,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      }
    }

Now, we have this "metrics" field where ever you like, but in my case on a document that contains some other information. Why not a separate "metrics" collection? We could and then we could also put in things like page URLs that were hit on the site to start getting analytic information about the pages on our site. In my case, I just wanted to get a sense for some high level information about my visitors. For now.

So the first thing here that you'll notice (and I've written about the $set operator before) is that each IP address is the key for each entry. The dots have been replaced with dashes so that it works as a key. Otherwise, I'd have a pretty deep object on my hands. Surprised

So each time a page is loaded the pageviews count goes up and the visitor's browser information is captured using $set so that if the user from the same IP address came back again with a different browser, it would update. My metrics would not be skewed. Yes, it's sad that we don't realize when/if the user actually uses two different browsers...More sad that we're likely counting entire office buildings as one user, but that's just how the cookie crumbles in this case.

Ok, so we have that data and we have some controller action in our Lithium project that's going to return to us an array that we'll pass to the view template to make some pretty pie charts. Why not pie charts? I love pie charts, they give everyone a sense of satisfaction that looking at numbers is really fun! ...Or something like that.

We'll dive right in. Here's the entire action I'm using with the map/reduce code. Note that Lithium's MongoDb adapter does not have any options for map/reduce in the find() or any other method. I may write something in the future for that myself if I end up doing enough of these (and I likely will). However, we can make straight up command() calls from it.

 public function metrics($url=null) {
        if(empty($url)) {
            return false;
        }
        
        $db = Project::connection();
        
        // construct map and reduce functions
        $map = new \MongoCode("function() { ".
            "emit(this.metrics.visitors, this.metrics.visitors);".
        "}");
        
        $reduce = new \MongoCode("function(k, vals) { ".
            "var visitors = vals[0];".
            "var unique_visitors = 0;".
            "var b_counts = new Array();".
            "var browsers = new Array();".
            "var os_counts = new Array();".
            "var operating_systems = new Array();".
            "var mobile_devices = 0;".
            "var ln_counts = new Array();".
            "var languages = new Array();".
            
            // loop all the emitted visitor metrics to aggregate some data
            "for (var i in visitors) {".
                // count browsers
                "if(typeof(b_counts[visitors[i].browser]) == 'undefined') {".
                    "b_counts[visitors[i].browser] = 0;".
                "}".
                "b_counts[visitors[i].browser] += 1;".
                
                // count operating systems
                "if(typeof(os_counts[visitors[i].operating_system]) == 'undefined') {".
                    "os_counts[visitors[i].operating_system] = 0;".
                "}".
                "os_counts[visitors[i].operating_system] += 1;".
                
                // count the primary languages
                "if(typeof(ln_counts[visitors[i].primary_language]) == 'undefined') {".
                    "ln_counts[visitors[i].primary_language] = 0;".
                "}".
                "ln_counts[visitors[i].primary_language] += 1;".
                
                // count the number of mobile devices
                "if(visitors[i].mobile_device == true) {".
                    "mobile_devices += 1;".
                "}".
                
                // count the number of unique visitors
                "unique_visitors += 1;".
            "}".
            
            // loop browsers counted and set for output
            "for (var x in b_counts) {".
                "browsers.push({ name: x, count: b_counts[x] });".
            "}".
            
            // loop operating systems counted and set for output
            "for (var x in os_counts) {".
                "operating_systems.push({ name: x, count: os_counts[x] });".
            "}".
            
            // loop languages counted and set for output
            "for (var x in ln_counts) {".
                "languages.push({ name: x, count: ln_counts[x] });".
            "}".
            
            // return the output
            "return { 'browsers': browsers, 'operating_systems': operating_systems, 'languages': languages, 'mobile_devices' : mobile_devices, 'unique_visitors': unique_visitors }; }");
        
        $metrics = $db->connection->command(array(
            'mapreduce' => 'projects', 
            'map' => $map,
            'reduce' => $reduce,
            'out' => array('merge' => 'mapReduceMetrics')
        ));
        
        $cursor = $db->connection->selectCollection($metrics['result'])->find()->limit(1);
        foreach ($cursor as $doc) {
            $results = $doc['value'];
        }
        
        // Get the total page views for this project
        $pageviews = Project::find('first', array('fields' => array('metrics.pageviews'), 'conditions' => array('url' => $url)));
        $results['pageviews'] = $pageviews->data('metrics.pageviews');
        
        return $results;
    }

Yea, it's not the prettiest to look at. It's my first run through and it's literally based off an example from php.net so that's why there's all those lines concatenated together like that. I wouldn't normally do that. Nor would I use heredoc...But something a little nicer, at least single quotes instead of double. Anywyay, with that you will be returned a nice array (in $results) that will show all the counts for browsers and such. Note, I did not take into account the browser major versions here in this example. Also note that I separately stored a pageview count on the document which does not require a map/reduce to retrieve. 

Now let's look at it deeper. There's a lot of good articles on map/reduce if you spend time with them, they should be pretty clear. Here is a good one. Then you can also look at the MongoDB Cookbook site's example. Also php.net's example. You'll see that you can use map/reduce for many things. Let's go over how I'm using it.

First, the map function. Pretty simple. In fact, you likely wouldn't do what I'm doing here. The idea of it is to basically grab keys and values for a given collection. Those keys should be unique. So in my case metrics.visitors are unique keys. They are also the values that I need. What this does is returns the values to a reduce function.

The reduce function. More complex, but it's all nice friendly JavaScript. Here you're just looping the values that are passed and simply counting some of them. As a disclaimer, my example could have probably been written a lot better and cleaner. I only loop once which is what I was concerned about mainly. The rest can be refactored later.

At the end of whatever you decide to do with all that data, you'll return your values. I'm returning an object here with all the counts. Here's what PHP gets back in $results:

array
  'browsers' => 
    array
      0 => 
        array
          'name' => string 'Chrome' (length=6)
          'count' => float 2
      1 => 
        array
          'name' => string 'Firefox' (length=7)
          'count' => float 1
  'operating_systems' => 
    array
      0 => 
        array
          'name' => string 'Win7' (length=4)
          'count' => float 3
  'languages' => 
    array
      0 => 
        array
          'name' => string 'en-us' (length=5)
          'count' => float 3
  'mobile_devices' => float 0
  'unique_visitors' => float 3
  'pageviews' => int 256

...And there ya have it. What I would do next is actually cache this data so each time I called the action, it didn't have to run the map reduce which could be quite expensive over time with a lot of data.

Cool note: In this example you see the $metrics = $db->connection->command(...) part? Run a var_dump() on $metrics. It will have some handy information for you. It could tell you about an error when it comes to parsing your functions (though I'm not sure how to actually debug things, sorry). It also will tell you if everything was ok and ran successfully. You may wish to check this before returning data. It's on my to do list myself. Also, it will show you how long the operation took which is very handy. You might need/want to index some fields and cache results based on how long things are taking.

Another note: With map/reduce you're actually outputting to a collection. So you're going to pick up your results with another query to that (temporary or not so temporary) collection. This changed in MongoDB version 1.8.0. You now have to specify that 'out' key in the command() call. Here's more information on that

Hopefully these snippets will be of some help to people. I didn't want to go too far in depth with explaining everything, I think there's other really good articles on that out there. My hope is that seeing an example, as it works within the Lithium framework, will be helpful.

Family Spoon and MongoDB

Posted by Tom on Fri, Mar 25 2011 08:15:00

Family SpoonI recently soft-launched my personal project, Family Spoon. You can go to it and use it, it works. It's a recipe sharing website for you, your family and friends. Basically it allows you to create recipes and share them (or keep them private). There's a bit more to it and the site will continue to grow over time. However, with just the sharing there were some considerable hurdles to overcome.

Database Schema
The juicy schema details? Look at the bottom for an example document structure, but for readability I'm not going to paste that here. I'll breifly go over the architecture a little bit. The site was developed in PHP using the Lithium framework. The use of MongoDB was critical to the way in which the site was built, especially for speed reasons. Each "recipe" on the site actually sits in a "page" collection in MongoDB. It's saved there by a "Page" model. Nearly all pages (except static pages and other form pages) are treated this way on the site. Not active, but built is a "blog" section for the site and that also uses this Page model and collection. Obviously, a recipe and a blog entry have very different information. It's due to MongoDB's schemaless design that these two "ideas," which are really the same thing (they are web pages with content), can exist in the same space. Both have titles, but one has ingredients and the other some body copy (and yes some more fields). So right there that saved a ton of time. You need to be aware of, but not meticulously design (and overdesign) your database schema and ensure it's properly normalized, etc. I can now, worry free, add a new model that extends the Page model to add a completely new section to the site and I know it's going to be stored in the same place in the same manner. Done.

In the past? With something like MySQL I would have many many tables to ensure I properly optimized and normalized things. Then when I wanted to add a new section, it would mean either adjusting, or if I did a really really good job, adding at least one more table for the new section. Now, what happens if (and I will) in the future I need to add more than just prep time, cook time, etc. on the recipe document? No problem! I can easily stick in another field and not even think twice. I can also remove them. Again, with other database, I'd have to watch and worry about schema changes. It would take much longer to not only build a site, but also change it. MongoDB makes maintaining a database for a site easy.

Furthermore and leading into the next section is datatypes. Along with database schema, we would traditionally have to think about exactly what kind of data was stored in these tables. Reserving too much is a bad thing and we're sworn off from things like BLOB, so you have to play this delicate planning game. Am I really going to have a user that has a recipe title more than 255 characters? Nah, no way. Really? Oh, well they can't. They just can't. The user will deal with it. The user will live. Eh...ok...But not great. We had to draw the line in the past, but no more! (except for maybe 4MB for now) Smile

The other big thing here with schema and data types is tagging. We love to tag these days. Photos, articles, people, etc. Now, recipes. Family Spoon allows you to tag your recipes on the site in order to be able to find them more easily in the future. These are custom user entered phrases that get stored as an array in the database. Before, we would immediately create a "tags" table to ensure the same word wasn't repeated God forbid. We would join a million times and a week later, we'd have our results. Ok, with memcached back seat driving we'd get there faster. So with MongoDB, we just store those tags in the natural manner we would expect. Again, done and done.

Search for RecipesSearching & Filtering
I'll focus on permissions and search (and filtering) next. First, search. The site features a search box at the top of each page and depending on what area you're in, it will search for different things (the search button copy changes to indicate this too). These are regex searches on the database. It was the use of MongoDB that allowed for this whereas FULLTEXT searches with MySQL wouldn't really work so well. Then again we have problems with searching tags and all those JOINs. No good.

So MongoDB is allowing Family Spoon to run a regex search on recipe titles and tags. It's extremely easy to adjust this to search for other things...Say, ingredients?? Then perhaps filter by ingredient? So let's take a detour. Say we want to add a feature to the site that shows a user only recipes that do not contain shrimp because they have an allergic reaction to that food. Done. It's extremely simple to build this query and have it working on the site. A query for MySQL would work with WHERE IN() and not and ... all that good jazz. Sure, but it's not just about the query. It's about being able to handle that tall order. If I was using MySQL for Family Spoon and wanted to show people recipes that did not contain shrimp...I would have to put all sorts of caching strategies in place and perhaps if a popular enough feature, would need additional hardware to support the increased load. In fact, I wouldn't do it. I'd use a search engine like Solr. Ok, so now something else to setup and configure. Yes, we know how to do it, but time, time, time.

Back to regex searches. Again, you simply wouldn't do it with MySQL. You'd setup a search engine. I may eventually need one for Family Spoon (for things like weighting), but right now, no. There's no additional search engine. It's just running queries on MongoDB. We're also filtering. We're going to roll right into permissions here with that.

Permissions & Access
So, we have filtering in addition to search because we can show "your" recipes, "public" recipes, or "your family" recipes. This is determined by the access that each recipe gets assigned by it's creator. So in addition to search, we're searching specific recipes. While searching for public recipes, we mean only return recipes that also are flagged as being shared with the public. When searching your own, obviously where your owner id is set. When searching or accessing your family recipes...This is a little different. We now have to determine exactly which recipes you have permission to see.

Hey remember that ACL thing in MySQL? Haha, yea you know the one where you have all those tables and numbers and JOINs? Kiss that one bye bye. The access rules are extremely simple for Family Spoon and without some awkward tree table to screw up, they're also more reliable. It's a faster query too. In fact, most the time just for loading the recipe page document you have all the information you need to determine access. So why would we want to then jump through a bunch of hoops to determine access? Be efficient...And should access be allowed, you also now have the data to display. One query. The user's data is cahced in the session. So... What more can I say? Yes, there's a bit more to it and not all situations allow for that. Family Spoon also uses Facebook and you can share recipes with your Facebook friends. So that's another call to somewhere else, but for the most part, we aren't really talking about a lot of strain on the database.

Recipe SharingFamily Spoon can get a bit more specific too. You can share recipes with specific individuals. Again, an array of ids is stored and a simple $in takes care of it. Or, returning the entire field and using PHP's in_array() function takes care of it. How many people can a recipe stored with? How many ids could be in there? A lot less than the number of rows that would be in ARO and ACO tables. So, we're efficient too when using MongoDB. Not just fast, but efficient.

The really "rapid" part that MongoDB helps us with is when it comes back to the code. Building forms and saving the data down to the collection is far easier than it would otherwise be with an ACL system inside a relational database. There's less being written to the database and less code to do it. Sure, your framework could give you a wonderful API for working with permissions, but the minute you need to deviate from that, you could be opening a can of worms.

Ratings & Voting
Along the same lines is something like ratings. On recipes we can have a star rating. Anything that we need to tally and aggregate or increment even. MongoDB makes this easy. So ratings alone aren't a show stopper for any database or system, but when you then say, "Ah yes, but I only want each person, user, IP, to vote one time only." Then you have to put your thinking cap on. You're going to end up with some more JOINs... Also, more code. You're going to have to make the query and check to ensure that some person isn't trying to vote twice. With MongoDB, you can use the $set call and simply keep adding these ratings to a field and make the key equal to the user id or IP and the value their rating. If they vote again, they aren't adding a new rating, they are simply updating their own. Ta da. Now we've reduced the number of tables, data, and code required to setup a rating system. I've actually posted another blog entry about this here.

Conclusion
So, for many reasons, using MongoDB has made building Family Spoon a much quicker process. Trust me, I actually did it both ways. Family Spoon has been completely re-built. While the other version never really appeared out in the wild, it did exist (and was online). Previously, Family Spoon was built using the CakePHP framework with MySQL for the database. So I can most definitely tell you the differences in the amount of code and time planning between the two versions. I rebuilt the site much faster not just because I knew what I was going to do the second time around, but also because I had less to think about when it came to the database schema. Yes, you need to be aware of "schema" and you can't go hog wild, but you also get more forgiveness and MongoDB works with you to solve your problems. It's very flexible. It's not something that you need to work around, it's something that you get to work with. Anytime that you have a situation like that as a developer, your day is going to be much more happy and productive.

Do I hate MySQL? No. Definitely not, years and years worth of use and relationship aren't easily erased. It's comfortable and it's familiar and just fine. In fac,t still preferrable under certain circumstances. Just not mine. Let's make this clear, I am the only person developing Family Spoon. Just one developer and, without using MongoDB and the Lithium framework, I can tell you that it would have taken me a lot longer to not only get the site online. It also would have taken longer to provide some of the advanced features that people will be looking for (filtering, searching, etc.). Both MongoDB and Lithium not only served me with rapidly getting the project up and working (with all the core functionality that I needed) but these technologies will also be serving me into the future with rapidly being able to grow the site and add new features.

Last, so this didn't interrupt your reading pleasure, this is an example document from the database. It's just an example and not complete, but I wanted to highlight the schema and how things like tagging and ingredients worked. Each ingredient is broken out and that's going to go a long way for filtering. Filtering without JOINs.

{
  "_id": "4d6564cb9bae6c1066000000",
  "created": "Wed, 23 Feb 2011 11:49:31 GMT -08:00",
  "directions": "Place the chicken in the crockpot. ...",
  "ingredients": [
    {
      "ingredient": "boneless chicken breasts",
      "quantity": "3",
      "measurement": "lbs"
    },
    {
      "ingredient": "milk",
      "quantity": "1",
      "measurement": "cup"
    },
    {
      "ingredient": "salt"
    },
  ],
  "modified": "Fri, 11 Mar 2011 09:46:31 GMT -08:00",
  "owner_id": "xxxxx",
  "owner_ids": [
    "xxxxx",
    "123456"
  ],
  "page_type": "recipe",
  "public": true,
  "public_rating": {
    "127-0-0-1": "4",
    "255-255-255-0": "5"
  },
  "published": true,
  "serves": "",
  "share_with_friends": false,
  "tags": [
    "chicken",
    "crockpot",
    "broccoli"
  ],
  "time": {
    "prep": {
      "amount": "15",
      "unit": "minutes"
    },
    "cook": {
      "amount": "4-6",
      "unit": "hours"
    }
  },
  "title": "Crockpot Broccoli Chicken",
  "url": "crockpot-broccoli-chicken"
}

MongoDB Local Admin GUI

Posted by Tom on Wed, Feb 23 2011 10:39:00

I'm very excited to see a few more admin GUIs for working with MongoDB out there. I'm on Windows so I decided to try MongoVUE. Works great! I really haven't used it for a lengthy period of time, but my first reaction is that it's really good. For the OS X users out there, I think MongoHub might be worth taking a look at. 

So how to connect? Well, if you're working locally, it should be pretty trivial...But what if you're trying to connect to a firewalled server? Tunnel time. However, MongoVUE (I don't know about MongoHub) doesn't have built in tunnel support. That's probably the only thing I don't like about it so far. Not to worry though, you can simply make an SSH tunnel using PuTTY (or KiTTY, which I like better). Here is a really good article on how to setup a tunnel, it takes no time at all. After you're done, connect to localhost on the port you defined for the tunnel and you should be set!

MongoVUE gives you several ways to view the data in your collection. You can use the "text" view which shows you a JSON string that you're probably used to see in the mongo console, but you can also choose a tree and table view which is probably what you're hoping to see with an admin GUI for any database. The only other thing I might like to see is an actual console within this tool so you can manually enter commands. While I don't see that (if it's there somewhere) I do see that it will return the commands used while browsing and performing actions with the GUI. This is nice because you can then copy and paste those commands to an actual terminal window or something. I'm definitely looking forward to seeing feature enhancments for this tool.

Creating a Simple and Efficient Rating System with MongoDB

Posted by Tom on Fri, Feb 18 2011 14:51:00

I wrote a previous article on using MongoDB's non-locking updates and it's $inc operator in order to allow for a simple counter that would create minimal impact on your database. Related to that, let's suppose we want to make a rating system. Say it's a star rating; however, it can be any kind of voting or polling system really.

So here's the trick. You're storing this star rating on a piece of content (a document) and you already queried that so you can display the data on the page. We're in NoSQL land so we don't use JOINs and we really don't want to query another "rating" collection of some sort. There's no point. So we keep under this document a "ratings" array. Now when you load the document you can simply count and divide to get your average with the language of your choice. You could also make another count() query to MongoDB but again we want one query.

So how do you store ratings? You want to do an non-locking update to the database and much like incrementing, we want the impact on the database to be low. It's not that MongoDB isn't an amazing database that is more than capable of handling these updates, even if you need to make a read first, but we want to be efficient. Let's assume we're making millions of queries and this is all adding up to save us more than a few pennies in hosting costs.

Ok, so let's take a look at the $set operator. It's a simple one, nothing magical. It's just some cleverness with what we are setting. Under this "ratings" array we're going to set each users IP address (substitute for user id if you have some sort of authentication system you'd like to rely on instead) into this ratings array with a value. 

$set will set the value or update it. So you don't have to worry about repeat voters! They simply will change their vote and not skew your results. If you don't want the users to be able to vote twice then you can simply disable the link to vote on the page by looking for the IP in the array of ratings. Which, yes, is a simple deterrent and the user could technically lift the URL to hit to make the request to vote again...But again, your results aren't skewed and if you need further protection you can come up with something else. 

Let's apply this to my favorite framework here, Lithium. Here's what your query might look like:

$query = array(
  '$set'=> array('rating.' . str_replace('.', '-', $_SERVER['REMOTE_ADDR']) => $rating)
);

$conditions = array('_id' => $some_id);

$result = Page::update($query, $conditions, array('atomic' => false));

Do you see the gotchya? It's the IP address. You can't have a key name with dots in it. When passed to MongoDB it's going to translate to object hierarchy. So you'd end up with something like: rating" : { "192" : { "168" : { "126" : { "1" : "3" } } } } ... No good. So if we replace the dots with dashes, underscores, or hash it, etc. then you'll end up with something like this instead: "rating" : { "192-168-126-1" : "5", "192-168-126-2" : "3" }

Now if the same IP address voted again it would simply change the value in the "rating" array on the document. This way a user couldn't skew the rating by contstantly clicking on a star rating widget, submitting a form, etc. Of course you can further limit the impact to your database and server by setting a cookie.

Now it's just simple math to get the average. If you're using PHP, you'll run a count() on the ratings array and then add up the values and divide. Simple!

What would you have to do alternatively? Well, you might store another field with all IP's or user id's that voted in addition to a rating field... But now you've disassociated users with their votes. You could also store a different field for each option and then use $addToSet to add the IP to each option. Great, but then you allow a user to vote once for each option. You won't be completely skewed, but you wouldn't be as accurate. What about MySQL and relational database land? Well, you probably don't need to hear about all the ways you can use JOIN and how many rows you'll be scanning through to get your results and what kind of indexes you need to make sure you build.

Hope you found this little schema example useful in a schemaless world.

Lithium Quick Tip: Incrementing

Posted by Tom on Thu, Jan 20 2011 14:13:00

As I was pointed out today in the wonderful #li3 channel, MongoDB has an $inc operator, awesome! One I didn't see before (there are oh so many handy ones). So how do you write this in Lithium land? Well because we like being efficient and all, let's also look at how to do this without first reading the record (ie. an "update()" call).

Model::update(
// query
array(
  '$inc' => array(
    'field_name' => 1
  )
), 
// conditions
array(
  '_id' => $id
), 
// last but not least
array(
  'atomic' => false
)
);

There you have it. To decrement you'd put -1. Now also note that in my case (and it's worth mentioning) the value I had to increment was within an array actually. To get at that field you can simply use dot syntax to jump down into it. So the query would be more like:

array(
  '$inc' => array(
    'field_name.value_within' => 1
  )
)

Putting it altogether you can in pretty much one line, one call, very nicely increment or decrement values. Since MongoDB is so wonderful this gives you a very easy way to keep say a real time hit count or something for content within your Lithium application.

Lithium Quick Tip: Displaying Dates

Posted by Tom on Fri, Jan 14 2011 10:34:00

So if you're using Lithium and MongoDB you can set in your model the $_schema property which will allow you to set data types for your fields. So things may look like this: 

protected $_schema = array(
  '_id' => array('type' => 'id'),
  'created' => array('type' => 'date')
);

With that, the "created" field will be of type "date" which will be a MongoDB date object. This date object isn't going to display properly when you got to print it back out in your view templates. You'll end up with some weird number value. The timestamp is actually within this date object that's returned under the "sec" key. So it would be $document->created->sec in your view template (assuming $document was your result from the database).

How about formatting to a pretty date? If you're coming from CakePHP or some other frameworks you may be spoiled and have a time or date helper. Lithium is pretty lean and leaves that as one of your responsibilities. Fortunately, alkemann has created some helpers that mimic some from CakePHP for Lithium. These include a Time helper. So in your view template, you can use:

echo $this->time->to('nice', $document->created->sec); 

That will output something more readable like, "Fri, Jan 14th 2011, 02:29" instead now. Don't have or want those helpers? You may be crazy, but you can also use:

echo date('Y-M-d h:i:s', $document->created->sec); 

However, I would definitely give those AL13 helpers a look, they are pretty nice. The important thing to remember here though is that when using MongoDB date objects, your timestamp is under this "sec" key. Of course, you can always store timestamp integers or string values, but then you're really crazy.

MongoDB Queries with Lithium: Part Three

Posted by Tom on Mon, Aug 16 2010 19:13:00

This will probably be the last part of the series, unless I find some other type of query that may not be so obvious or that may be new in a future release of MongoDB. This will also probably contain the most obscure and complex examples.  

Regex With $or
So, on with the good part first. You probably want to create some sort of search for content on your site, be it on the front end for visitors or the back for only admins...It's a common thing for a site and while it's always a good idea to use something like Lucene/Solr/Sphinx if you expect complex searches and lots of them...You can make a quick and dirty search with most databases such as MySQL via a FULL TEXT search. Well, MongoDB does one better and takes advantage of regex, which I posted about in the last article for this series...But this time we're going to use a new operator, the $or operator in conjunction with regex. This is new in MongoDB 1.5.3, so ensure that you at least that version.

What's the $or do? Well, the $or is important because if you make a query with regex or not and want to search multiple fields, you will have all results from both matches be returned. The $or keeps you from getting duplicate results. If you remember in part one we used $elemMatch when dealing with matching multiple conditions within objects. In this example, we aren't working with complex objects, we're just saying go find a regex pattern in the title or the body of some document. Then again, maybe you will need to also apply what you learned about $elemMatch in the previous article with something like this. Right, so get on with it.

// I'm making one call to MongoRegex() but you may need more.
$search_regex = new \MongoRegex('/'.$this->request->data['query'].'/i');

$conditions => array(
     'public' => true,
     '$or' => array(
               array('title' => $search_regex),
               array('directions' => $search_regex)
          )
     )
);

// Now you can put the query together (or define $conditions inline with the find() call)
$recipes = Recipe::find('all', array(
     'fields' => array('title', 'url'),
     'conditions' => $conditions
));

My head exploded at first, but it's quite easy to follow along with. A little tip? Reference the MongoDB documentation, make your queries in a console window and when you get back the results you want...Take the text you entered and print it out on your page where you're making the query. I just print out the guts between the parenthesis. Then with your $conditions array that you build, run json_encode() and also print that out on the screen. It's a good way to see how the array format is getting translated. It makes it a lot easier if you're not a human PHP->JSON translator. For complex objects it helps quite a bit.

So what does that do above? A little context. I have some recipes and I only want those marked as being true for the "public" field to be included in the results first and foremost. You may have similar needs with some sort of publish status. Then we're running a regex on the query string passed via POST, but we're running it on two fields. In my case I have the title for the recipe and then some directions. I may also want to search ingredients too. Of course ingredients is in an object for me, so hey where's that good ol' $elemMatch? Yay, more complexity.

Ok. So sticking with this...What is returned is exactly what you'd expect, any recipe that's publicly visible that has the query in the title or directions (case insensitive match).

Lithium's "like" array key will not work in this case for two fields. It will for one. So in other words:

// Will NOT work
$conditions = array(
     'title' => array('like' => '/chicken/i'),
     'directions' => array('like' => '/cook/i')
);

// WILL work, but won't give us what we're after
$conditions = array(
     'title' => array('like' => '/chicken/i')
);

Again, keep in mind we also need to use the $or operator for a very important reason. We don't want duplicate records in our search results. How about the other operators and other conditions? Sure! You can apply the $nin and $ne and all the other operators too. Those are a little different though. So where we put in $or as if it was a field, we have to use $nin and $ne like the other article explained. Let me write it out again all glued together with the first example here.

$conditions => array(
     'public' => true,
     '$or' => array(
               array('title' => $search_regex),
               array('directions' => $search_regex)
          )
     ),
     'title' => array(
          '$nin' => array('BBQ Chicken', 'More Chicken')
     )
);

You can probably start to imagine how you would build an "advanced" search where you allowed the user to fill out a few different form fields that narrowed the search down. Ok, so let's glue together a query that combines a little of everything that we've gone over in all three articles. This will be as if a visitor came to the site, searched for recipes that were shared with a specific family (let's say their family), had "chicken" in the title/directions, and was not shared with the public. So only a private chicken recipe currently approved and shared with a specific family (group).

$conditions = array(
     'public' => false,
     '$or' => array(
          array('title' => $search_regex),
          array('directions' => $search_regex),
          array('ingredients.ingredient' => $search_regex)
     ),
     'title' => array(
          '$nin' => array('More Chicken')
     ),
     'families' => array(
          '$elemMatch' => array(
               'family_id' => '4c59d5ec7f8b9a6279000000',
               'approved' => true
          )
     )
);

Confused? You will be, in this episode of MongoDB & Lithium...Right, so your queries can get rather complex. Hopefully you won't be confused after reading my article, but to be fair, that's complex even if you were to type it straight up in the console written as JSON. It's no different really, but don't think that you need to make a custom query with Lithium, your query can be represented as a PHP array which really helps you out with dynamic values.

The above query also looks at ingredients, which in my case are within an array. In a production situation I probably would build a more complex regular expression or possibly several. The query could end up even more complex based on what's posted along to my search method. This should really give you some ideas though. How does this impact the database? I'm not really sure. I haven't benchmarked this at all or anything.

The $type Operator
One last query for the road? Sure. This is a simple one and probably very obscure. MongoDB lets you query by data type. Wha?? Crazy right? I personally haven't found a real great use for this yet, but you never know. Here's an example:

$conditions = array(
     'title' => array(
          '$type' => 2
     )
);

That will return every recipe where the title is a string. The $type operator accepts a list of numbers that are assigned to a BSON type. You can see the list of types on the MongoDB documentation page.

That's all for now, hope you enjoyed!

1 | 2