Shift8 Creative Graphic Design and Website Development

How About Just Reduce in Lithium?

Posted by Tom on Tue, May 24 2011 22:32:00

So after realizing my big problem with schema design for my little metrics example, I've changed the design a bit. Obviously if you store all this visitor data on one document, you're going to exceed the 4MB limit. So this was not just an example for me, this was to be actual code! Yikes! I'm glad I caught it a few hours later. So now imagine a schema where your "metrics" collection has a bunch of documents with all the same recorded information. Illustrated like so:

/* 0 */
{
  "_id": {
    "$oid": "4ddc78d27f8b9ada44000002"
  },
  "browser": "Firefox",
  "browser_ver": 4,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.2@example"
}

/* 1 */
{
  "_id": {
    "$oid": "4ddc78d97f8b9ade44000004"
  },
  "browser": "Chrome",
  "browser_ver": 11,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.3@example"
}

/* 2 */
{
  "_id": {
    "$oid": "4ddc78dd7f8b9adc44000001"
  },
  "browser": "Chrome",
  "browser_ver": 11,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.1@example"
}

Again this example is specific for some of my needs. I still want unique metrics on a per project basis in the system. So the _key field has a unique index on it. That simply takes care of that, whereas before I was using the $set operator, now I just save without worry. The slight difference is this time there won't be any updating of information if the user came back with a different browser later on. Like before the metrics aren't skewed because each "unique" visitor only has one entry per page/project tracked. Also like before, the downside is this isn't truly unique...It's more of a "per household" or a "per office" situation.

I now just have to make a very simple query to get all documents where the project_url is "example" in this case. This will eventually return a huge number of documents. Far too many to just pull up and loop using PHP each time I want an aggregate. So we don't need a map reduce now. We could use one, but it's overkill. We just want "half" of that...The reduce part. 

So we use the group() command and it kinda all works out to be similar in nature as the map/reduce. Again with Lithium, we want to get the connection using $db = ProjectMetric::connection(); and then we can run $db->connection->command() calls. Also don't forget the MongoCode class. We'll need that again for the reduce and finalize part of this.

Here's the juicy stuff.

$db = ProjectMetric::connection();

// We just need a reduce, not a map reduce in this case, it's actually a group() command
// We're also going to use a finalizer to format the data afterward.
$reduce = new \MongoCode("function(doc, prev) { ".
    // count browsers
    // first if the browser doesn't exist, set a key name for it with a value of 0
    "if(typeof(prev.browsers[doc.browser]) == 'undefined') {".
       "prev.browsers[doc.browser] = 0;".
    "}".
    // then increment
    "prev.browsers[doc.browser] += 1;".
    
    // count operating systems
    "if(typeof(prev.operating_systems[doc.os]) == 'undefined') {".
       "prev.operating_systems[doc.os] = 0;".
    "}".
    "prev.operating_systems[doc.os] += 1;".
    
    // count mobile devices... this is easier it's just one value
    "if(doc.mobile === true) {".
        "prev.mobile += 1;".
    "}".
    
    // count languages
    "if(typeof(prev.languages[doc.lang]) == 'undefined') {".
       "prev.languages[doc.lang] = 0;".
    "}".
    "prev.languages[doc.lang] += 1;".
"}");

// Make the output pretty, we don't want any spaces in key names for example...
// So we're taking those key names and setting them as values of a new "name" key.
// What if extract() was used or something? Those spaces would be an issue.
$finalize = new \MongoCode("function(prev) { ".
    // browsers
    "for (var x in prev.browsers) {".
        "prev.browsers.push({ name: x, count: prev.browsers[x] });".
        "delete prev.browsers[x];".
    "}".
    
    // operating systems
    "for (var x in prev.operating_systems) {".
        "prev.operating_systems.push({ name: x, count: prev.operating_systems[x] });".
        "delete prev.operating_systems[x];".
    "}".
    
    // languages
    "for (var x in prev.languages) {".
        "prev.languages.push({ name: x, count: prev.languages[x] });".
        "delete prev.languages[x];".
    "}".
    
    // mobile is different and is fine as is
    
    // return the data
    "return prev;".
"}");

// Run the command
$metrics = $db->connection->command(array(
    'group' => array(
        'ns' => 'project_metrics',
        'key' => 'project_url',
        'initial' => array('browsers' => array(), 'operating_systems' => array(), 'languages' => array(), 'mobile' => 0),
        '$reduce' => $reduce,
        'condition' => array('project_url' => $url),
        'finalize' => $finalize
    )
));

// Set the results, they'll be formatted how we want them to be
$results = $metrics['retval'][0];

// We can get unique visitors basically by the total number of documents pulled back
// because there is a unique index on the collection, It's on the "_key" field which
// takes the visitor ip and combines it with the "project_url" value. So there can't
// be duplicate entries so unique visitors is simply the total count.
$results['unique_visitors'] = $metrics['count'];

I commented it pretty well (honestly for my own benefit haha), but I'll briefly explain. The $db->connection->command(array('group'... part. We need to define "ns" which is the collection name, we also need to specify the "key" which in this case is the project url. This key is what to group the documents on.  So all documents with a "project_url" key of a given value are going to be grouped together. If we wanted to retrieve aggregate data for multiple URLs then this would come in handy. In this example, the "condition" key is going to kinda ruin the whole grouping thing because we're only going to return documents of the value we're grouping by. So one group always. Not several groups of data per URL.

The magic parts are the $reduce, initial, and finalize keys.  The reduce is actually another reduce function using the MongoCode class. So that's very similar to map/reduce. Same goes for the finalize function. The initial key is new. That essentially sets some default values for us to use. These will persist so as we loop we can easily keep adding on to them. In the most basic situation we have this "mobile" key set to 0. In the reduce function we're just incrementing it by one everytime the mobile value from the document (also named "mobile") is true. Simple. The others are also just counters.

The final results returned have array key names that have spaces in them. Ew. We don't want that in PHP. So I just made a simple finalize function that went through and moved the value (the count) to an array for each item to count that has a "count" key and a "name" key. The value for the "name" key of course can have spaces in it and there won't be any problems for PHP. It is possible to have spaces in the key names, but if we wanted to run extract() or something, there'd be issues.

That's about it. Another example. I'm not sure exactly how practical these examples are, but they do show you how you can make these operations in PHP when using Lithium. Once I wrap my head around things a bit more, I may try to give back a little and take a stab at making methods within the MongoDb adapter for these things.

On a final note. The group() command is designed to be executed in "real time" whereas map/reduce is more of a background task. So, back to practicality. This may still not serve my needs as the collection grows because the query could end up taking a long time. At which point I guess I'll have another map/reduce example to show for.


[Back To Blog Index]