Shift8 Creative Graphic Design and Website Development

Map/Reduce in Lithium for Visitor Metrics

Posted by Tom on Tue, May 24 2011 10:01:00

Update: After I got a little further along with this very example in a real-life project, I realized that while it makes for a very simple illustration of map/reduce (one that I personally found helpful when learning how to perform map/reduce), it is not a very good real-life example. The reason? The document size limit in MongoDB. Doh! You couldn't store metrics like this, because every visitor lives inside one document and that document would eventually outgrow the limit. However, ignoring the purpose of this, you can still continue reading about how to perform a map/reduce within Lithium.

Original Blog Entry

I'll start off by saying I love MongoDB, and map/reduce too, after putting it off for some time. I dreaded learning the map/reduce functions big time. It turns out it's not that bad. A friend asked me to explain it in 10 words or less. So I did. It's not really all-encompassing of the features, but it's a really good example of what map/reduce can do for you.

Use JavaScript to identify/"map" data to loop it to aggregate/"reduce."

Ok, so that's 12 words technically. I cheated by adding slashes and combining two words. It's also really poor grammar. Anyway, that's the idea. In this example, I wanted to collect information about visitors on a web app. Obviously I'm not a masochist; I'd use Google Analytics if I could... Sadly, I could not. So what to do? Well, we can use MongoDB to record all this data and then use map/reduce to get some totals.

I may eventually turn this into a Lithium library (especially because there's a good browscap and language detection class that I'm not illustrating here), but for now I'm going over things at a high level and focusing on the actual map/reduce process.

That said, imagine a data set like this:

"metrics": {
    "pageviews": 63,
    "visitors": {
      "192-168-126-1": {
        "ip_address": "192.168.126.1",
        "browser": "Chrome",
        "browser_major_version": 11,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      },
      "192-168-126-2": {
        "ip_address": "192.168.126.2",
        "browser": "Firefox",
        "browser_major_version": 4,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      },
      "192-168-126-3": {
        "ip_address": "192.168.126.3",
        "browser": "Chrome",
        "browser_major_version": 11,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      }
    }
}

Now, this "metrics" field can live wherever you like, but in my case it sits on a document that contains some other information. Why not a separate "metrics" collection? We could do that, and then we could also put in things like page URLs that were hit on the site to start getting analytic information about the pages on our site. In my case, I just wanted to get a sense for some high level information about my visitors. For now.

So the first thing here that you'll notice (and I've written about the $set operator before) is that each IP address is the key for each entry. The dots have been replaced with dashes so that it works as a key. Otherwise, MongoDB would treat each dot in the $set path as another level of nesting and I'd have a pretty deep object on my hands.

So each time a page is loaded the pageviews count goes up and the visitor's browser information is captured using $set so that if the user from the same IP address came back again with a different browser, it would update. My metrics would not be skewed. Yes, it's sad that we don't realize when/if the user actually uses two different browsers...More sad that we're likely counting entire office buildings as one user, but that's just how the cookie crumbles in this case.
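As a rough sketch of what that update could look like, here is a small JavaScript helper that builds the update document for one page view. The helper name buildVisitorUpdate is hypothetical, and using $inc for the pageviews counter is an assumption (the post only says the count "goes up"); the field names come from the data set above.

```javascript
// Hypothetical helper: builds the update document for a single page view.
// Assumption: the pageviews counter is bumped with $inc.
function buildVisitorUpdate(ipAddress, visitorInfo) {
  // Dots in a $set path mean "go one level deeper", so swap them
  // for dashes: "192.168.126.1" becomes "192-168-126-1".
  var key = ipAddress.split('.').join('-');

  // Keep the original IP address inside the entry itself.
  var visitor = Object.assign({ ip_address: ipAddress }, visitorInfo);

  var update = {
    $inc: { 'metrics.pageviews': 1 },
    $set: {}
  };
  // $set overwrites the entry, so a returning IP with a new browser
  // replaces its old record instead of adding a second one.
  update.$set['metrics.visitors.' + key] = visitor;
  return update;
}

var update = buildVisitorUpdate('192.168.126.1', {
  browser: 'Chrome',
  browser_major_version: 11,
  operating_system: 'Win7',
  mobile_device: false,
  primary_language: 'en-us'
});
console.log(JSON.stringify(update, null, 2));
```

The resulting object is what you would pass as the update part of an update call against the project document.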

Ok, so we have that data and we have some controller action in our Lithium project that's going to return to us an array that we'll pass to the view template to make some pretty pie charts. Why not pie charts? I love pie charts, they give everyone a sense of satisfaction that looking at numbers is really fun! ...Or something like that.

We'll dive right in. Here's the entire action I'm using with the map/reduce code. Note that Lithium's MongoDb adapter does not have any options for map/reduce in the find() or any other method. I may write something in the future for that myself if I end up doing enough of these (and I likely will). However, we can make straight up command() calls from it.

    public function metrics($url=null) {
        if(empty($url)) {
            return false;
        }
        
        $db = Project::connection();
        
        // construct map and reduce functions
        $map = new \MongoCode("function() { ".
            "emit(this.metrics.visitors, this.metrics.visitors);".
        "}");
        
        $reduce = new \MongoCode("function(k, vals) { ".
            "var visitors = vals[0];".
            "var unique_visitors = 0;".
            // plain objects hold the name => count tallies, arrays hold the output lists
            "var b_counts = {};".
            "var browsers = [];".
            "var os_counts = {};".
            "var operating_systems = [];".
            "var mobile_devices = 0;".
            "var ln_counts = {};".
            "var languages = [];".
            
            // loop all the emitted visitor metrics to aggregate some data
            "for (var i in visitors) {".
                // count browsers
                "if(typeof(b_counts[visitors[i].browser]) == 'undefined') {".
                    "b_counts[visitors[i].browser] = 0;".
                "}".
                "b_counts[visitors[i].browser] += 1;".
                
                // count operating systems
                "if(typeof(os_counts[visitors[i].operating_system]) == 'undefined') {".
                    "os_counts[visitors[i].operating_system] = 0;".
                "}".
                "os_counts[visitors[i].operating_system] += 1;".
                
                // count the primary languages
                "if(typeof(ln_counts[visitors[i].primary_language]) == 'undefined') {".
                    "ln_counts[visitors[i].primary_language] = 0;".
                "}".
                "ln_counts[visitors[i].primary_language] += 1;".
                
                // count the number of mobile devices
                "if(visitors[i].mobile_device == true) {".
                    "mobile_devices += 1;".
                "}".
                
                // count the number of unique visitors
                "unique_visitors += 1;".
            "}".
            
            // loop browsers counted and set for output
            "for (var x in b_counts) {".
                "browsers.push({ name: x, count: b_counts[x] });".
            "}".
            
            // loop operating systems counted and set for output
            "for (var x in os_counts) {".
                "operating_systems.push({ name: x, count: os_counts[x] });".
            "}".
            
            // loop languages counted and set for output
            "for (var x in ln_counts) {".
                "languages.push({ name: x, count: ln_counts[x] });".
            "}".
            
            // return the output
            "return { 'browsers': browsers, 'operating_systems': operating_systems, 'languages': languages, 'mobile_devices' : mobile_devices, 'unique_visitors': unique_visitors }; }");
        
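        $metrics = $db->connection->command(array(
            'mapreduce' => 'projects', 
            'map' => $map,
            'reduce' => $reduce,
            // only map the project that was requested
            'query' => array('url' => $url),
            // replace (not merge) so output from earlier runs doesn't linger
            'out' => array('replace' => 'mapReduceMetrics')
        ));
        
        // default to an empty array in case nothing was output
        $results = array();
        $cursor = $db->connection->selectCollection($metrics['result'])->find()->limit(1);
        foreach ($cursor as $doc) {
            $results = $doc['value'];
        }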
        
        // Get the total page views for this project
        $pageviews = Project::find('first', array('fields' => array('metrics.pageviews'), 'conditions' => array('url' => $url)));
        $results['pageviews'] = $pageviews->data('metrics.pageviews');
        
        return $results;
    }

Yeah, it's not the prettiest to look at. It's my first run through and it's literally based off an example from php.net, so that's why all those lines are concatenated together like that. I wouldn't normally do that. Nor would I use heredoc... but something a little nicer, at least single quotes instead of double. Anyway, with that you get back a nice array (in $results) that shows all the counts for browsers and such. Note that I did not take the browser major versions into account in this example. Also note that I separately stored a pageview count on the document, which does not require a map/reduce to retrieve.

Now let's look at it deeper. There are a lot of good articles on map/reduce, and if you spend time with them they should be pretty clear. Here is a good one. Then you can also look at the MongoDB Cookbook site's example. Also php.net's example. You'll see that you can use map/reduce for many things. Let's go over how I'm using it.

First, the map function. Pretty simple. In fact, you likely wouldn't do what I'm doing here. The idea is to emit a key and a value for each document in a given collection. The keys don't have to be unique: everything emitted under the same key gets grouped together and handed to the reduce function. In my case I'm emitting metrics.visitors as both the key and the value, so it's the key and also the data that I need. More typically you'd pick something small to group by as the key.

The reduce function. More complex, but it's all nice friendly JavaScript. Here you're just looping the values that are passed and simply counting some of them. As a disclaimer, my example could have probably been written a lot better and cleaner. I only loop once which is what I was concerned about mainly. The rest can be refactored later.
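For comparison, here is a sketch of a more conventional shape for the same counts, written as plain JavaScript you can run locally. It is not the code from the action above. The assumptions: each project document has a url field (the action queries on it), map emits one small partial tally per visitor under that url, and reduce returns the same shape it receives. That last point matters because MongoDB may call reduce more than once for a key, feeding earlier output back in, and may skip it entirely for a key with a single value. The emit() function and the sample document below are a local stand-in for what the server provides.

```javascript
// Local stand-in for the server: collects emitted values per key.
var emitted = {};
function emit(key, value) {
  if (!emitted[key]) { emitted[key] = []; }
  emitted[key].push(value);
}

// map: one partial tally per visitor, keyed by the project's url (assumed field).
function map() {
  for (var id in this.metrics.visitors) {
    var v = this.metrics.visitors[id];
    var partial = {
      unique_visitors: 1,
      mobile_devices: v.mobile_device ? 1 : 0,
      browsers: {}, operating_systems: {}, languages: {}
    };
    partial.browsers[v.browser] = 1;
    partial.operating_systems[v.operating_system] = 1;
    partial.languages[v.primary_language] = 1;
    emit(this.url, partial);
  }
}

// reduce: adds partial tallies together and returns the same shape.
function reduce(key, values) {
  var out = {
    unique_visitors: 0, mobile_devices: 0,
    browsers: {}, operating_systems: {}, languages: {}
  };
  var fields = ['browsers', 'operating_systems', 'languages'];
  for (var i = 0; i < values.length; i++) {
    out.unique_visitors += values[i].unique_visitors;
    out.mobile_devices += values[i].mobile_devices;
    for (var f = 0; f < fields.length; f++) {
      var tally = values[i][fields[f]];
      for (var name in tally) {
        out[fields[f]][name] = (out[fields[f]][name] || 0) + tally[name];
      }
    }
  }
  return out;
}

// Sample document shaped like the "metrics" data set earlier in the post.
var projects = [{
  url: 'my-project',
  metrics: { pageviews: 63, visitors: {
    '192-168-126-1': { browser: 'Chrome', operating_system: 'Win7', mobile_device: false, primary_language: 'en-us' },
    '192-168-126-2': { browser: 'Firefox', operating_system: 'Win7', mobile_device: false, primary_language: 'en-us' },
    '192-168-126-3': { browser: 'Chrome', operating_system: 'Win7', mobile_device: false, primary_language: 'en-us' }
  } }
}];

// Run map against each document, then reduce the grouped values.
projects.forEach(function (doc) { map.call(doc); });
var totals = reduce('my-project', emitted['my-project']);
console.log(JSON.stringify(totals, null, 2));
```

The counts come back as name-to-count objects rather than the lists of name/count pairs used above; converting them for the pie charts would be a small loop on the PHP side.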

At the end of whatever you decide to do with all that data, you'll return your values. I'm returning an object here with all the counts. Here's what PHP gets back in $results:

array
  'browsers' => 
    array
      0 => 
        array
          'name' => string 'Chrome' (length=6)
          'count' => float 2
      1 => 
        array
          'name' => string 'Firefox' (length=7)
          'count' => float 1
  'operating_systems' => 
    array
      0 => 
        array
          'name' => string 'Win7' (length=4)
          'count' => float 3
  'languages' => 
    array
      0 => 
        array
          'name' => string 'en-us' (length=5)
          'count' => float 3
  'mobile_devices' => float 0
  'unique_visitors' => float 3
  'pageviews' => int 256

...And there ya have it. What I would do next is actually cache this data so that each call to the action doesn't have to run the map/reduce, which could get quite expensive over time with a lot of data.
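The caching idea, sketched in JavaScript to match the map/reduce functions rather than in Lithium itself: a tiny in-memory cache with a time-to-live, keyed by project URL. Everything here is hypothetical (makeMetricsCache, the 60 second lifetime, the stand-in compute function); in a real Lithium app you would more likely reach for the framework's own caching.

```javascript
// Hypothetical cache wrapper: re-runs computeMetrics(url) only when the
// cached entry for that url is missing or older than ttlMs.
// `now` is injectable so the expiry logic can be exercised without waiting.
function makeMetricsCache(computeMetrics, ttlMs, now) {
  now = now || Date.now;
  var entries = {}; // url -> { value, expiresAt }

  return function (url) {
    var hit = entries[url];
    if (hit && now() < hit.expiresAt) {
      return hit.value; // still fresh, skip the expensive call
    }
    var value = computeMetrics(url);
    entries[url] = { value: value, expiresAt: now() + ttlMs };
    return value;
  };
}

// Stand-in for the expensive map/reduce call; it just counts its runs.
var runs = 0;
var fakeTime = 0;
var getMetrics = makeMetricsCache(
  function (url) { runs += 1; return { url: url, unique_visitors: 3 }; },
  60000,
  function () { return fakeTime; }
);

getMetrics('my-project'); // computes
getMetrics('my-project'); // served from the cache
fakeTime = 60001;         // pretend a minute has passed
getMetrics('my-project'); // computes again
console.log('map/reduce runs:', runs);
```

How long to keep the entry is a judgment call: the longer the lifetime, the staler the pie charts.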

Cool note: In this example you see the $metrics = $db->connection->command(...) part? Run a var_dump() on $metrics. It will have some handy information for you. It could tell you about an error when it comes to parsing your functions (though I'm not sure how to actually debug things, sorry). It also will tell you if everything was ok and ran successfully. You may wish to check this before returning data. It's on my to do list myself. Also, it will show you how long the operation took which is very handy. You might need/want to index some fields and cache results based on how long things are taking.
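A minimal sketch of that check, again in JavaScript for illustration. The field names (ok, errmsg, assertion, result, timeMillis, counts) are what I'd expect in the map/reduce command's response, but treat them as assumptions and confirm them against your own var_dump() output. The sample response below is made up, not captured from a server.

```javascript
// Hypothetical helper: throws if the command failed, otherwise returns
// the bits of the response that are handy to log or act on.
function checkMapReduceResponse(response) {
  if (!response || !response.ok) {
    var reason = (response && (response.errmsg || response.assertion)) || 'unknown error';
    throw new Error('map/reduce failed: ' + reason);
  }
  return {
    collection: response.result,     // where the output was written
    timeMillis: response.timeMillis, // how long the operation took
    counts: response.counts          // input / emit / output counts
  };
}

// Illustrative response only.
var sampleResponse = {
  result: 'mapReduceMetrics',
  timeMillis: 12,
  counts: { input: 1, emit: 3, output: 1 },
  ok: 1
};

var summary = checkMapReduceResponse(sampleResponse);
console.log(summary.collection, summary.timeMillis + 'ms');
```

In the PHP action the same idea would be a check on $metrics['ok'] before reading from the output collection.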

Another note: With map/reduce you're actually outputting to a collection. So you're going to pick up your results with another query to that (temporary or not so temporary) collection. This changed in MongoDB version 1.8.0: you now have to specify that 'out' key in the command() call. Here's more information on that.
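For reference, these are the shapes the 'out' value can take as far as I know, written as JavaScript objects. The collection name is the one used in this post; buildMapReduceCommand is a hypothetical helper, and the map and reduce strings are placeholders.

```javascript
// The forms 'out' can take (assumed from the MongoDB 1.8 docs):
var outChoices = {
  replace: { replace: 'mapReduceMetrics' }, // overwrite the collection on each run
  merge:   { merge: 'mapReduceMetrics' },   // keep old keys, overwrite matching ones
  reduce:  { reduce: 'mapReduceMetrics' },  // re-reduce new output with existing values
  inline:  { inline: 1 }                    // no collection, results come back in the response
};

// Hypothetical helper that assembles the command document.
function buildMapReduceCommand(collection, map, reduce, out) {
  // 'mapreduce' goes first: MongoDB reads the command name from the first key.
  return { mapreduce: collection, map: map, reduce: reduce, out: out };
}

var command = buildMapReduceCommand(
  'projects',
  'function() { /* map goes here */ }',
  'function(k, vals) { /* reduce goes here */ }',
  outChoices.merge
);
console.log(JSON.stringify(command));
```

With the inline form there is no second query to make, but the whole result has to fit in a single response document.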

Hopefully these snippets will be of some help to people. I didn't want to go too far in depth with explaining everything, I think there's other really good articles on that out there. My hope is that seeing an example, as it works within the Lithium framework, will be helpful.

