Shift8 Creative Graphic Design and Website Development

Data Mining: Determining Gender

Posted by Tom on Fri, Jan 27 2012 18:09:00

I recently had a talk with a person about collecting demographics from the web. He brought up some very good points and it was a nice conversation. One of the things that I keep going back to in my head days later is how he felt that a 33% accuracy rate on determining gender was terrible. Perhaps I wasn't clear enough about the fact that it was strictly from Twitter. I think I did mention that though.

Here's the deal folks. Getting demographic information from data mining the internet is a very weak game. Don't expect to be able to add up your findings for male/female (or even by location, etc.) and get 100%. That's plain silly and impossible. This guy talked as if 70% was common. On the internet? Hardly. I actually do not know a good number to shoot for, but as I continue my research and build Social Harvest I will know what that is.

I believe 33% for Twitter to be very good. Of course Facebook is going to allow a greater deal of accuracy. I'm assuming 100%, since Facebook actually asks for gender upon registration and displays it with the basic info for each user. On the other hand, Twitter doesn't ask for gender. Additionally, you can set your name to whatever you like.

So the problem is that many times you'll have company accounts with no name. That's just the most basic example of why you can never have 100% or even 70% accuracy. If you want to expand it beyond Twitter and look at comments on blogs...You can quickly see that your success rate is going to go down the tubes real quick.

Now, my question is this: Why do all the social media monitoring services such as Radian6, Sysomos, etc. give you results that add up to 100% for gender? They're flat out lying to you. Talking to a friend a while ago he didn't realize this at first. He said, "I don't know, but somehow they just know. Maybe they pay extra for that." ... No, sorry. So I quickly went into the Sysomos demo and pointed out from the very first page of results where a tweet from a user marked as male was actually female. That or they had gender re-assignment and that tool really did know something we didn't!

Why do they do this? My theory is they are afraid to tell customers that they don't know. There's gray areas out there and it's my belief that we should be aware of them...Because by randomly choosing male or female, we're actually skewing the results. It's far better to say, "of the 300 we know about, this many are male..." than it is to simply lie about it. People are trying to target ads based on this data and it's horrible to knowingly be inaccurate. There's always a margin of error and that's a different story.

So how do we determine gender? Well, I can't exactly spill all the beans...But there's actually some very, very good ways to do so. I'll give you a hint. There's some free databases available to you out there from big brother. When I say big brother, I mean the US government. That said, here's the obvious challenge. People named "Pat" and "Sam" are going to also be gray areas just as much as people on Twitter who do not give a first name. You have to put them in the uncertain category as well. It's unfortunate, but you have to.
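To make the name-lookup idea and the "uncertain" category concrete, here is a minimal sketch in Python. It is not the method Social Harvest uses. The counts table is invented for illustration (a real one might be built from a public first-name frequency list), and the function names and the 0.9 threshold are assumptions.

```python
from collections import Counter

# Hypothetical first-name counts as (female, male). These numbers are
# invented for illustration; a real table would come from a public dataset.
NAME_COUNTS = {
    "mary": (9900, 100),
    "john": (50, 9950),
    "pat": (5200, 4800),
    "sam": (3000, 7000),
}

def guess_gender(display_name, threshold=0.9):
    """Return 'female', 'male', or 'unknown' for a profile display name."""
    parts = display_name.strip().lower().split()
    if not parts or parts[0] not in NAME_COUNTS:
        return "unknown"  # company accounts, handles, missing names
    female, male = NAME_COUNTS[parts[0]]
    share_female = female / (female + male)
    if share_female >= threshold:
        return "female"
    if share_female <= 1 - threshold:
        return "male"
    return "unknown"  # ambiguous names such as Pat or Sam

def summarize(display_names):
    """Count guesses, keeping the unknowns instead of forcing a split."""
    return Counter(guess_gender(name) for name in display_names)

names = ["Mary Smith", "John Doe", "Pat Jones", "Acme Corp", "Sam Lee", ""]
print(summarize(names))
```

The point of the sketch is the third bucket: the unknowns are reported as unknowns, so the male and female counts never get padded to add up to 100%.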

What about advanced methods? Well, sure there are a few. You can actually analyze the text that people post and determine if it's male or female by their writing style. You can also try to grab page colors to factor into your probability and even use something like Face API to try to analyze profile pictures. I find photo recognition a very interesting thing. However, all of these clever attempts also are subject to a hefty margin of error. The profile photos are very small for Twitter and are typically poorly lit, etc. Additionally, many people don't even use a photo of themselves. You also can't rely on someone's writing style or interests. You may have a user screaming "I love Transformers and Star Trek!" all over, but you really can't count on them being male. Additionally, do you realize the task that's now put before you? All this work just to determine the gender of a single user who posted a single status update on Twitter. Think about doing that thousands or even millions of times over. You want those results sometime this decade, right?

Even if money was no object and you had several computers do this processing to offset the time it took...Even if you also went off and searched Google for people's names to see if you can find additional supporting images... You are still subject to a margin of error. The time and effort...The sheer cost is not worth it.

So I say embrace the gray areas of data. Understand them and know why they exist. In the case of gender, it's simply the nature of the internet. No one requires you to register and expose your identity on the internet. That's the beauty of it. If you're trying to gather demographics on the net, please keep that in mind. If you can't accept and understand that, then you probably also don't understand the internet well enough to be advertising or working with it in a professional manner for a job of some sort.

I will continue my research and hope to find ways to improve things beyond 33%. I have decades of data and clever algorithms to help me do that, but for now...I'm quite happy to have the most accurate system for determining gender from Twitter...That I've ever seen at least.

Stop SOPA

Posted by Tom on Thu, Jan 12 2012 22:23:00

If you haven't heard about SOPA, then please go and read this or simply Google it to find a bunch of other great sites and material. It's a really bad bill. That's obvious enough...Or so you'd think. There are actually quite a few people in Congress who don't quite realize the gravity of the situation. It truly is something that could threaten the internet all over the world - not just the US.

Many sites are considering a black out in order to protest. While it would send a very strong message, as an individual user what can you do? Or is there anything you can do without shutting down your site? Yes! For starters, signing petitions and writing your state's representatives. Of course you'll likely get a canned auto-reply via e-mail. Believe me, everyone is well aware of things by now. I just feel like, with how many people keep talking about how they get a canned response, our cries are falling on deaf ears. I feel like the message isn't getting through.

So, yes, actions speak louder than words (and e-mails). Sites shutting down in protest will make a difference, but I think so can each and every user on the internet as well. It's pretty common to have a profile picture on the internet, yes? Be it Facebook, Twitter, Google, or otherwise. Today I saw a friend on Twitter replaced his profile icon with a "STOP SOPA" in white letters with a black background. Big and bold. It was brilliant. I thought about it a bit and almost took the image and used it myself. Then I thought, well that might get confusing. Which is ok I suppose, it really could help drill in a point...But I then thought about a band. A band like one might wear in mourning. The truth is, if SOPA is successful, it would essentially kill the internet in many ways. So I designed a black band with "SOPA" on it. The word stop wouldn't fit diagonally and reproduce well at smaller sizes. I don't think it was necessary either given everyone associates a black band/ribbon with mourning...And black itself not being the color of joy. I believe it does a good job and it still allows me to keep a good portion of my profile picture. My last reasoning for this is also that even if your profile picture is shown at a tiny size (like certain areas on Facebook and mobile devices) a black diagonal ribbon across any profile image (regardless of legible print within it) could easily become a universal symbol for stop SOPA. A simple black square covering up everything would also work, but I just feel it doesn't look as pretty. :)

So if you don't want to go all out and black out your profile completely, you may want to consider doing the same. I'm providing a 150x150 pixel transparent PNG image that you can easily overlay on your profile picture. That should save you some time and trouble. I would like to make a generator as well, but I don't have the time right now. I may do so later. Though, I'd also be concerned about my server getting pegged if I did that. It would have to be a very shareable generator so that more people could easily get it on their site to help host and distribute load.

Anyway, you can download the image here and it is also displayed below.

Posted in general

Does the Twitter Firehose Really Matter?

Posted by Tom on Wed, Dec 28 2011 22:51:00

As I go through more and more research about data mining the social web, I keep coming back to this question. So far, I've been extremely thorough in my research and analysis of data. I assumed, like many, that getting every bit of data was one of the most important things to social media analysis.

I think I'm starting to go back on this thought. I'm beginning to realize that the Twitter firehose is overrated. It's actually a super brilliant gimmick is what it is. Imagine this: $0.10 per thousand tweets. How much money must Twitter make off its own buzz? Wow. Reports of Google paying as much as $15 million to drink from the hose! That's incredible.

There are clear benefits to having all this information available to you. Absolutely. No question about it. However, everyone believes that they need it when that simply isn't true. Services like Radian6 and such will advertise that they have access to this magical hose. Does this make them the best social media monitoring and analysis service? No, far from it.

Why? Why wouldn't having more data be better for accuracy? It simply paints a broader picture for you, it doesn't paint a more accurate picture. When it comes to accuracy, there's a lot to consider and I will be posting some blog posts (referencing back to some research I've been doing with various algorithms when I can) about just that in the future here. So that's a really long question to answer. Let's just start with the common objection. This is simply that if you can't see all of the data, then you don't know what everyone is saying and if the 80% that you don't see has negative things to say, while the 20% you can see has positive things to say... Where does that leave you?

Incorrect thinking. What Twitter returns to you is a sampling and do you (does any statistics buff out there?) really know what the probability is that you got the 20% of positive statements? Yea... Think about that.

Here's how Twitter works by the way. You can look it up in their API documentation even. What Twitter returns to you is the most popular tweets. If anything, these are the more relevant tweets. They should carry more weight than those that you might be missing, right? Simply because more people are reading those.

So for those two reasons, it's pretty easy to start doubting the value of the firehose for data analysis surrounding a particular subject matter. However... Yea, you knew that was coming. However, it all depends on the application. If you are trying to determine what people think about your brand, then the hose doesn't matter. If, on the other hand, you are trying to build a social media search engine... Then yes! Yes, that hose matters greatly! If you are trying to determine every single user's mood globally, then yes, it matters. However, if you're simply trying to get an idea for what is being shared about a very particular subject, then it doesn't matter.

Think of this analogy. A wine maker does not drink the entire barrel to get a taste for the wine; he uses a thief to siphon a tiny sample out to taste. This is probably the most important analogy you can ever take away when it comes to social media analysis, metrics, and statistics in general.
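The wine thief idea can be put into numbers. Below is a small Python sketch, assuming a simple random sample and the usual normal approximation for a proportion. The "barrel" is simulated, and the 30% negative rate and the sample size of 2,000 are made-up figures for illustration.

```python
import math
import random

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Simulated "barrel": 200,000 tweets, about 30% of them negative (made up).
random.seed(42)
population = [random.random() < 0.30 for _ in range(200_000)]

# The "thief": a random sample of 2,000 tweets out of the barrel.
sample = random.sample(population, 2000)
p_hat = sum(sample) / len(sample)
moe = margin_of_error(p_hat, len(sample))

print(f"sample estimate: {p_hat:.3f} +/- {moe:.3f}")
```

The margin of error depends on the size of the sample, not on the size of the barrel. The catch is that this only holds when the sample is random; a sample skewed toward certain tweets needs more care.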

Second, the firehose actually comes in several flavors. Many people don't realize that. So guess what? Those using the firehose are still likely to miss data. Think about it. Think about how much data gets passed around the internet. Even for a particular search query (given that it's broad). There is going to be data that slips through the cracks. Here's another important thing to keep in mind. How many people out there are listening to every single tweet? I bet only our monitoring tool. So yes, your monitoring tool is influenced in one way based on all the tweets out there period (assuming you can actually get them all). However, the users of the internet do not see all of those tweets. So they are actually influenced in another way. I'm sure there's some clever math to explain what that means, but I hope you get the picture.

So since the hose comes in several flavors and is very expensive...I'm going to wrap up this blog post with one more question you may want to ask yourself. Is it worth it? Look at services like Gnip and Datasift. Any of the social media monitoring tools you see can use those services if they wanted. However, they will need to, in turn, charge the client for the cost. Looking at DataSift, it seems pretty absurd if you use their pricing calculator. You can collect 1 tweet per hour for $144.07 per month. What?! Ok, clearly that's never going to be. Let's be a little more real, ok? How about 1,000 tweets per hour. That costs $216/mo. That's still too much. Did you know that you can get that many tweets (at least if not more) within an hour for free? You don't need the firehose at all. Ok, I'll go with an example that really displays why one might need the firehose. Call it 75,000 or even 100,000 tweets per hour. Ok, now this is beyond the standard Twitter API. How much does it cost? $5,500 to $7,300...More actually since I'm assuming one "DPU" isn't enough to handle that amount of data. Funny they are set up like a hosting provider with their sliders on the pricing page and how they price out and even have the cute "DPU" gimmick. Anyway, I do think they are offering quite a neat service.
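To compare those quotes on equal footing, here is a rough Python sketch that converts each one to a cost per thousand tweets. It assumes a 30-day month and a steady hourly rate, and it pairs $5,500 with 75,000 tweets per hour and $7,300 with 100,000, which is my reading of the range above rather than an exact quote.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month

def cost_per_thousand(monthly_cost, tweets_per_hour):
    """Dollars paid per 1,000 tweets at a steady hourly rate."""
    tweets_per_month = tweets_per_hour * HOURS_PER_MONTH
    return monthly_cost / tweets_per_month * 1000

# (tweets per hour, monthly cost in dollars); the last two pairings are assumed
quotes = [(1, 144.07), (1_000, 216.00), (75_000, 5_500.00), (100_000, 7_300.00)]

for rate, cost in quotes:
    print(f"{rate:>7} tweets/hour: ${cost_per_thousand(cost, rate):.2f} per thousand")
```

Under those assumptions the two high-volume quotes land at roughly ten cents per thousand tweets, while the 1 tweet per hour quote works out to about $200 per thousand.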

Ok, so to the point. Is, let's say, $5,000 worth it to you to gather all of these tweets? Really? Is it really? How about all the other areas of the web that you have yet to gather data for as well? Or did you forget about those? Is Twitter the only thing that matters? You're so wrapped up right now on the buzz of "big data" and how important and coveted and rare this firehose is. All the fancy numbers and pieces of information out there whizzing by at as much as three quarters the speed of light (look it up). For what? Why? So you can pay too much money to still not get a complete picture of the social web? What's the value you're getting from all this?

So that's really the question to ask. When you start paying thousands and hundreds of thousands and millions in metrics...In such a short period of time...I truly believe it's not worth it. Only so many companies can continue to pour that kind of money into this. So does the firehose matter (if you want to know what people are saying about "X")? I say no. What say you? Why? Leave a comment! Would love to hear back from some people and stay tuned for more!

Posted in Technology

New Year's Resolution: New Blog

Posted by Tom on Sun, Dec 11 2011 10:24:00

I love Croogo, it's definitely a very solid CMS. However, I'm not really working with CakePHP any more. So I think it's time I moved on. Fahad has done a wonderful job and I encourage you all to check out his work on Croogo if you get a chance. Especially if you use CakePHP or are looking for an amazing blog to put on a shared hosting account (where something like CakePHP easily runs, as opposed to say Lithium because many hosting providers still, grrr, have not updated to PHP 5.3+).

Additionally, I never invested a whole lot into the design. While I wanted to keep my blog simple, clean, and heavily type oriented...Ah...Come on, I'm an SVA grad. I can do a bit better, yes? Haha, yea... So that's reason number two. Not that I couldn't make a new design under Croogo, but while I'm at it...Why not?

Last, I want to have another excuse to work on my CMS, Minerva. It's still in a very early beta state. Though I am using it for some things in production, I wouldn't recommend that be the case for anyone else. It's fine, it works...But, it's confusing. I will be cleaning that up soon now that I have a clearer vision for its design. Before, it was just kind of a collection of ideas and various directions. I now have more focus with it. Hopefully with the help of some people, it will be in a good spot soon.

Since I'm perpetually swamped and get involved in too many projects for my own good, this will take some time. So I'm just going to call this a New Year's resolution. :)

Additionally, I'm hoping to also start posting more of my research on my blog and keep up with blogging. If I can get some more visitors, interest, and even carve out specific time for...Or somehow monetize my blog, then I'll spend more time with it. I hate to throw ads on. I won't really. Period. I'm gonna promote people and companies, definitely...Like my love of Rackspace and MongoLab, but I won't throw ads all over.

Actually, if there's anything you all would like to see more of on my blog...Please be sure to leave some comments about that!

Machine Learning in MongoDB

Posted by Tom on Sat, Dec 03 2011 12:19:00

I'm very excited to be speaking at MongoSV on Friday, December 9th about some of my research on machine learning in MongoDB. I've implemented a naive bayes classifier within MongoDB and it works quite well. I will post a good write up (and slides) about that later.
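For anyone who hasn't met the algorithm before, here is a minimal sketch of a naive Bayes text classifier in plain Python. It is not the MongoDB implementation from the talk; the class name, the whitespace tokenizer, the add-one smoothing, and the training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.doc_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("love this great product", "positive")
nb.train("great service really happy", "positive")
nb.train("terrible support very slow", "negative")
nb.train("hate this awful product", "negative")

print(nb.classify("really great"))
print(nb.classify("awful and slow"))
```

Everything the classifier needs boils down to per-label counts, which is the kind of data that is easy to keep in documents.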

I wanted to leave a blog post for people to sort of list out some of the things I'll be blogging about in the near future here. More than just machine learning algorithms, there's also some other data mining and indexing algorithms that I'm running within MongoDB that I want to discuss. While I'm not a mathematician or expert in statistics...I have been able to dissect enough of that crazy math to get me where I need to be for my goals and apps at hand.

So the question that keeps driving me is, what kind of creative things can one do with MongoDB? Mongo offers a lot of great features and the 10gen team is hard at work adding more and improving existing features (along with the all important performance improvements).

Some of the things I'll be blogging about in the future include running algorithms like the naive bayes classifier inside MongoDB as well as:

  • Other text processing algorithms and methods such as stemming
  • Internal, stored JavaScript within MongoDB and benchmarking it to determine when you may want to do it and when you don't
  • Implementing the nearest neighbour algorithm in MongoDB
  • How about a search engine in MongoDB? What about stored JavaScript that is responsible for indexing other documents to later be searched for?
  • Playing around with the new ability of multiple geo-spatial indexing per document and what that can do for us
  • ...then maybe some more crazy stuff like trainable neural networks (farther down on my list of research items, but way cool)
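As a small taste of one item on that list, here is a brute-force nearest neighbour lookup sketched in plain Python (3.8 or newer, for `math.dist`). It does not run inside MongoDB, and the labels and two-feature vectors are hypothetical values chosen only to show the idea.

```python
import math

def nearest_neighbour(query, points):
    """Return the (label, vector) pair closest to query by Euclidean distance."""
    return min(points, key=lambda item: math.dist(query, item[1]))

# Hypothetical documents reduced to two numeric features each.
points = [
    ("sports", (0.9, 0.1)),
    ("politics", (0.2, 0.8)),
    ("tech", (0.5, 0.5)),
]

label, vector = nearest_neighbour((0.8, 0.2), points)
print(label)
```

This compares the query against every stored point, which is fine for small sets and is the baseline any indexed approach would be measured against.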

So stay tuned! As always, I'm super swamped with work...But this weekend, I've managed to set some good ground work for easily storing JavaScript within MongoDB using PHP and the Lithium framework. I've also started playing around with the Porter stemmer algorithm within MongoDB. Likely for that, especially when using PHP...Using the pecl extension is going to be better. However, it's good to see what we can do in MongoDB.

I'll leave you all with some parting gifts here...The knowledge and research of others.

Here is a great article on stored procedures in MongoDB with PHP.

Here is another geared toward Python.

An example of nearest neighbour in PHP.

The FoundersCard

Posted by Tom on Thu, Oct 06 2011 11:57:00

I can not begin to explain how beneficial this card is. What is it? It's a membership card that gives you some really sweet discounts on various business (and lifestyle) related expenses. Blows American Express' benefits out of the water. While you can't use it to purchase anything, it's an awesome card to have.

Is there a cost? Yes, there is. They are running a promotion right now ($249 per year locked in) instead of the $499. Is it worth it? Yes. I believe at the $249/year it's worth it. I would personally start to wonder at $500/year. It is conceivably still very worthwhile at $500, but it depends on how much travel you do.

Your biggest savings? Travel. Depending on the airline, you can save a good bit and get access to priority boarding, etc. Things you would eventually get over time if you carried a bunch of frequent flyer cards. However, it's the hotel savings that really help. There's some pretty massive discounts.

The FoundersCard site claims on some hotels as much as 60% off...But I went to the hotel's site and discovered that wasn't true. Perhaps the "normal" price was taken at the height of some season...But, there is still a real good discount of anywhere between about 10% and 25% (sometimes more). So when you go somewhere and stay for say 3 or 4 days...It's basically like getting a free night. If it costs a couple hundred to stay...Well then...You just covered your annual membership cost in one trip.

So yes -- the FoundersCard is very worthwhile. Additionally, there's discounts on things like AT&T wireless plans, Rackspace hosting, and a bunch of other stuff. Discounts on things like flowers and spas, etc. There's plenty of lifestyle benefits. Rental cars too.

Then they have networking events. Given that this is highly targeted at startup companies, most are tech related. So if you're in the internet industry, these are great networking events to attend.

How do you get it? Well, it's by invitation only. So you need to know someone who has one. Guess what? I have one! So you can get in touch with me if you'd like an invite. I'd seriously consider it and I would get on it really soon while they have the promotional pricing.

The thing that stinks is you can't see all the benefits until you've been invited. Once invited, you can review all the benefits and then decide to sign up or not.

Posted in Off-Topic, general

Migrated to Rackspace

Posted by Tom on Tue, Sep 27 2011 17:55:00

Well, Slicehost is coming to an end. I think many people are sad about that; if they aren't...They should be. It was an amazing hosting company. So amazing that mega hosting company Rackspace purchased them a while back. If you don't migrate yourself, they will migrate for you...To Rackspace! That's the good news.

I've been avoiding the inevitable migration from Slicehost to Rackspace... But over the past few days here I have done it. I love Rackspace just as much as Slicehost and would always recommend them to everyone out there.

What does that mean? Well, depending on how much bandwidth you run through, you will likely be paying a little less per month for hosting. The control panel does not have as many features as Slicehost's did...Though the design is a little neater if you're a fan...I'm a minimalist (w/o things being fugly), so I like Slicehost's manager design better...But Rackspace does have a cool API.

I've used many hosts over the years, for myself and for clients...You name it, I've likely tried it...Slicehost, Rackspace, VPS.net, MediaTemple, Lunarpages, AptHost, HostGator, GoDaddy (yuck - please stick to domain names only), and many many more... Even back in the day, geocities! Yea! haha. So at the end of a lot of banging my head on the desk, if you are out there looking for a good host, I would absolutely recommend Rackspace. Best hosting company I have ever seen hands down. I really do bet I've seen (and worked with) more than your average bear too.

That said. Hosting. What is the hosting landscape in 2011? Are you on a shared host? Get off. Now. If you're a web designer/developer then you should be on a VPS by now. Cost was a pretty hard thing to get around, but Rackspace has a 256MB slice (Slicehost terminology, I mean cloud server size or whatever they call it) and it's going to run you around $12/mo. With bandwidth maybe a little more like $15 tops. I was around long enough to see shared hosting dip to like $7/mo. I have no clue what it's like now...It should be free. Regardless. Switch to a VPS. Sure, you'll need to set up the server from scratch, but there's enough tutorials out there and the only way for you to progress as a web developer is to tinker with a VPS.

When you let technology and your web server limit what you can do...You limit what you can learn and do as a web developer.

Rackspace has some really great servers with their cloud server offering. They also have cloud sites which as far as I'm concerned basically replaces shared hosting as we know it.

What else in 2011? Well, even in 2010, likely 2009, we have had this really cool new thing: PaaS. "Platform as a Service" ... Things like OpenShift from RedHat, CloudFoundry from VMWare (hey, they are down the road from me!), Orchestra.io, and a whole ton of Ruby services. These services let you deploy web apps (the cool word for sites when they do more than just show a web page) in the "cloud" (the cool word for having essentially mirrored copies of your site on multiple servers so that, should one perchance go down, your site is still visible to the world, plus scalability or, in other words, "many hands make light work").

These services are great because you just run a command (I imagine it's not long before some IDE has it built in, if not already) and voila! Your site is out on the internet.

The future is cool. Well, the present...Sorta. A lot of these services still can't quite get a grasp on the beast that is PHP. Orchestra.io by far is the best for PHP. The other PaaS' have varied support for PHP when it comes to MongoDB. They'll get there though.

So, just as I have migrated...This little blog post is a reminder that you too should probably take a look at your hosting situation and think about a little spring ... er late summer or fall cleaning. I can't tell, I'm all out of whack, where I live spring felt like winter and now late summer/fall feels like mid-summer.

MongoDB in the Cloud

Posted by Tom on Sun, Sep 18 2011 12:58:00

So I'm growing tired of configuring servers. I simply don't have the time...Between designing sites/apps and then actually coding them on top of all sorts of project management, dealing with clients, getting paid, etc. There's just no time with all the projects I work on. So I'm looking more and more toward various PaaS solutions (platform as a service). I've been looking at RedHat's solution as well as VMWare's CloudFoundry and Orchestra.io as well. Only Orchestra has support for PHP currently (aside from RedHat's service) as well as support for MongoDB with PHP. I'm sure I'll have a comparison/review/my two cents for those services later.

However, today I'm going to talk about two hosted MongoDB solutions that I've come across. When I was at the MongoSV conference the other year I met some of the vendors there and MongoHQ really stuck in my head. There was also another company there, Mongo Machine. They go about their pricing differently. Mongo Machine is more cost up front so I haven't tried it to be honest. I'm not sure I will either. If I'm at the point of putting in that much money, then I'm going to just host my own database on my own servers. 

A side note. Hosting your own MongoDB (single or cluster) is going to likely yield better performance, especially with your cost/performance ratio. Plus you gain control over what your server setup is like. So there's two reasons why I'd suggest one of these services. First, it's amazingly simple to set up and manage and you don't need to worry about scaling. So convenience is number one. The second reason is more of a scenario. When you don't need super performance because your traffic isn't as high or maybe you're not doing anything as intensive (lots of map/reduces, etc.), it's probably a very good idea to use one of these services. Think about your own personal website. It's likely that you can have a free database solution for your own blog or something. Pretty cool!

Back on point. I've signed up for and have created two databases on MongoHQ and also MongoLab. MongoLab either was not at the MongoSV 2010 conference or I didn't see them...But I like them. So I'm going to compare these two services because they are extremely similar in pricing models, interface, and everything. 

Getting Started
It took all of 5 minutes to set up both services. For MongoHQ I had to enter a credit card number; for MongoLab I did not.

Configuration
I'm going to talk about Lithium here since that's what I use for a PHP framework (and you probably should too, if I can be an advocate for a few seconds here). Setting up Lithium to use MongoDB on MongoHQ and MongoLab was easy...Once you know what your config array should look like and also once you realize that you have to raise the default timeout from 100ms to something higher, like a few seconds. The port number has to be in the host key value. The login key has to be set as well as the password key. This process was identical for both services.
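The same pieces (host with port, login, password, and a longer timeout) can also be expressed as a standard MongoDB connection string. The sketch below is Python rather than a Lithium config array, uses only the standard library, and every value in it is a placeholder; `build_mongo_uri` is a hypothetical helper name, while `connectTimeoutMS` is a standard connection string option.

```python
from urllib.parse import quote_plus

def build_mongo_uri(host, port, database, login, password, timeout_ms=5000):
    """Build a MongoDB connection string with credentials and a connect timeout."""
    # Credentials are percent-encoded so characters like '@' and ':' stay safe.
    credentials = f"{quote_plus(login)}:{quote_plus(password)}"
    return (
        f"mongodb://{credentials}@{host}:{port}/{database}"
        f"?connectTimeoutMS={timeout_ms}"
    )

# Placeholder values; substitute the ones your provider gives you.
uri = build_mongo_uri(
    host="db.example.com",
    port=27017,
    database="blog",
    login="app_user",
    password="p@ss:word",
    timeout_ms=5000,
)
print(uri)
```

Most MongoDB drivers accept a URI in this form, so the timeout and credentials travel together in one setting.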

However... Here's an interesting difference. MongoLab has several options when it comes to "where" you're hosting your database(s). You can choose to use Amazon EC2, Rackspace Cloud, or Joyent. This is a major point to MongoLab over MongoHQ. The reason is because one of my gripes is with performance. This is mainly due to hostname lookups and such. If you so happen to be using Rackspace for your hosting (again, I'm going to be a fan for them) you can use the private IP that's within their network! This should (I haven't tried it because I'm on Slicehost and have yet to move my server) help with the timeout setting that you just had to increase in your configuration. I imagine the same goes for Joyent and Amazon EC2 when choosing those locations for your database as well.

MongoHQ uses Amazon EC2 exclusively and you do not get to choose. I'm not sure which region or how it's setup, but I imagine if you also use Amazon EC2, you may get a performance bump when using MongoHQ when it comes to connecting to the database.

Pricing
The pricing models are the same. They both offer a free tier, but, what you get for what you pay is much different between the two. Not incredibly different...Except for when it comes to the free tier. You can likely run an entire blog/personal web site off MongoLab for free because they give you 240MB for free while MongoHQ gives you 16MB for free. Both then have similar plans, but not identical. You're talking about $5/mo differences here and there depending on which tier you fall into between the two services. Nothing to worry about.

It's only when you get into needing replication that things start to get different. MongoLab gives you replication on their plans that cost money. MongoHQ has it available, but you have to pay $300/mo to get it. MongoLab you get it even with their $10/mo plan.

Both have backups, that's cool. Both seem to be monitored, etc. What I'm not sure about is if MongoLab has you on a dedicated instance. MongoHQ lets you know that you are when you hit the mid range to higher end plans. The only thing MongoLab has is their dedicated plan which has a variable cost and I imagine you'd need to get in touch with them to price that.

So I'm not sure what that all means, but it could definitely affect performance. Maybe MongoLab can shed some light on that...Or maybe because MongoLab has it all replicated, the dedicated instance per account isn't as important because they are scaling with MongoDB's features.

I have to say neither wins the pricing category. They are comparable, but if I had to choose...I'd say MongoLab because their free tier is better.

Features
I have to say that I think MongoLab is going to have better features here. 

Update: Anyone who previously read this section would have seen info about how MongoLab offers replication. Which it does, but I have been informed it is not replica-sets. Meaning the failover is not automatic. Replica-sets are offered on their higher tier plan. This is more consistent with MongoHQ. Not knowing much about the internals of both services, I can't say if one is better than the other when it comes to scaling.

Here's another important note. MongoHQ seems to be running on an old version of MongoDB depending on which pricing plan you choose. You could be on 1.6.x or you could be on 1.8.x. 1.6.x is a bit old considering 2.0 is now out. MongoLab uses 1.8.x for everything. The "micro" instance on MongoHQ runs a 32-bit instance of MongoDB while the rest are 64-bit. MongoLab appears to use 64-bit for everything, but I could be wrong. They don't explicitly state that anywhere. Do you "need" 64-bit? No, not for 16MB or 240MB of storage.

Both services have a nice interface for browsing and even editing documents in your database. I love both. They have import/export features and it's everything you'd want. I don't think either service is better than the other when it comes to your database browser. MongoLab has the whole dark theme thing going while MongoHQ has a light design...If that matters to you. The important thing to note here is that MongoHQ allows you to hook up any database to their GUI. You can't do that with MongoLab. It's minor, but a cool feature.

Both services also have a REST API. This is neat for mobile apps, etc. Situations where you don't have access to a MongoDB driver...Which I think is pretty rare, but you never know. However, what MongoLab's API doesn't seem to do (which I wish it did) is allow you to deploy new databases. Being able to set up new databases via an API might make for a very nice solution when it comes to certain applications. For example, you may wish to create a service where every user who signs up pays you and gets their own database. That's good for security reasons, and it also lets you track each user's usage so you can, in turn, charge them money to recover your hosting costs. Automation on that would be great. MongoHQ appears to let you do that with their API.
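
Here's a rough sketch of what that per-user setup could look like on your end. I'm not showing either service's real API here. The create_database argument is a hypothetical callable that would wrap whatever provisioning call your host gives you, and the naming scheme and registry dict are just assumptions for illustration.

```python
import re


def database_name_for(user_id, prefix="customer_"):
    """Derive a safe, predictable database name from a user id."""
    # Replace anything that isn't a letter, digit, or underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", str(user_id)).lower()
    return f"{prefix}{cleaned}"


def provision_user(user_id, create_database, registry):
    """Create a database for a new user and record it for usage tracking.

    create_database is a hypothetical callable that wraps whatever
    provisioning API the hosting service exposes.
    """
    name = database_name_for(user_id)
    if name in registry:
        return registry[name]  # already provisioned, nothing to do
    create_database(name)
    registry[name] = {"user_id": user_id, "database": name, "bytes_used": 0}
    return registry[name]


# Stand-in for a real API call so the example runs offline.
created = []
registry = {}
record = provision_user("Jane.Doe@example.com", created.append, registry)
print(record["database"])
```

The registry is where you'd keep each user's usage so you can bill them later. In a real app that would live in a database, not a dict.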

Both services show you your database stats as well. I think MongoLab presents them with a little more helpful info which is good if you're new to MongoDB. MongoLab allows you to profile things as well whereas MongoHQ doesn't have anything like that built into their GUI. I imagine you could code your own profiling tools within your app though.
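
If you did want to roll your own profiling in your app, a minimal sketch might look something like this. It just times whatever block you wrap and keeps the numbers in memory. The labels, the 100 ms "slow" threshold, and the sleep that stands in for a query are all assumptions on my part, and it has nothing to do with either service's GUI.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Collected timings, keyed by a label you choose for each query.
timings = defaultdict(list)


@contextmanager
def profiled(label, slow_ms=100.0):
    """Time the wrapped block and record the result under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings[label].append(elapsed_ms)
        if elapsed_ms >= slow_ms:
            print(f"slow query: {label} took {elapsed_ms:.1f} ms")


def summary():
    """Return (label, call count, average ms) rows, slowest first."""
    rows = [(label, len(v), sum(v) / len(v)) for label, v in timings.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Usage: wrap any database call. time.sleep stands in for a real query here.
with profiled("posts.find_recent"):
    time.sleep(0.01)

print(summary())
```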

Overages
There are "soft" limits or overages that you can run into with both services. They both seem to be fair with this, but you need to be on top of your databases and if you are moving out of one tier, you need to switch your plan to the next tier. However, it's important to note that MongoHQ only has a soft limit on their high end plan. On their other plans the limit is hard, which means your database will not accept any more writes until you upgrade. I can only imagine the same is true for (just) the free plan on MongoLab. Given that you don't need to enter a credit card, I'm not sure how they would allow you to just keep using more and more. With MongoLab's other tiers, you get charged overages until you can switch plans.

Nothing to really worry about with either service, but I think MongoLab handles things a little bit more nicely in case you're one of those people who don't really pay attention to your database and how it may be growing in size.
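
If you are one of those people, a tiny check like the one below could save you from hitting a limit by surprise. This is only a sketch. The 80% warning threshold is an arbitrary assumption, the usage numbers are made up, and in a real app you'd pull the used size from something like MongoDB's dbStats command instead of hard coding it.

```python
MB = 1024 * 1024


def quota_status(used_bytes, limit_bytes, warn_ratio=0.8):
    """Classify usage against a plan limit as 'ok', 'warning', or 'over'."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = used_bytes / limit_bytes
    if ratio >= 1.0:
        return "over"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Free tier sizes quoted in this post; the usage figures are made up.
print(quota_status(12 * MB, 16 * MB))    # 75% of 16MB
print(quota_status(200 * MB, 240 * MB))  # about 83% of 240MB
print(quota_status(17 * MB, 16 * MB))    # past the limit
```

Run that on a schedule and email yourself on "warning" and you'll know to switch tiers before writes stop or overages kick in.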

Reliability/Performance
This is something I can't speak about. I haven't used either service long enough to know. It's entirely possible that MongoHQ performs better and is more reliable than MongoLab. That would ultimately be the deciding factor for me to use MongoHQ over MongoLab, despite the pricing and features. It's simply more important that a database stays up.

My Choice
Both are very comparable services and are great. The differences between the two are minor and unless you're familiar with MongoDB, you likely won't really know what the differences are. You may not even care.

I personally will continue to play with both services since they have a free tier, but I am leaning toward MongoLab. I think they are definitely a service to follow and use. If you have a personal site that you want to use MongoDB, then I'd suggest trying one of these services. If you're on shared hosting, you likely need to use one of these services in order to use MongoDB. 

I think MongoLab also currently has the edge for four reasons, listed in order of what's most important to me.
#1 Choice of EC2, Rackspace, or Joyent with private IPs for (hopefully) better performance
#2 Replication
#3 Their free tier gives you more storage space
#4 Version 1.8.x of MongoDB

Updates on the Way

Posted by Tom on Tue, Sep 06 2011 21:27:00

I really need to update my website. First, my portfolio section is way dated. I want to add a reference to at least one more major website that I got the opportunity to work on while at ExpandTheRoom, but I also have a few other personal projects I'd like to list under that section as well. I'm still trying to figure out how to best list all of my open-source contributions too. I may change up the "portfolio" and "projects" section...

Aside from all that, I want to be more active in posting blog entries (which I always say) and I want to bring in my Twitter feed since I usually post some great links there. Also, I've been struck by the creativity bug and I just want to put more effort into design. I really have a minimalistic design here that I'm not happy with. The new experiments with web typography that I've been playing with can really help me out.

I've been using Croogo, which is a really, really good blog/CMS for CakePHP. However, lately I've been working with the Lithium framework. In fact, so much that I'm really getting out of touch with CakePHP. So it really makes sense to move my site onto Lithium...Likely, the Minerva CMS I've been working on.

I may also set up a Q&A type section as well...Specifically with regard to the Lithium framework (but open to all areas of the web that I may be privy to). I don't want to have some sort of open Q&A site, but I do want to provide more insightful blog posts in the form of a "Q&A" for the little niche of the internet I find myself in. Kinda like some of the tips/tutorials that you can find throughout my blog, but now with a little more organization/emphasis.

I've left my full-time job and have gone off on my "own" (working with some good people), so I'm a free man! I'm on the loose and I'm dangerous. So stay tuned for some updates!

Posted in general, Off-Topic