MongoDB: Dealing With Data That Gets Confused as Sub-Objects
So I came across an interesting challenge the other night. I want to store in MongoDB a bunch of URLs and how often they are accessed. Call it simple metrics for a web site. The structure for my JSON response that I want is like so:
{
'http://www.site.com/whatever/page.html' : 23,
'http://www.site.com/another.html' : 4
}
The problem is that saving this into a MongoDB field isn't possible in this structure. If you don't know why, it's because MongoDB treats periods in key names as an indicator for sub-objects. So if I go to save that under a "urls" field in my collection, I'll end up with { http://www { site { com/whatever and so on.
encodeURIComponent? Sure, but periods don't get encoded to their ASCII equivalent of %2E, and according to the RFC they don't need to be. Even though we could probably replace them ourselves, what if we don't want to end up with % symbols everywhere? What if we have, instead of URLs, geo coordinates? Lat/lon pairs are not strings, so we would need to cast them to strings before sending them through, and so on.
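To make the period problem concrete, here is a minimal sketch of the "encodeURIComponent plus replace" idea. The helper name dotSafeEncode is hypothetical and only for illustration; it is one possible way to do it, not necessarily what you would want in production.

```javascript
// encodeURIComponent leaves periods alone because "." is an unreserved character.
var url = 'http://www.site.com/another.html';
var encoded = encodeURIComponent(url);
// encoded is 'http%3A%2F%2Fwww.site.com%2Fanother.html' (the periods are still there)

// Hypothetical helper: percent-encode first, then replace the remaining periods.
function dotSafeEncode(value) {
  // String() lets non-string values, such as a latitude number, pass through too.
  return encodeURIComponent(String(value)).replace(/\./g, '%2E');
}

// decodeURIComponent turns %2E back into a period, so no custom decoder is needed.
var key = dotSafeEncode(url);
var roundTrip = decodeURIComponent(key);
```

The result is key safe, but it is full of % symbols, which is the trade-off mentioned above.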
Why not add slashes or other characters? Because, if you are dealing with URLs, those characters could be mistaken for valid ones. There are going to be limited options for replacing a period.
So encodeURIComponent plus some replace magic is one possible solution (it didn't really do it for me though). I prefer to base64 encode the values. Of course base64 isn't native to JavaScript, but thanks to PHP.js we have some pretty sweet functions ready to use. See here for a base64_decode() equivalent to PHP's. Of course, both the encode() and decode() functions for base64 also require a UTF8 encode/decode function. So, there are 4 functions in all that you'll need.
Here's what the values look like stored in MongoDB now (for a hypothetical "urls" field):
'urls' : {
'aHR0cDovL3d3dy5zaXRlLmNvbS93aGF0ZXZlci9wYWdlLmh0bWw=' : 23,
'aHR0cDovL3d3dy5zaXRlLmNvbS9hbm90aGVyLmh0bWw=' : 4
}
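Here is a rough sketch of the encoding step that produces keys like the ones above. It assumes Node.js, where Buffer handles base64 and UTF-8 together, rather than the PHP.js functions; the helper names encodeKey, decodeKey, encodeKeys, and decodeKeys are made up for this example.

```javascript
// Assumes Node.js: Buffer covers both the base64 and the UTF-8 steps.
function encodeKey(key) {
  return Buffer.from(String(key), 'utf8').toString('base64');
}

function decodeKey(encoded) {
  return Buffer.from(encoded, 'base64').toString('utf8');
}

// Hypothetical helpers: re-key a counts object so that no key contains a period.
function encodeKeys(counts) {
  var safe = {};
  Object.keys(counts).forEach(function (key) {
    safe[encodeKey(key)] = counts[key];
  });
  return safe;
}

function decodeKeys(safe) {
  var counts = {};
  Object.keys(safe).forEach(function (key) {
    counts[decodeKey(key)] = safe[key];
  });
  return counts;
}

var urls = encodeKeys({
  'http://www.site.com/whatever/page.html': 23,
  'http://www.site.com/another.html': 4
});
var original = decodeKeys(urls);
```

The base64 alphabet is letters, digits, "+", "/", and "=" for padding, so a period can never show up in an encoded key.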
So how do you use this in MongoDB? Simple, you can place the code in any map/reduce or finalize MongoCode. You probably don't want to keep doing that over and over though in each of your files and if you work from the command line, it'll be a nightmare. So you can also save stored JavaScript in MongoDB! Then you can simply call the function as if it was native.
Here are a few sites with further reading on stored JavaScript:
- With Python (but you'll still get the idea)
- MongoDB Docs on Server Side JavaScript
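As a rough idea of what a stored function could look like, here is a sketch with a small dependency-free base64 decoder. It is not the PHP.js code; the function name base64DecodeAscii is hypothetical and it only handles ASCII content such as URLs. The assumption is that the legacy mongo shell is used, where stored functions live as documents in the system.js collection.

```javascript
// Pure-JavaScript base64 decoder with no Node.js or browser dependencies.
// Sketch only: it handles ASCII content such as URLs, not multi-byte UTF-8.
function base64DecodeAscii(input) {
  var alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
  var clean = input.replace(/=+$/, ''); // drop the trailing padding
  var output = '';
  var buffer = 0;
  var bits = 0;
  for (var i = 0; i < clean.length; i++) {
    var value = alphabet.indexOf(clean.charAt(i));
    if (value === -1) {
      throw new Error('Invalid base64 character: ' + clean.charAt(i));
    }
    // Collect 6 bits per character and emit a byte whenever 8 are available.
    buffer = ((buffer << 6) | value) & 0xffff;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      output += String.fromCharCode((buffer >> bits) & 0xff);
    }
  }
  return output;
}

// Shape of a stored-function document: _id is the callable name, value is the function.
var storedFunction = { _id: 'base64DecodeAscii', value: base64DecodeAscii };

// In the legacy mongo shell this document would be written to the special
// system.js collection, for example: db.system.js.save(storedFunction);
```

Once saved, the function can be called by name from map/reduce, finalize, or group code without pasting it into every file.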
Now when you run aggregation, say a group() query, you can decode the values back. You could also decode the values in any other language that has base64 decode capability like PHP. You could keep the PHP.js functions on the front-end and let someone's browser do the work as well.
How much time does it take for all this? What's the overhead? Well, I haven't benchmarked it extensively. I only benchmarked my aggregation process, but I can say that it didn't really take any more time. We're talking fractions of a second. Admittedly the job only took 2 or 3 seconds anyway, but regardless of whether I was running the encode function or not, it was the same time. I imagine a much larger job would see a noticeable difference, but also keep in mind that a larger job is taking time anyway. So if you're concerned about using this function for a query that you want to happen without a page load timing out...Don't.
I'm also interested in seeing whether there are compression functions that can be used to save space. LZW compression has been implemented in JavaScript and the patent has apparently run out on that, so it's kosher to use. Keep in mind that base64 requires about 33% more space for the data. If you're trying to keep an efficient document size, it may not always be the answer. However, the values are definitely key safe and I imagine any other kind of encoding to avoid periods also adds size.
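The 33% figure comes from base64 turning every 3 bytes into 4 characters. A quick sketch of the arithmetic, with the hypothetical helper base64Length:

```javascript
// Base64 turns every 3 input bytes into 4 output characters, padding the last group.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

var url = 'http://www.site.com/another.html';  // 32 bytes of ASCII
var encodedLength = base64Length(url.length);   // 44 characters
var overhead = (encodedLength - url.length) / url.length; // 0.375 for this key
```

Because of padding, short keys can cost a little more than 33%; the overhead approaches one third as keys get longer.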
Advanced MongoDB and the Lithium Framework
So I've been neglecting my blog. I've just been extremely swamped. I want to redesign my entire personal site and blog here. I'm going to rebrand a little bit. I have to get some new business cards for SXSW and more... Anyway, before I do all that I figured I'd post something useful for everyone first.
I've been using a model class, though it could be more of a utility sitting in the "extensions" directory, that helps you work with MongoDB in more advanced ways. This includes some command line calls like mongoimport. Expect to see some more updates to this file as I have the import, but not export. So I'll have that added soon along with some other goodies.
So what can you do with this class? Well, let's start with the import command. You can now easily bundle a JSON file in your project's repo and have a command to set all that initial dependency data. I've also used it to import from a file that I had some other code build first because there were a lot of inserts and the import was faster. You could also, say, use it along with a command class method to import from the Twitter streaming API. Pretty exciting, right? If you saw the MongoSV keynote then you saw how the JSON feed that Twitter's API provides could very easily be used in the command line with mongoimport. Well, this method here makes it very easy for your application to do the same.
The other methods in this class allow you to work with stored JavaScript and execute arbitrary commands. A feature some of you may not know about is the ability to execute JavaScript in MongoDB. It's slower, and single threaded (does not -yet- use the V8 engine), but it's very convenient. In fact, it's how I've managed to pull off a few things like search engines and various machine learning algorithms right from within MongoDB. The stored JavaScript feature basically allows you to write your own functions to call in any future command. This class allows you to add and remove these stored functions in the database.
This class also lets you easily use MongoCode for map reduce, group, and other commands from external JavaScript files or just a string that you pass. External files are helpful because you don't need to write JavaScript inside PHP...It's cleaner and your code editor will pick up on syntax coloring properly. It's also just better for organization and maintenance.
I didn't set up a repository for this one, it's just a single file, for now. You can grab the code here. Be sure to watch the namespace up top and adjust for your application if needed.
Data Mining: Spotting Questions
Another feature that will exist in Social Harvest is question detection. I want to be able to determine and extract questions (perhaps even group and rank the most popular questions asked - one day) to present to a user. More and more companies are using social media (and blogs) for customer service these days. However, if you don't have someone watching like a hawk with a tool like TweetDeck, then questions will fall on deaf ears. Actually, TweetDeck should build in question detection. It could even be built into the ActionScript code within their AIR client.
It's not all that hard actually. Of course I say that without the disclaimer of accuracy. Let me rephrase: it's not all that difficult to detect a good amount of questions, but it is not foolproof. Why? Well, people don't always type with perfect grammar and punctuation for starters. Especially not on the internet and most definitely not on Twitter where one is limited with characters.
We can rely upon question marks naturally. There are very few cases where we see a question mark that doesn't indicate some sort of question (even if rhetorical). This single, simple, regex rule will get you more than half the questions out there - easy. I don't have exact numbers; maybe after I gather questions I can start to report on how many questions I've found that don't use the question mark. In my most basic of tests, I've discovered about 33%. However, that's totally inaccurate, do not assume anything.
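The question mark rule can be sketched in a few lines. This is only an illustration, not the Social Harvest code; the function names are hypothetical, and stripping URLs first is my own assumption, since a query string like ?q=1 would otherwise count as a question.

```javascript
// Remove URLs so that a "?" inside a query string does not count as a question.
function stripUrls(text) {
  return text.replace(/https?:\/\/\S+/g, '');
}

// The single, simple rule: any remaining question mark flags the text.
function hasQuestionMark(text) {
  return /\?/.test(stripUrls(text));
}

var flagged = hasQuestionMark('Is anyone surprised that it is Ron Paul?');
var urlOnly = hasQuestionMark('Read this http://example.com/?q=1 today');
```

Here flagged is true and urlOnly is false, because the only question mark in the second string sits inside the URL.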
Then we get into the more complex methods. Could you use a Naive Bayes classifier? Eh, yeah...You'd be hitting on word frequencies for words such as who, what, when, where, why, and so on. The 5 W's are the next best way to determine if a piece of text is a question or not. However, I don't think you need a Naive Bayes classifier to come to your conclusion. A simpler tree would do ya.
I came across an interesting research paper on the ACL that tries various methods. They do use patterns and do take into consideration all the obvious things we've gone over here: the 5 W's, question marks, etc. One thing that surprised me with their testing is the accuracy and the performance of the various methods seen in table 3. Again, they are not saying that 94% of the questions out there always have question marks, but they are saying that the accuracy rate was 94%. That's interesting; what was that 6% in their sample data that was either missed or contained question marks that weren't for questions? I know it came from Yahoo! but I don't know what exactly...But I was also fascinated by their sequential and syntactic pattern matching. The sets were large, 1,314 and 580, but I don't think that's something that would take up a lot of disk space or take very long to loop through. Again, table 3 impresses me with the accuracy of things. Still not quite as good as relying upon a question mark, but very close. The most important part of this is that you can get nearly the same accuracy, with much faster performance, without relying upon a question mark.
Why is this interesting? Well, back to the main problem at hand. We are now in a world with extremely poor grammar. We are limited by the length of a Tweet and as such you may not see a question mark where a question is implied. By combining both approaches I think you will end up with a pretty comprehensive question detection system.
So I'm working on it. So far I've had some good success, but I have yet to really test things out. I have no control for my tests so I can't determine accuracy. I'll get to it eventually. For now, I wanted to point out three examples that I've detected as questions.
- I just read an article about who has the best chance to beat Obama. Is anyone surprised that it is Ron Paul? #RonPaul2012
- Are you looking forward to a 20 year recession, or are you finally ready for http://t.co/xLmgNA29
- Why bother with a budget when you can always print more money?? We really lack leadership!!! http://t.co/xLmgNA29
These were all detected by my system to be questions. Can I fault the system? No. I think it did a good job...But you can probably very quickly determine which one would go unanswered by someone. #3 is pretty rhetorical, right? I suppose someone could comment back on it, but it's clearly rhetorical and figurative, meant to illustrate this person's point.
Interestingly enough the system also caught #2. This is a prime example of when a question is implied but you see no question mark. I'm not even sure this person ran out of space for one...There just wasn't one. Yes, it's slightly rhetorical; the desire is that the viewer clicks the link to find out the answer. However, look at what's been keyed in on here. The "are you" parts. Certain combinations of words explicitly mean question. If no question mark follows certain phrases like that, then it would be poor grammar or missing punctuation. There are rules in the English language (even if there are exceptions at times).
Then of course without question, no pun intended, #1 is a question. Relying on the question mark easily caught that one. However, "is anyone" would also suffice. It's unlikely to have "is anyone" as a statement. It might be an answer to a question, right? Who can read this blog? Answering, "That is anyone." ...But that's incorrect grammar. "Anyone can" ... or "That would be anyone." However, you can definitely run into problems with always relying upon these rules.
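Combining the two rules could look something like the sketch below. To be clear, this is not my actual system; the opener list and the function name looksLikeQuestion are hypothetical and far smaller than a real pattern set would be.

```javascript
// Hypothetical, very short list of phrases that tend to open a question.
var QUESTION_OPENERS = /^(are you|is anyone|who|what|when|where|why)\b/i;

function stripUrls(text) {
  return text.replace(/https?:\/\/\S+/g, '');
}

// Returns the rule that fired ('question-mark' or 'opener'), or null for no match.
function looksLikeQuestion(text) {
  var cleaned = stripUrls(text);
  if (cleaned.indexOf('?') !== -1) {
    return 'question-mark';
  }
  // No question mark: check whether any sentence starts with a question opener.
  var sentences = cleaned.split(/[.!]+\s*/);
  for (var i = 0; i < sentences.length; i++) {
    if (QUESTION_OPENERS.test(sentences[i].trim())) {
      return 'opener';
    }
  }
  return null;
}

var first = looksLikeQuestion('I just read an article about who has the best chance to beat Obama. Is anyone surprised that it is Ron Paul? #RonPaul2012');
var second = looksLikeQuestion('Are you looking forward to a 20 year recession, or are you finally ready for http://t.co/xLmgNA29');
var third = looksLikeQuestion('Why bother with a budget when you can always print more money?? We really lack leadership!!! http://t.co/xLmgNA29');
```

With these rules, #1 and #3 are caught by the question mark and #2 by the "are you" opener. A statement like "What a great day." would also match, which is exactly the kind of problem you run into when relying on rules alone.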
I think accuracy levels of over 75% are acceptable. I think over a large data set the number of misses will be small and, if presented to a user in the right way...easily ignored. Don't forget that with good UI we can hide away mistakes and inaccuracies from the system or at least prepare the user to deal with them in a simple way. If you give the user the ability to, say, delete the possible question from their view, then it takes all of a click and a second to remove an item that doesn't even appear that often. Let's put it this way: if you are presented with 10 questions from a collection of 1,000 tweets and 1 of those is really not a question...Would you spend more time clicking a button to remove the one error? Or would you spend more time going through 1,000 tweets manually to find all the questions?
So that's question detection - without any code examples. Simple, fun, very powerful and helpful. I wouldn't be surprised if you saw more tools provide these features in the future as "internet noise" grows.
Data Mining: Determining Gender
I recently had a talk with a person about collecting demographics from the web. He brought up some very good points and it was a nice conversation. One of the things that I keep going back to in my head days later is how he felt that a 33% accuracy rate on determining gender was terrible. Perhaps I wasn't clear enough about the fact that it was strictly from Twitter. I think I did mention that though.
Here's the deal folks. Getting demographic information from data mining the internet is a very weak game. Don't expect to be able to add up your findings for male/female (or even by location, etc.) and get 100%. That's plain silly and impossible. This guy talked as if 70% was common. On the internet? Hardly. I actually do not know a good number to shoot for, but as I continue my research and build Social Harvest I will know what that is.
I believe 33% for Twitter to be very good. Of course Facebook is going to allow a greater deal of accuracy, I'm assuming 100% since Facebook actually asks for gender upon registration and displays with the basic info for each user. On the other hand, Twitter doesn't ask for gender. Additionally, you can set your name to whatever you like.
So the problem is that many times you'll have company accounts with no name. That's just the most basic example of why you can never have 100% or even 70% accuracy. If you want to expand it beyond Twitter and look at comments on blogs...You can quickly see that your success rate is going to go down the tubes real quick.
Now, my question is this: Why do all the social media monitoring services such as Radian6, Sysomos, etc. give you results that add up to 100% for gender? They're flat out lying to you. Talking to a friend a while ago he didn't realize this at first. He said, "I don't know, but somehow they just know. Maybe they pay extra for that." ... No, sorry. So I quickly went into the Sysomos demo and pointed out from the very first page of results where a tweet from a user marked as male was actually female. That or they had gender re-assignment and that tool really did know something we didn't!
Why do they do this? My theory is they are afraid to tell customers that they don't know. There's gray areas out there and it's my belief that we should be aware of them...Because by randomly choosing male or female, we're actually skewing the results. It's far better to say, "of the 300 we know about, this many are male..." than it is to simply lie about it. People are trying to target ads based on this data and it's horrible to knowingly be inaccurate. There's always a margin of error and that's a different story.
So how do we determine gender? Well, I can't exactly spill all the beans...But there's actually some very, very good ways to do so. I'll give you a hint. There's some free databases available to you out there from big brother. When I say big brother, I mean the US government. That said, here's the obvious challenge. People named "Pat" and "Sam" are going to also be gray areas just as much as people on Twitter who do not give a first name. You have to put them in the uncertain category as well. It's unfortunate, but you have to.
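To show what keeping an honest "uncertain" category might look like, here is a small sketch. It is not my actual method; the lookup table is a tiny hand-made stand-in for a real name database, and the function names classifyGender and summarize are hypothetical.

```javascript
// Hypothetical, hand-made lookup table. A real system would load a large name dataset.
var NAME_TABLE = { john: 'male', mary: 'female', pat: 'ambiguous', sam: 'ambiguous' };

// Classify by first name only; anything missing or ambiguous stays 'unknown'.
function classifyGender(displayName) {
  var first = String(displayName).trim().split(/\s+/)[0]
    .toLowerCase()
    .replace(/[^a-z]/g, '');
  if (!Object.prototype.hasOwnProperty.call(NAME_TABLE, first)) {
    return 'unknown';
  }
  var label = NAME_TABLE[first];
  return label === 'ambiguous' ? 'unknown' : label;
}

// Tally the results without forcing unknowns into male or female.
function summarize(displayNames) {
  var counts = { male: 0, female: 0, unknown: 0 };
  displayNames.forEach(function (name) {
    counts[classifyGender(name)] += 1;
  });
  counts.known = counts.male + counts.female;
  return counts;
}

var report = summarize(['John Smith', 'Mary Jones', 'Pat Lee', 'Sam', 'Acme Corp', '']);
// report is { male: 1, female: 1, unknown: 4, known: 2 }
```

The report says "of the 2 we know about, 1 is male" instead of pretending that all 6 accounts add up to 100%.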
What about advanced methods? Well, sure there are a few. You can actually analyze the text that people post and determine if it's male or female by their writing style. You can also try to grab page colors to factor into your probability and even use something like Face API to try to analyze profile pictures. I find photo recognition a very interesting thing. However, all of these clever attempts are also subject to a hefty margin of error. The profile photos are very small for Twitter and are typically poorly lit, etc. Additionally, many people don't even use a photo of themselves. You also can't rely on someone's writing style or interests. You may have a user screaming "I love Transformers and Star Trek!" all over, but you really can't count on them being male. Additionally, do you realize the task that's now put before you? All this work just to determine the gender of a single user who posted a single status update on Twitter. Think about doing that thousands or even millions of times over. You want those results sometime this decade, right?
Even if money was no object and you had several computers do this processing to offset the time it took...Even if you also went off and searched Google for people's names to see if you can find additional supporting images... You are still subject to a margin of error. The time and effort...The sheer cost is not worth it.
So I say embrace the gray areas of data. Understand them and know why they exist. In the case of gender, it's simply the nature of the internet. No one requires you to register and expose your identity on the internet. That's the beauty of it. If you're trying to gather demographics on the net, please keep that in mind. If you can't accept and understand that, then you probably also don't understand the internet well enough to be advertising or working with it in a professional manner for a job of some sort.
I will continue my research and hope to find ways to improve things beyond 33%. I have decades of data and clever algorithms to help me do that, but for now...I'm quite happy to have the most accurate system for determining gender from Twitter...That I've ever seen at least.
Machine Learning in MongoDB
I'm very excited to be speaking at MongoSV on Friday, December 9th about some of my research on machine learning in MongoDB. I've implemented a naive bayes classifier within MongoDB and it works quite well. I will post a good write up (and slides) about that later.
I wanted to leave a blog post to list out some of the things I'll be blogging about in the near future here. More than just machine learning algorithms, there are also some other data mining and indexing algorithms that I'm running within MongoDB that I want to discuss. While I'm not a mathematician or expert in statistics...I have been able to dissect enough of that crazy math to get me where I need to be for my goals and apps at hand.
So the question that keeps driving me is, what kind of creative things can one do with MongoDB? Mongo offers a lot of great features and the 10gen team is hard at work adding more and improving existing features (along with the all important performance improvements).
Some of the things I'll be blogging about in the future include running algorithms like the naive bayes classifier inside MongoDB as well as:
- Other text processing algorithms and methods such as stemming
- Internal, stored JavaScript within MongoDB and benchmarking it to determine when you may want to do it and when you don't
- Implementing the nearest neighbour algorithm in MongoDB
- How about a search engine in MongoDB? What about stored JavaScript that is responsible for indexing other documents to later be searched for?
- Playing around with the new ability of multiple geo-spatial indexing per document and what that can do for us
- ...then maybe some more crazy stuff like trainable neural networks (farther down on my list of research items, but way cool)
So stay tuned! As always, I'm super swamped with work...But this weekend, I've managed to set some good ground work for easily storing JavaScript within MongoDB using PHP and the Lithium framework. I've also started playing around with the Porter stemmer algorithm within MongoDB. Likely for that, especially when using PHP...Using the pecl extension is going to be better. However, it's good to see what we can do in MongoDB.
I'll leave you all with some parting gifts here...The knowledge and research of others.
Here is a great article on stored procedures in MongoDB with PHP.

