Data Mining: Determining Gender
I recently had a talk with a person about collecting demographics from the web. He brought up some very good points and it was a nice conversation. One of the things that I keep going back to in my head days later is how he felt that a 33% accuracy rate on determining gender was terrible. Perhaps I wasn't clear enough about the fact that it was strictly from Twitter. I think I did mention that though.
Here's the deal, folks. Getting demographic information from data mining the internet is a very weak game. Don't expect to be able to add up your findings for male/female (or even by location, etc.) and get 100%. That's plain silly and impossible. This guy talked as if 70% were common. On the internet? Hardly. I actually do not know a good number to shoot for, but as I continue my research and build Social Harvest I will find out what that is.
I believe 33% for Twitter to be very good. Of course Facebook is going to allow a greater deal of accuracy. I'm assuming 100%, since Facebook actually asks for gender upon registration and displays it with the basic info for each user. On the other hand, Twitter doesn't ask for gender. Additionally, you can set your name to whatever you like.
So the problem is that many times you'll have company accounts with no name. That's just the most basic example of why you can never have 100% or even 70% accuracy. If you want to expand it beyond Twitter and look at comments on blogs...You can quickly see that your success rate is going to go down the tubes real quick.
Now, my question is this: Why do all the social media monitoring services such as Radian6, Sysomos, etc. give you results that add up to 100% for gender? They're flat out lying to you. A friend I talked to a while ago didn't realize this at first. He said, "I don't know, but somehow they just know. Maybe they pay extra for that." ... No, sorry. So I quickly went into the Sysomos demo and pointed out, on the very first page of results, a tweet from a user marked as male who was actually female. That or they had gender re-assignment and that tool really did know something we didn't!
Why do they do this? My theory is they are afraid to tell customers that they don't know. There are gray areas out there and it's my belief that we should be aware of them, because by randomly choosing male or female we're actually skewing the results. It's far better to say, "of the 300 we know about, this many are male..." than it is to simply lie about it. People are trying to target ads based on this data and it's horrible to knowingly be inaccurate. There's always a margin of error, but that's a different story.
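To make the "of the 300 we know about" idea concrete, here is a minimal Python sketch of a report that keeps an explicit unknown bucket instead of forcing every user into male or female. The function name `gender_breakdown` and the sample numbers are made up for illustration; they are not from any real tool or dataset.

```python
from collections import Counter

def gender_breakdown(labels):
    """Summarise labels, where each label is 'male', 'female', or None (unknown)."""
    counts = Counter(
        label if label in ("male", "female") else "unknown" for label in labels
    )
    total = len(labels)
    known = counts["male"] + counts["female"]
    return {
        "total": total,
        "known": known,
        "unknown": counts["unknown"],
        # Share of users we could label at all.
        "coverage": known / total if total else 0.0,
        # Shares are computed over known users only, never over the total.
        "male_share_of_known": counts["male"] / known if known else None,
        "female_share_of_known": counts["female"] / known if known else None,
    }

# Invented sample: 900 users, only 300 of them with a usable label.
sample = ["male"] * 180 + ["female"] * 120 + [None] * 600
report = gender_breakdown(sample)
print(report)
```

The male and female shares here only ever describe the users that could be labelled, and the size of the unknown bucket is reported right next to them.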
So how do we determine gender? Well, I can't exactly spill all the beans...But there are actually some very, very good ways to do so. I'll give you a hint. There are some free databases available to you out there from big brother. When I say big brother, I mean the US government. That said, here's the obvious challenge. People named "Pat" and "Sam" are going to be gray areas just as much as people on Twitter who do not give a first name. You have to put them in the uncertain category as well. It's unfortunate, but you have to.
What about advanced methods? Well, sure, there are a few. You can actually analyze the text that people post and try to determine whether the author is male or female by their writing style. You can also try to grab page colors to factor into your probability and even use something like Face API to try to analyze profile pictures. I find photo recognition a very interesting thing. However, all of these clever attempts are also subject to a hefty margin of error. The profile photos are very small for Twitter and are typically poorly lit, etc. Additionally, many people don't even use a photo of themselves. You also can't rely on someone's writing style or interests. You may have a user screaming "I love Transformers and Star Trek!" all over, but you really can't count on them being male. Additionally, do you realize the task that's now put before you? All this work just to determine the gender of a single user who posted a single status update on Twitter. Think about doing that thousands or even millions of times over. You want those results sometime this decade, right?
Even if money were no object and you had several computers do this processing to offset the time it took...Even if you also went off and searched Google for people's names to see if you could find additional supporting images... You are still subject to a margin of error. The time and effort...The sheer cost is not worth it.
So I say embrace the gray areas of data. Understand them and know why they exist. In the case of gender, it's simply the nature of the internet. No one requires you to register and expose your identity on the internet. That's the beauty of it. If you're trying to gather demographics on the net, please keep that in mind. If you can't accept and understand that, then you probably also don't understand the internet well enough to be advertising or working with it in a professional manner for a job of some sort.
I will continue my research and hope to find ways to improve things beyond 33%. I have decades of data and clever algorithms to help me do that, but for now...I'm quite happy to have the most accurate system for determining gender from Twitter...That I've ever seen at least.
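As a rough illustration of the "uncertain category" idea, here is a minimal Python sketch of a first-name lookup. It is not the actual method hinted at above: the table `NAME_COUNTS`, its numbers, the function `guess_gender`, and the 0.95 threshold are all hypothetical stand-ins for whatever name-frequency data you have.

```python
# Hypothetical (male, female) counts per first name. The numbers are invented.
NAME_COUNTS = {
    "james": (4_900_000, 20_000),
    "mary": (15_000, 4_100_000),
    "pat": (60_000, 70_000),
    "sam": (90_000, 30_000),
}

def guess_gender(display_name, threshold=0.95):
    """Return 'male', 'female', or 'unknown' for a profile's display name."""
    parts = display_name.strip().lower().split()
    if not parts:
        return "unknown"  # no name given at all
    counts = NAME_COUNTS.get(parts[0])
    if counts is None:
        return "unknown"  # company accounts, nicknames, names not in the table
    male, female = counts
    total = male + female
    if total == 0:
        return "unknown"
    if male / total >= threshold:
        return "male"
    if female / total >= threshold:
        return "female"
    return "unknown"  # ambiguous names stay in the gray area

for name in ["James Smith", "Mary Jones", "Pat Doe", "Sam", "Acme Corp", ""]:
    print(repr(name), "->", guess_gender(name))
```

Raising the threshold trades coverage for accuracy: fewer users get a label, but the labels you do hand out are more likely to be right.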
Machine Learning in MongoDB
I'm very excited to be speaking at MongoSV on Friday, December 9th about some of my research on machine learning in MongoDB. I've implemented a naive Bayes classifier within MongoDB and it works quite well. I will post a good write-up (and slides) about that later.
I wanted to leave a blog post that lists out some of the things I'll be blogging about here in the near future. More than just machine learning algorithms, there are also some other data mining and indexing algorithms that I'm running within MongoDB that I want to discuss. While I'm not a mathematician or expert in statistics...I have been able to dissect enough of that crazy math to get me where I need to be for my goals and apps at hand.
So the question that keeps driving me is, what kind of creative things can one do with MongoDB? Mongo offers a lot of great features and the 10gen team is hard at work adding more and improving existing features (along with the all-important performance improvements).
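For readers who haven't met the algorithm before, here is a minimal naive Bayes text classifier in plain Python. It is only a sketch of the general technique (word counts per label with add-one smoothing), not the MongoDB implementation mentioned above; the class name, method names, and training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()              # number of training texts per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()                       # every word seen during training

    def train(self, text, label):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = max(len(self.vocab), 1)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods for each word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("great fun love it", "positive")
nb.train("love this great product", "positive")
nb.train("terrible awful hate it", "negative")
nb.train("hate this awful thing", "negative")
print(nb.classify("love this great thing"))
```

Training is nothing more than incrementing counters and classifying is nothing more than reading them back, which is why the algorithm maps reasonably well onto a document store.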
Some of the things I'll be blogging about in the future include running algorithms like the naive bayes classifier inside MongoDB as well as:
- Other text processing algorithms and methods such as stemming
- Internal, stored JavaScript within MongoDB and benchmarking it to determine when you may want to do it and when you don't
- Implementing the nearest neighbour algorithm in MongoDB
- How about a search engine in MongoDB? What about stored JavaScript that is responsible for indexing other documents to later be searched for?
- Playing around with the new ability of multiple geo-spatial indexing per document and what that can do for us
- ...then maybe some more crazy stuff like trainable neural networks (farther down on my list of research items, but way cool)
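As a taste of one item on that list, here is a minimal brute-force nearest neighbour sketch in plain Python. It says nothing about how this would be done inside MongoDB; the function names and sample points are invented for illustration.

```python
import math
from collections import Counter

def nearest_neighbours(query, points, k=3):
    """points is a list of (vector, label); return the k closest to query."""
    return sorted(points, key=lambda p: math.dist(query, p[0]))[:k]

def knn_classify(query, points, k=3):
    """Label the query by majority vote among its k nearest neighbours."""
    votes = Counter(label for _, label in nearest_neighbours(query, points, k))
    return votes.most_common(1)[0][0]

# Two invented clusters of 2D points.
points = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]
print(knn_classify((0.5, 0.5), points))
print(knn_classify((5.2, 5.1), points))
```

Sorting every point for every query is fine for a demo, but it is exactly the cost that an index is meant to avoid on real data.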
So stay tuned! As always, I'm super swamped with work...But this weekend, I've managed to lay some good groundwork for easily storing JavaScript within MongoDB using PHP and the Lithium framework. I've also started playing around with the Porter stemmer algorithm within MongoDB. For stemming, especially when using PHP, the PECL extension is likely going to be the better choice. However, it's good to see what we can do in MongoDB.
I'll leave you all with some parting gifts here...The knowledge and research of others.
Here is a great article on stored procedures in MongoDB with PHP.
Typographic Lockups
I haven't designed in a while. It's just been too crazy busy with programming web sites. I'm working on a project now where I get to design again! It's very exciting. I realize now I have a real good ability to see both sides of the fence here. I'm an educated graphic designer and took many years of typography classes (more than the required amount). So I know exactly what I'm doing with regard to setting type and I also happen to love it. Sadly, I settle for what the web has and its limitations. I also know that if I go and create images for every single bit of text...I miss out on SEO and it's a real pain to update.
So we suffer when it comes to the web. We all do. Of course there are various forms of dynamic text replacement, some using Flash, others PHP (with the GD library), and then of course don't forget the new possibilities with CSS @font-face. FontSquirrel and Google are two really good places to get some free fonts that are more than your basic web fonts.
That's great and all, but... We're still limited. Being creative takes a lot of work manipulating CSS and HTML. However, it is possible. The problem is that it's pretty much single serving. You don't always know how many words are in a piece of copy so it's hard to create re-usable styles for things like titles, subtitles, and pullquotes. For example, look at the titles for my blog posts. Sometimes there are enough words in the title that it spans several lines. When you compare that against setting type for print... You realize that in print you can adjust things to make it fit on one line if so desired. On the web, we have to live with what we get for the most part.
I'm trying to fix that. I'm working on a jQuery plugin that allows you to create beautiful typographic lockups and have those be a re-usable set of rules. So you can apply the same style to, say, all your blog post titles, but with a bit of intelligence. Things like if there are this many words do this, or break the line at the 5th word, etc. Think of it like programming working with design in unison. It's beautiful. See (and see above)?
Now, the best part is that this lockup works with 4 or more words. It'll just keep wrapping if more lines are needed. Every word including and after "post" in this case will be the same size and everything. The only thing required to output the above is a span element wrapped around the title. So when it comes to something like creating a template for a CMS or blog, there's no extra work beyond applying a class.
How does it work? Well, in the above screenshot, both the date and the title have a span element wrapped around them with a specific class. The jQuery plugin then basically applies a set of "typographic rules" to the copy. Things like how many words per line, etc. The date is slightly different, that just gets broken out so that the day, month, year, all live in their own span element. It's much more straight-forward than the type lockups.
The JavaScript then breaks up the copy and replaces it with new HTML that has each line wrapped with a div and each word wrapped with a span element. Classes are also present with all this marking each individual word and line so they can be targeted exclusively with CSS. For example: l_1, l_2, etc. for lines and w_1, w_2, etc. for words. Each line and word also have an "l" and "w" class on them allowing for a broader set of CSS rules. Of course, each lockup can have its own class and id values applied to it so you can further target specific lockups to apply different styles.
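The markup described above can be sketched roughly as follows. This is Python rather than the plugin's JavaScript, and it is only an illustration under assumptions: the function name `build_lockup`, the `words_per_line` list, the choice to number words across the whole lockup rather than per line, and the choice to repeat the last entry for extra lines are all guesses, not the plugin's actual API.

```python
from html import escape

def build_lockup(text, words_per_line):
    """Split text into lines of words and wrap them in lockup markup.

    words_per_line is the "sequencing", e.g. [1, 2, 3] for one word on the
    first line, two on the second, and three on every line after that.
    """
    if not words_per_line or min(words_per_line) < 1:
        raise ValueError("words_per_line needs at least one positive entry")
    words = text.split()
    lines, start, line_index = [], 0, 0
    while start < len(words):
        # Assumption: once the sequence runs out, the last entry keeps repeating.
        size = words_per_line[min(line_index, len(words_per_line) - 1)]
        lines.append(words[start:start + size])
        start += size
        line_index += 1

    html, word_no = [], 0
    for line_no, line in enumerate(lines, start=1):
        spans = []
        for word in line:
            word_no += 1  # assumption: words are numbered across the whole lockup
            spans.append(f'<span class="w w_{word_no}">{escape(word)}</span>')
        html.append(f'<div class="l l_{line_no}">{" ".join(spans)}</div>')
    return "\n".join(html)

print(build_lockup("My first blog post about type", [1, 2, 3]))
```

With markup like this, a rule on `.l_1` can blow up the first line while a rule on `.w` keeps shared word styling in one place.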
Once you get your CSS defined and what I'll call the "sequencing" of words per line, you're set! You can then copy and paste the styles and JSON settings for the jQuery method call and share them across sites and such if you want. It's just a tiny bit of configuration there to create beautiful and re-usable typographic lockups.
I'll be working on this a bit more to polish it and add more settings. I also have an option to wrap everything in quotes if you want as well for things like pullquotes. You'd simply add a "q" class to the element you applied the method to or you'd pass to the method an option for quotes set as true.
More to come soon!
Agile Uploader Now on Github
Agile Uploader is now available on Github! I've decided to release the source code for the project after a long while and careful thought. The reason? I simply can't continue to maintain it. The tool does exactly what I was after and quite well for my needs. While I still will use other upload tools, when it comes to needing to resize images that users upload (namely photographs, typically off SD cards), it does the job perfectly.
However, the tool was falling short of some users' needs (maybe 10%, or less, of the comments I received). I briefly went over some of the research involved in building the tool before, but I don't really want to make anyone work hard to get from A to B. So I'm going to give everyone B and if you want to get to C...Then, by all means, go for it.
You can find the project here on Github. Please fork it and feel free to leave feedback. I can't promise that I'll be working on it much more, but I will fix any major bugs. I hope that people can add features that will help others out there.
A note on what you're looking at...You are looking at a FlashDevelop project. In the normal "bin" directory, where the SWF file is output by default, I also included all the demos for the tool. Most important is the agile-uploader-3.0.js file. This is the jQuery plugin that I built and bundled with Agile Uploader. This directory structure leaves a bit to be desired, I know, but it should give you everything you need. Please reference the jQuery plugin along with the comments in the source code of the Main.as file. The tool really works through Flash's ExternalInterface, which handles the Flash <-> JavaScript communication. You don't need to use jQuery, but you'd need to write your own JavaScript to use Agile Uploader if you didn't...Or if you wanted to write your own custom jQuery too, of course.
You can work on Agile Uploader with the Flex SDK and FlashDevelop. The tool is built completely upon open source code.
Enjoy!


