Shift8 Creative Graphic Design and Website Development

Auto Tagging Content

Posted by Tom on Fri, Jan 21 2011 10:07:00

It's 2011, and I firmly believe that we should have systems smart enough to automatically tag our content: our tweets, our Facebook wall posts, our blog posts, etc. There are even face detection APIs available out there. So surely getting context and keywords from a piece of copy should be easy, right?

There are a few major players in this arena, and sure enough there are some free APIs for getting keywords and context from copy. Reuters' OpenCalais, Alchemy, and Yahoo's term extractor are all great services to use. However, how current are they?

I have a new project where I want to auto tag content (I also plan to add auto-tagging to my Minerva CMS), but there's a catch. The content is going to be very, very recent, so these services may not have had a chance to pick up some of the terms. For example, "iPad" doesn't come back when using the Alchemy service. It doesn't know that "iPad", like "iPhone" (which does come back as a hit), belongs in the Technology category and should be a keyword. Obviously this will change as time goes on, but as of Jan 2011 it's apparently not in their index. Interestingly enough, Yahoo's keyword extractor does seem to pick up "iPad". Maybe you have to use multiple services for the best coverage.
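If you do end up combining services, the merging step itself is simple. Here's a minimal sketch, assuming each service's response has already been boiled down to a plain list of keyword strings. The mergeKeywords() function and the sample data are made up for illustration; they are not part of any of the APIs mentioned.

```php
<?php
// Hypothetical helper: merge keyword lists from several extraction services
// into one list, de-duplicated case-insensitively.
function mergeKeywords(array $resultsByService)
{
    $merged = array();
    foreach ($resultsByService as $service => $keywords) {
        foreach ($keywords as $keyword) {
            $keyword = trim($keyword);
            $key = strtolower($keyword);
            if ($key === '') {
                continue; // skip empty entries
            }
            // Keep the first spelling seen and track which services reported it.
            if (!isset($merged[$key])) {
                $merged[$key] = array('keyword' => $keyword, 'services' => array());
            }
            if (!in_array($service, $merged[$key]['services'], true)) {
                $merged[$key]['services'][] = $service;
            }
        }
    }
    return array_values($merged);
}

// Made-up sample results, not real API output.
$tags = mergeKeywords(array(
    'alchemy' => array('iPhone', 'SD cards'),
    'yahoo'   => array('iPad', 'iphone'),
));
foreach ($tags as $tag) {
    echo $tag['keyword'], ' (', implode(', ', $tag['services']), ")\n";
}
```

Tracking which services reported each keyword also gives you a crude confidence signal: a term that several services agree on is probably a safer tag than one that only a single service returned.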

So since you've found my site, you may or may not know that I'm in love with PHP, the Lithium framework, and MongoDB. Well, I am. So I took the Alchemy API for PHP and converted it into a Lithium library. Basically this involved namespacing the classes, breaking them out into separate files, and making sure they could depend on each other via "use." I then added two methods to the API classes (there's a cURL class and a "normal" one) that help out a bit. One grabs the config information set when you call Libraries::add(), so that you don't need to set your API key every time you go to make a call. The other is just for convenience, so you can call the API statically. It's not really static, of course. I didn't go through and rewrite the API; it's just a static wrapper. Now you can simply do something like:

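use alchemy\AlchemyAPI;

// Ask Alchemy for ranked keywords, requesting the response as JSON.
$data = AlchemyAPI::call('TextGetRankedKeywords', array('The iPad will feature a slot for reading SD cards unlike the iPhone', 'json'));
// Decode the JSON string and dump the result for inspection.
var_dump(json_decode($data));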

With that, you'd then get back a nice set of data from the service (json_decode() gives you objects by default; pass true as the second argument if you'd rather have arrays). Again, I just like calling things like this statically, but you could also say $alchemy = new AlchemyAPI() and yada yada yada. If anyone wants this Alchemy library for Lithium, let me know, because right now I'm unsure if I'm going to use it myself. It wasn't a lot of work to modernize these classes, but it might save you some time. I'll probably also end up making something similar for OpenCalais and Yahoo so they work with Lithium as well.
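For anyone curious what a static wrapper like this can look like, here's a rough sketch. To be clear, this is not the actual library code: TaggerConfig is a made-up stand-in for the config that Libraries::add() would hold, FakeKeywordClient is a stub in place of the real API class, and extractKeywords() is a hypothetical method name. The only point is the pattern of lazily building one instance from stored config and forwarding static calls to it.

```php
<?php
// Hypothetical stand-in for the library config that Libraries::add() would hold.
class TaggerConfig
{
    protected static $_config = array();

    public static function set($library, array $config)
    {
        self::$_config[$library] = $config;
    }

    public static function get($library)
    {
        return isset(self::$_config[$library]) ? self::$_config[$library] : array();
    }
}

// Stub client standing in for the real instance-based API class.
class FakeKeywordClient
{
    protected $_apiKey;

    public function __construct($apiKey)
    {
        $this->_apiKey = $apiKey;
    }

    // A real client would send an HTTP request here; this stub just echoes its inputs.
    public function extractKeywords($text, $outputMode = 'json')
    {
        return json_encode(array(
            'apiKey'     => $this->_apiKey,
            'outputMode' => $outputMode,
            'length'     => strlen($text),
        ));
    }
}

// The static wrapper: builds one client from config on first use, then forwards calls.
class KeywordApi
{
    protected static $_client = null;

    public static function call($method, array $params = array())
    {
        if (self::$_client === null) {
            $config = TaggerConfig::get('alchemy') + array('apiKey' => null);
            self::$_client = new FakeKeywordClient($config['apiKey']);
        }
        return call_user_func_array(array(self::$_client, $method), $params);
    }
}

TaggerConfig::set('alchemy', array('apiKey' => 'YOUR_API_KEY'));
$data = KeywordApi::call('extractKeywords', array('The iPad will feature a slot for reading SD cards', 'json'));
var_dump(json_decode($data, true));
```

The nice part of doing it this way is that the underlying class doesn't need to change at all. The trade-off is that the wrapper holds on to a single shared instance, so it's no good if you need two different API keys in the same request.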

But back to our issue at hand. How do I go about getting relevant keywords? I'm half ready to spend an insane amount of time trying to use an artificial neural network or something (cool fun fact: there are actually a few for PHP)... Perhaps keep a list/dictionary in MongoDB and search against it? I'm not sure. There has to be something clever that MongoDB can help out with. Hopefully someone out there might know and leave a nice comment. :)
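To make the dictionary idea a bit more concrete, here's a minimal sketch of what I mean. It assumes a hand-maintained list of terms and categories, and it uses a plain PHP array where the real thing might be a MongoDB collection. The tagFromDictionary() function and the sample terms are purely illustrative.

```php
<?php
// Hypothetical dictionary of known terms; in practice these could be
// documents in a MongoDB collection that you keep up to date yourself.
$dictionary = array(
    'ipad'    => 'Technology',
    'iphone'  => 'Technology',
    'sd card' => 'Technology',
);

// Return every dictionary term found in the text, keyed by term.
function tagFromDictionary($text, array $dictionary)
{
    $tags = array();
    foreach ($dictionary as $term => $category) {
        // Case-insensitive whole-word match; the optional "s" catches simple plurals.
        $pattern = '/\b' . preg_quote($term, '/') . 's?\b/i';
        if (preg_match($pattern, $text)) {
            $tags[$term] = $category;
        }
    }
    return $tags;
}

$text = 'The iPad will feature a slot for reading SD cards unlike the iPhone';
print_r(tagFromDictionary($text, $dictionary));
```

This loops over the whole dictionary for every piece of content, which is fine for a few hundred terms but won't scale forever. The upside is that a brand new term like "iPad" can be tagged the moment you add it to the list, without waiting on a third-party index.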

For now, I hope I've given you a few ideas and some resources to go look at if you've never attempted this before (or even if you have and were unaware of a few of the services mentioned). Keep an eye out for some possible Lithium libraries for this stuff, and definitely expect to see it in the CMS I'm working on.

