Shift8 Creative Graphic Design and Website Development

Does the Twitter Firehose Really Matter?

Posted by Tom on Wed, Dec 28 2011 22:51:00

As I go through more and more research about data mining the social web, I keep coming back to this question. So far, I've been extremely thorough in my research and analysis of data. I assumed, like many, that getting every bit of data was one of the most important things to social media analysis.

I think I'm starting to go back on this thought. I'm beginning to realize that the Twitter firehose is overrated. It's actually a brilliant gimmick. Imagine this: $0.10 per thousand tweets. How much money must Twitter make off its own buzz? Wow. There are reports of Google paying as much as $15 million to drink from the hose! That's incredible.

There are clear benefits to having all this information available to you. Absolutely. No question about it. However, everyone believes they need it, and that simply isn't true. Services like Radian6 advertise that they have access to this magical hose. Does this make them the best social media monitoring and analysis service? No, far from it.

Why? Why wouldn't having more data be better for accuracy? Because more data simply paints a broader picture for you; it doesn't paint a more accurate one. When it comes to accuracy there's a lot to consider, and I'll be writing more posts about just that in the future (referencing some research I've been doing with various algorithms when I can). So that's a really long question to answer. Let's just start with the common objection: if you can't see all of the data, then you don't know what everyone is saying. If the 80% that you don't see has negative things to say, while the 20% you can see has positive things to say... Where does that leave you?

That's incorrect thinking. What Twitter returns to you is a sample. Do you (or does any statistics buff out there) really know the probability that the sample you got just happened to be the 20% of positive statements? Yeah... Think about that.

Here's how Twitter works, by the way. You can even look it up in their API documentation. What Twitter returns to you are the most popular tweets. If anything, these are the more relevant tweets. They should carry more weight than the ones you might be missing, right? Simply because more people are reading them.

So for those two reasons, it's pretty easy to start doubting the value of the firehose for data analysis surrounding a particular subject matter. However... Yeah, you knew that was coming. It all depends on the application. If you are trying to determine what people think about your brand, then the hose doesn't matter. If, on the other hand, you are trying to build a social media search engine... Then yes! Yes, that hose matters greatly! If you are trying to determine every single user's mood globally, then yes, it matters. However, if you're simply trying to get an idea of what is being shared about a very particular subject, then it doesn't matter.

Think of this analogy. A winemaker does not drink the entire barrel to get a taste for the wine; he uses a thief to siphon out a tiny sample to taste. This is probably the most important analogy you can ever take away when it comes to social media analysis, metrics, and statistics in general.
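If you want to see the wine thief at work, here is a minimal Python sketch. It assumes a uniformly random sample, which is an idealization and not necessarily what Twitter's API hands back, and the function name, the 20% share, and the sample size are all made up for illustration. It estimates the share of positive tweets from a sample and attaches a rough 95% margin of error using the usual normal approximation.

```python
import math
import random


def estimate_positive_share(true_positive_share=0.20, sample_size=10_000, seed=42):
    """Estimate the share of positive tweets from a random sample.

    Assumption: every sampled tweet is independently positive with
    probability `true_positive_share`, i.e. the sample is uniformly
    random over a very large stream. All values here are hypothetical.
    """
    rng = random.Random(seed)

    # Taste the barrel: count how many sampled tweets are positive.
    positives = sum(
        1 for _ in range(sample_size) if rng.random() < true_positive_share
    )
    estimate = positives / sample_size

    # Approximate 95% margin of error for a sample proportion.
    margin = 1.96 * math.sqrt(estimate * (1 - estimate) / sample_size)
    return estimate, margin


if __name__ == "__main__":
    estimate, margin = estimate_positive_share()
    print(f"Estimated positive share: {estimate:.3f} +/- {margin:.3f}")
```

Notice that the margin depends on the size of the sample, not on how many tweets exist in total. That's the barrel and the thief in a nutshell, as long as the sample really is random.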

Second, the firehose actually comes in several flavors. Many people don't realize that. So guess what? Those using the firehose are still likely to miss data. Think about how much data gets passed around the internet, even for a single search query (given that it's broad). There is going to be data that slips through the cracks. Here's another important thing to keep in mind: how many people out there are listening to every single tweet? I bet only your monitoring tool is. So yes, your monitoring tool is influenced one way, by all the tweets out there, period (assuming you can actually get them all). However, the users of the internet do not see all of those tweets, so they are influenced in another way. I'm sure there's some clever math to explain what that means, but I hope you get the picture.

So since the hose comes in several flavors and is very expensive... I'm going to wrap up this blog post with one more question you may want to ask yourself. Is it worth it? Look at services like Gnip and DataSift. Any of the social media monitoring tools you see could use those services if they wanted to. However, they will need to, in turn, pass the cost on to the client. Looking at DataSift, it seems pretty absurd if you use their pricing calculator. You can collect 1 tweet per hour for $144.07 per month. What?! OK, clearly that's never going to happen, so let's be a little more realistic. How about 1,000 tweets per hour? That costs $216/mo. That's still too much. Did you know that you can get that many tweets (if not more) within an hour for free? You don't need the firehose at all. OK, I'll go with an example that really shows why one might need the firehose. Call it 75,000 or even 100,000 tweets per hour. Now this is beyond the standard Twitter API. How much does it cost? $5,500 to $7,300... More, actually, since I'm assuming one "DPU" isn't enough to handle that amount of data. Funny how they are set up like a hosting provider, with the sliders on the pricing page, the way they price things out, and even the cute "DPU" gimmick. Anyway, I do think they are offering quite a neat service.
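To put those quotes on a common scale, here is a quick back-of-the-envelope sketch in Python that converts each one to a price per thousand tweets. It makes two assumptions of mine: a 30-day month, and that the $5,500 and $7,300 figures line up with 75,000 and 100,000 tweets per hour respectively. The function and variable names are just for illustration and have nothing to do with DataSift's actual calculator.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month


def cost_per_thousand(monthly_price, tweets_per_hour):
    """Return the price in dollars per 1,000 tweets for a monthly quote."""
    tweets_per_month = tweets_per_hour * HOURS_PER_MONTH
    return monthly_price / (tweets_per_month / 1000)


# (tweets per hour, monthly price in dollars) as quoted in this post.
# Pairing $5,500 with 75,000 and $7,300 with 100,000 is an assumption.
quoted_tiers = [
    (1, 144.07),
    (1_000, 216.00),
    (75_000, 5_500.00),
    (100_000, 7_300.00),
]

if __name__ == "__main__":
    for tweets_per_hour, monthly_price in quoted_tiers:
        rate = cost_per_thousand(monthly_price, tweets_per_hour)
        print(f"{tweets_per_hour:>7,} tweets/hour: ${rate:,.2f} per thousand tweets")
```

Under those assumptions, the 1,000 tweets per hour quote works out to about $0.30 per thousand tweets, and the high-volume quotes land close to the $0.10 per thousand figure. The per-tweet price only looks reasonable once you are buying far more data than a single brand or topic is likely to need.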

OK, so to the point. Is, let's say, $5,000 worth it to you to gather all of these tweets? Really? Is it really? How about all the other areas of the web that you have yet to gather data from? Or did you forget about those? Is Twitter the only thing that matters? You're so wrapped up right now in the buzz of "big data" and how important and coveted and rare this firehose is. All the fancy numbers and pieces of information out there whizzing by at as much as three quarters the speed of light (look it up). For what? Why? So you can pay too much money and still not get a complete picture of the social web? What's the value you're getting from all this?

So that's really the question to ask. When you start paying thousands, hundreds of thousands, even millions for metrics in such a short period of time... I truly believe it's not worth it. Only so many companies can continue to pour that kind of money into this. So does the firehose matter (if you want to know what people are saying about "X")? I say no. What say you? Why? Leave a comment! I would love to hear back from some people, and stay tuned for more!

Posted in Technology
