So: I've been really interested in data mining and natural language analysis lately. I've set up tweet mining on Linode using MongoDB, and that's gathering data as we speak, but in the meantime I wanted to get started with something a bit simpler.
I figure the best place to start analyzing Twitter data is with my own Twitter Archive. More information on finding and downloading your Twitter Archive can be found here.
After a short search around GitHub I found Dangoldin's twitter-archive-analysis, which already did a lot of the things I was interested in, and also seemed like a good introduction to some of the Python data analysis tools I've been reading about lately, including numpy, matplotlib, and nltk.
I've made a few adjustments to his code. This is the first time I've ever forked a repo on GitHub, so I'm open to feedback. I think I've made it easier for people on any operating system and at any experience level to use. I've also added word cloud visualization with PyTagCloud.
Now onto the results!
Here are all of the tweets I've ever made distributed over time by month. It looks like I used to tweet a lot more in 2009-2010, which I will now call 'peak tweet'.
I wouldn't have guessed that I tend to tweet the most on hump day, but it looks like I do more tweeting on Wednesday than I do on any other day of the week.
This Twitter-wide analysis shows Tuesday, followed by Wednesday, as the most popular days to tweet. (Although I've seen variations on this chart that show Thursday and Friday as the most popular.)
I also tend to tweet in the late morning and afternoon, which is fairly typical.
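For anyone curious how charts like these come together, here is a minimal sketch of the bucketing step, using only the standard library. It is not the code from the repo. The timestamp layout, the function name bucket_timestamps and the sample values are all assumptions made for illustration, so check them against your own archive.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "2009-10-14 15:32:11 +0000".
# Adjust this if your archive stores dates differently.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"

# Spelled out by hand so the labels don't depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def bucket_timestamps(timestamps):
    """Count tweets per month, per weekday, and per hour of day."""
    by_month, by_weekday, by_hour = Counter(), Counter(), Counter()
    for raw in timestamps:
        when = datetime.strptime(raw, TIMESTAMP_FORMAT)
        by_month[when.strftime("%Y-%m")] += 1
        by_weekday[WEEKDAYS[when.weekday()]] += 1  # weekday(): Monday is 0
        by_hour[when.hour] += 1
    return by_month, by_weekday, by_hour


# Made-up sample timestamps, not real data from my archive.
sample = [
    "2009-10-14 15:32:11 +0000",
    "2009-10-21 11:05:00 +0000",
    "2010-01-05 09:15:42 +0000",
]
by_month, by_weekday, by_hour = bucket_timestamps(sample)
print(by_weekday.most_common(1))
```

One thing to watch: the hour counts are in whatever timezone the timestamps carry, so if yours are in UTC you would want to convert them to local time before reading too much into a "late morning" peak.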
This chart stacks the type of tweet by month. At first glance, it looks like I am a lot less original and a lot more social now than I was in 2009. While that may be true, Twitter didn't formalize RTs until the end of 2009, which probably accounts for the bloated original tweet count before then.
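Here is a rough sketch of one way to sort tweets into types by looking only at the text. This is a heuristic I'm assuming for illustration, not necessarily how the repo does it: classify_tweet is a hypothetical helper, and the rules (a leading "RT @" means retweet, a leading "@" means reply) are simplifications.

```python
from collections import Counter


def classify_tweet(text):
    """Label a tweet as 'retweet', 'reply', or 'original' from its text alone."""
    stripped = text.lstrip()
    if stripped.startswith("RT @"):
        return "retweet"
    if stripped.startswith("@"):
        return "reply"
    return "original"


# Made-up sample tweets with hypothetical usernames.
sample = [
    "RT @someone: interesting link",
    "@someone thanks for the tip!",
    "Just walked past a lovely old building.",
]
print(Counter(classify_tweet(t) for t in sample))
```

A text check like this would also pick up old-style retweets typed out by hand, as long as they begin with "RT @". Anything quoted in a different style would still land in the original pile.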
Here we see 'peak tweet' again, sometime between September 2009 and February 2010. I remember Twitter being a major creative outlet for me at the time. I've since found more creative outlets than I know what to do with.
Now onto everyone's favourite type of text visualization: Word clouds!
Here are the words I use most often when I tweet. I talk about heritage, Toronto and outer space.
I also like a lot of things.
PyTagCloud uses a fairly basic list of stopwords that it omits from word counts. I added internet-specific stopwords like www and http to the PyTagCloud English stopwords list, and so can you!
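To show the idea behind stopwords without depending on PyTagCloud's internals, here is a small standard-library sketch. The stopword sets are short stand-ins I made up (the real list is much longer), and count_words is a hypothetical helper, not a PyTagCloud function.

```python
import re
from collections import Counter

# Tiny stand-in for a real English stopword list.
BASE_STOPWORDS = {"the", "a", "and", "to", "of", "in", "is", "it"}
# Internet-specific extras, so URL fragments don't dominate the cloud.
WEB_STOPWORDS = {"www", "http", "https", "com"}
STOPWORDS = BASE_STOPWORDS | WEB_STOPWORDS


def count_words(texts, stopwords=STOPWORDS):
    """Count lowercase words across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            # Also skip single characters such as "i" or a stray apostrophe.
            if len(word) > 1 and word not in stopwords:
                counts[word] += 1
    return counts


# Made-up sample text.
sample = ["I like the heritage of Toronto http://www.example.com"]
print(count_words(sample).most_common())
```

The resulting counts are the kind of word-and-frequency data a word cloud is drawn from.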
Here are the people I talk at the most.
I played around with the built-in PyTagCloud colour scheme and font options a lot.
And here are the hashtags I've used the most. I mostly see hashtags as gimmicky or event-related. #builtheritage is the hashtag used for a monthly Twitter chat on the subject, so I use it regularly.
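The people and hashtag clouds both start from the same kind of count. Here is a minimal sketch using regular expressions. The patterns are deliberately simple assumptions, count_tags_and_mentions is a hypothetical helper, and the usernames in the sample are made up.

```python
import re
from collections import Counter

# Simple patterns: "#" or "@" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def count_tags_and_mentions(texts):
    """Return (hashtag counts, mention counts), both lowercased."""
    hashtags, mentions = Counter(), Counter()
    for text in texts:
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        mentions.update(name.lower() for name in MENTION.findall(text))
    return hashtags, mentions


# Made-up sample tweets.
sample = [
    "Looking forward to tonight's chat #builtheritage",
    "@someone see you at #BuiltHeritage!",
    "@someone @another_person good point",
]
hashtags, mentions = count_tags_and_mentions(sample)
print(hashtags.most_common(1), mentions.most_common(1))
```

Lowercasing means #BuiltHeritage and #builtheritage count as the same tag. The mention pattern is loose enough that it would also match the part after the @ in an email address, so it is only a starting point.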
I am a rookie and forever learning. If I've gotten something completely wrong I'd love it if you could send some constructive criticism my way.