Twitter Archive Analysis

Using Python to analyze and visualize your Twitter archive.

View the Project on GitHub laurenarcher/twitter-archive-analysis

Visualizing my Twitter Archive

So: I've been really interested in data mining and natural language analysis lately. I've set up tweet mining on Linode using MongoDB, and that's gathering data as we speak, but in the meantime I wanted to get started with something a bit simpler.

I figure the best place to start analyzing Twitter data is with my own Twitter Archive. More information on finding and downloading your Twitter Archive can be found here.

After a short search around GitHub I found Dangoldin's twitter-archive-analysis, which already did a lot of the things I was interested in, and also seemed like a good introduction to some of the Python data analysis tools I've been reading about lately, including numpy, matplotlib, and nltk.

I've made a few adjustments to his code. This is the first time I've ever forked a repo on GitHub, so I'm open to feedback. I think I've made it easier to use for people on any operating system and at any experience level. I've also added word cloud visualization with PyTagCloud.

Now onto the results!

Matplotlib

Tweets by Month

Here are all of the tweets I've ever made distributed over time by month. It looks like I used to tweet a lot more in 2009-2010, which I will now call 'peak tweet'.
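For anyone curious how a chart like this comes together, here is a minimal sketch of the counting step, using only the standard library. It is not the code from the repo. The timestamp layout and the sample values are assumptions made up for illustration, so check them against your own archive.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "2009-10-14 15:00:00 +0000".
# Adjust this to match whatever your archive export actually uses.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def tweets_per_month(timestamps):
    """Count tweets per (year, month), returned in chronological order."""
    counts = Counter()
    for raw in timestamps:
        posted = datetime.strptime(raw, TIMESTAMP_FORMAT)
        counts[(posted.year, posted.month)] += 1
    return dict(sorted(counts.items()))


# Made-up sample timestamps, not real data from the archive.
sample_timestamps = [
    "2009-10-14 15:00:00 +0000",
    "2009-10-21 16:30:00 +0000",
    "2010-01-05 02:10:00 +0000",
]

monthly = tweets_per_month(sample_timestamps)
for (year, month), count in monthly.items():
    print(f"{year}-{month:02d}: {count}")
```

The resulting month-to-count mapping is what you would hand to a bar chart.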

Day of Week

I wouldn't have guessed that I tend to tweet the most on hump day, but it looks like I do more tweeting on Wednesday than I do on any other day of the week.

This Twitter-wide analysis shows Tuesday, followed by Wednesday, as the most popular days to tweet. (Although I've seen variations on this chart that show Thursday and Friday as the most popular.)
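The day-of-week tally for a single archive can be sketched along these lines. This is an illustration rather than the repo's code: the timestamp format is an assumption and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def tweets_per_weekday(timestamps, fmt="%Y-%m-%d %H:%M:%S %z"):
    """Return a {day name: tweet count} dict covering all seven days."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    counts = Counter(datetime.strptime(raw, fmt).weekday() for raw in timestamps)
    return {DAY_NAMES[i]: counts.get(i, 0) for i in range(7)}


# Invented sample: two Wednesdays and one Tuesday in October 2009.
sample_timestamps = [
    "2009-10-14 15:00:00 +0000",
    "2009-10-21 16:30:00 +0000",
    "2009-10-13 12:00:00 +0000",
]

by_day = tweets_per_weekday(sample_timestamps)
busiest = max(by_day, key=by_day.get)
print(busiest, by_day[busiest])
```

Days with no tweets are kept as zeros so the chart always shows a full week.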

By Hour

I also tend to tweet in the late morning and afternoon, which is fairly typical.
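One thing to watch with an hourly breakdown is the time zone. If the timestamps are stored in UTC, they need shifting to local time before "late morning" means anything. The sketch below assumes UTC timestamps in the format shown and a fixed UTC-5 offset that ignores daylight saving time. Both are assumptions for illustration, not a description of what the repo does.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset, ignoring daylight saving time.
LOCAL_TZ = timezone(timedelta(hours=-5))


def tweets_per_local_hour(timestamps, tz=LOCAL_TZ, fmt="%Y-%m-%d %H:%M:%S %z"):
    """Return a 24-item list where index h is the tweet count for local hour h."""
    counts = Counter()
    for raw in timestamps:
        local = datetime.strptime(raw, fmt).astimezone(tz)
        counts[local.hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]


# Invented sample timestamps, given in UTC.
sample_timestamps = [
    "2009-10-14 15:00:00 +0000",  # 10:00 local
    "2009-10-14 03:30:00 +0000",  # 22:30 local, the previous evening
    "2009-10-21 15:45:00 +0000",  # 10:45 local
]

hourly = tweets_per_local_hour(sample_timestamps)
print(hourly[10], hourly[22])
```

The same shift also matters for the day-of-week chart, since a late-evening tweet can land on the next day in UTC.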

By Type of Tweet

This chart stacks the type of tweet by month. At first glance, it looks like I am a lot less original and a lot more social now than I was in 2009. While that may be true, Twitter didn't formalize RTs until the end of 2009, which probably accounts for the bloated original tweet count before then.
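Here is a rough sketch of how tweets might be sorted into those types. The column names `retweeted_status_id` and `in_reply_to_status_id` are assumptions about the archive's CSV layout, and the rows are invented, so treat this as an illustration rather than the repo's actual logic. The text check for "RT @" is one way to catch old-style manual retweets, which carry no retweet ID and would otherwise count as original.

```python
from collections import Counter


def classify_tweet(row):
    """Label a tweet row (a dict, as csv.DictReader would yield) by type."""
    text = row.get("text", "")
    # Official retweets carry an ID; manual ones only have the "RT @" prefix.
    if row.get("retweeted_status_id") or text.startswith("RT @"):
        return "retweet"
    if row.get("in_reply_to_status_id") or text.startswith("@"):
        return "reply"
    return "original"


# Invented rows using the assumed column names.
sample_rows = [
    {"text": "Hello world", "in_reply_to_status_id": "", "retweeted_status_id": ""},
    {"text": "@friend thanks!", "in_reply_to_status_id": "123", "retweeted_status_id": ""},
    {"text": "RT @friend: neat", "in_reply_to_status_id": "", "retweeted_status_id": "456"},
    {"text": "RT @friend: old style", "in_reply_to_status_id": "", "retweeted_status_id": ""},
]

labels = [classify_tweet(row) for row in sample_rows]
type_counts = Counter(labels)
print(type_counts)
```

Dropping the `startswith("RT @")` check would reproduce the inflated original-tweet count for the period before official retweets existed.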

Day of Week, Month, Year

Here we see 'peak tweet' again, sometime between September 2009 and February 2010. I remember Twitter being a major creative outlet for me at the time. I've since found more creative outlets than I know what to do with.

PyTagCloud

Now onto everyone's favourite type of text visualization: Word clouds!

Most Common Tweet Words

Here are the words I use most often when I tweet. I talk about heritage, Toronto and outer space.

I also like a lot of things.

PyTagCloud uses a fairly basic list of stopwords that it omits from word counts. I added internet-specific stopwords like www and http to the PyTagCloud english stopwords list and so can you!
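To show what the extra stopwords buy you, here is a small standard-library sketch of counting words with and without them. It does not use PyTagCloud's own counter or its stopword file. The tiny base stopword set and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

# A tiny stand-in for a real English stopword list (illustration only).
BASE_STOPWORDS = {"the", "a", "and", "to", "of", "in", "i", "is", "it"}
# The internet-specific additions mentioned above.
WEB_STOPWORDS = {"www", "http"}


def word_counts(texts, stopwords):
    """Count lowercase alphabetic words, skipping anything in stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


# Invented sample tweets.
sample_texts = [
    "I like heritage buildings in Toronto http://www.example.com",
    "Toronto heritage walk and outer space",
]

without_web = word_counts(sample_texts, BASE_STOPWORDS)
with_web = word_counts(sample_texts, BASE_STOPWORDS | WEB_STOPWORDS)
print(with_web.most_common(2))
```

Without the extra entries, every link in a tweet adds to the counts for http and www, and across a whole archive that is enough to crowd out real words.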

Most Common @replies

Here are the people I talk at the most.

I played around with the built-in PyTagCloud colour scheme and font options a lot.

Most Common #hashtags

And here are the hashtags I've used the most. I mostly see hashtags as gimmicky or event-related. #builtheritage is the hashtag used for a monthly Twitter chat on the subject, so I use it regularly.
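Both the @reply and hashtag clouds come down to the same step: pull out every token that starts with a given symbol and count it. Here is a minimal sketch with a regular expression, offered as an illustration rather than the repo's code. The function name `count_prefixed` and the sample tweets are made up.

```python
import re
from collections import Counter


def count_prefixed(texts, prefix):
    """Count case-folded tokens that start with prefix, such as "@" or "#"."""
    # The lookbehind skips symbols inside a word, e.g. the "@" in an email address.
    pattern = re.compile(r"(?<!\w)" + re.escape(prefix) + r"(\w+)")
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in pattern.findall(text))
    return counts


# Invented sample tweets.
sample_texts = [
    "@alice loved the #BuiltHeritage chat tonight",
    "#builtheritage walk through #Toronto with @Alice and @bob",
    "email me at someone@example.com",
]

mentions = count_prefixed(sample_texts, "@")
hashtags = count_prefixed(sample_texts, "#")
print(mentions.most_common(1), hashtags.most_common(1))
```

Lowercasing before counting means #BuiltHeritage and #builtheritage land in the same bucket, which matters for a tag that gets typed a little differently each month.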

Feedback is always appreciated.

I am a rookie and forever learning. If I've gotten something completely wrong I'd love it if you could send some constructive criticism my way.