Iraq 2006: a bag of words

How to make sense of Wikileaks data? One way is visual analysis, as we see here, via Jonathan Stray of the Associated Press:


Stray and Julian Burgess created a visualization using data from December 2006 Iraq Significant Action (SIGACT) reports from Wikileaks. That was the bloodiest month of the war, and the central (blue) point on the visualization represents homicides, i.e. clusters of reports that are “criminal events” and include the word “corpse.” These merge into green “enemy action” reports, and at the interface we have “civ, killed, shot,” civilians killed in battle. Stray tells how this was done, with some interesting notes, e.g.

…by turning each document into a list of numbers, the order of the words is lost. Once we crunch the text in this way, “the insurgents fired on the civilians” and “the civilians fired on the insurgents” are indistinguishable. Both will appear in the same cluster. This is why a vector of TF-IDF numbers is called a “bag of words” model; it’s as if we cut out all the individual words and put them in a bag, losing their relationships before further processing.

As a result, he warns that “any visualization based on a bag-of-words model cannot show distinctions that depend on word order.” (Much more explanation and detail in Stray’s original post; if you’re interested in data visualization and its relevance to the future of journalism, be sure to read it.)
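
To make that point concrete, here is a minimal Python sketch of the idea. It is not Stray and Burgess's actual code: the `tf_idf_vectors` helper, the third example sentence, and the particular weighting (term frequency times a smoothed log inverse document frequency) are my own illustrative assumptions, and real TF-IDF implementations differ in the details. It only shows that the two sentences from the quote end up with exactly the same vector.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF vector per document.

    Hypothetical helper for illustration. Each vector has one slot per
    vocabulary word, in alphabetical order, so word order within a
    document plays no part in the result.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = []
        for w in vocab:
            tf = counts[w] / len(toks)            # term frequency
            idf = math.log(n_docs / df[w]) + 1.0  # smoothed inverse doc frequency
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors


docs = [
    "the insurgents fired on the civilians",
    "the civilians fired on the insurgents",
    "a corpse was found near the river",  # made-up third report
]
vocab, vectors = tf_idf_vectors(docs)

print(vectors[0] == vectors[1])  # True: same words, different order
print(vectors[0] == vectors[2])  # False: different words
```

Anything downstream of these vectors, clustering included, has no way to tell the first two reports apart.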

Thanks to Charles Knickerbocker for pointing out the Stray post.

Mindcasting

Jay Rosen made a rich Tumblr post about mindcasting and Twitter. Mindcasting is Jay’s term for his posting style – where his goal is to have a high signal-to-noise ratio… and he’s a very active conversation engine. This post has notes on the form… e.g.

The act of building an editorial presence in Twitter by filtering, processing and structuring the flow of information that moves through the medium using one’s follow list, journalistic sensibilities and individual right to publish updates.

Also “It’s true that mindcasting is a pretentious term. People have always told me that certain things I do are pretentious. Every occupation has its hazards, right? What saves mindcasting from being totally so is that it’s an alternative to an even more pretentious notion: lifecasting.” He ends with a great Julian Dibbell quote:

It may begin as just a seed of an idea — a thought about the future of online media, say — tossed out into the germinating medium of the twitterverse, passed along from one Twitter feed to another, critiqued or praised, reshaped and edited, then handed back for fleshing out on a blog, first, and then, perhaps, in a book. It’s not that tweet-size sparks of insight haven’t always been part of the media ecosystem, in other words. It’s just that Twitter now has given them a vastly more exciting social life.

Read Jay’s whole post; my excerpts here don’t do it justice. Just registering my affinity. I really like the idea of diving into the information flow and working it to raise its quality. (Wondering if I should add Tumblr as yet another venue for writing/blogging/conversation.)