Archive for the 'Information Management' Category

Neat mailing list trick

Wednesday, November 16th, 2005

This was just posted to the Full Disclosure security mailing list.

I’ve recently started using the highlight feature in evolution to apply colours to incoming mail where the ‘sender’ matches certain criteria - doing this lets me assign a pleasant (but obvious) colour to people I know and/or whose postings are interesting (respectively red and redorange), and a vile colour to those whose postings are silly/downright stupid (respectively forest green and lime green).

Doing this, I’ve found, gives me a great indicator as to the qualities of a thread - a large amount of either colour clearly indicates the general tone of the thread (and a large amount of both tends to indicate a ‘hot topic’). Suffice it to say that unless looking for a comedy moment in my afternoon, I tend to ignore those putrid green threads and head straight for a red.

So simple, so elegant. I’m definitely going to try it.
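For my own benefit, here is a rough sketch of the idea in Python. It is not how Evolution implements its highlight rules; the addresses, the colour table, and the `threshold` cut-off are all placeholders I made up to illustrate the "count the colours in a thread" part.

```python
# Hypothetical sender-to-colour table; the addresses are placeholders.
SENDER_COLOURS = {
    "friend@example.org": "red",
    "insightful@example.org": "red-orange",
    "silly@example.org": "forest green",
    "troll@example.org": "lime green",
}
PLEASANT = {"red", "red-orange"}
VILE = {"forest green", "lime green"}


def thread_tone(senders, threshold=3):
    """Guess a thread's tone from how many posts fall in each colour family.

    `threshold` (an assumed cut-off) is how many posts count as "a large amount".
    """
    pleasant = vile = 0
    for sender in senders:
        colour = SENDER_COLOURS.get(sender)  # None for unknown senders
        if colour in PLEASANT:
            pleasant += 1
        elif colour in VILE:
            vile += 1
    if pleasant >= threshold and vile >= threshold:
        return "hot topic"
    if pleasant >= threshold:
        return "worth reading"
    if vile >= threshold:
        return "comedy moment"
    return "no clear signal"
```

Unknown senders simply don't count either way, which matches the spirit of the trick: only the people you have already formed an opinion about colour the thread.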

Don’t Call ‘em “Filters”

Tuesday, July 5th, 2005

Over at The Long Tail, Chris Anderson isn’t thrilled with the term “filters”:

One of the themes that I’m developing in the book is the notion that “a Long Tail without good filters is just noise.” But what are good filters?

To begin, I’m using the catch-all term “filters” (which I’m not crazy about; anyone got a better word?) to describe the tools that help you find what’s right for you in the massive variety of the Long Tail. The examples I use most often are search and recommendations from either people (be they influential bloggers or just friends) or software, such as Amazon-style collaborative filtering (“people like you bought…”).

I don’t have a problem with “Filters,” other than the fact that they imply some sort of absolute definition of Signal.

As an alternative, then, how about “lenses?” A lens is a device which allows you to focus at specific points along a dimension. I think that much of the benefit of the long tail is the idea that you can find (focus on) your specific interests among the offerings within the tail. Extending the metaphor, a Magnifying Lens lets you get a lot of detail at a particular depth, usually at the cost of focal depth (how much of the dimension is unblurred).

Extending the metaphor even further, a wide-angle lens can let you see more than you might have normally, but often at the cost of strange visual effects which don’t create an accurate depiction of reality (causative correlation, anyone?).

In case the horse isn’t dead yet, throw in the Fresnel Lens, which basically inverts the focusing process, letting a designated point produce a disproportionate amount of visibility (similar to Anderson’s “Pre-Filter”).

Works for me.

Stats++ Update

Friday, January 28th, 2005

A few weeks ago, since I felt like the available tools don’t really do it for me, I suggested I might develop my own traffic analyzer. I’ve been working slowly but steadily since then and have created something I’ve now titled “Stats++.”

Basically, what I discovered is that the most commonly-available tools such as awstats and Webalizer weren’t capable of answering the questions I have in a dynamic-page world. For example, unless I’m serving static html pages those tools can’t tell me things like:

  1. Which postings are people reading?
    It’s a lot of work to write an article and while I do it primarily for my own enjoyment, I’d still like to prioritize which drafts get attention based on estimates of which articles people find the most interesting.
  2. Which links bring people in?
    Most of my traffic comes when I link to sites with lots of readers, like BoingBoing or Bruce Schneier’s Weblog. If I’m building a readership, then I probably want to write on the topics which bring in not only the most readers, but the most readers who then return. Thus:

  3. Which articles do people return to?
    If I write an article that produces a significant amount of return traffic, then I must have done something right the first time. This gives me a chance to look and see.

  4. Which articles do people mostly look at and never come back?
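To make questions 3 and 4 concrete, here is a minimal sketch of the kind of counting involved. It is not the Stats++ code. The `Hit` record and its fields are hypothetical, and it assumes that each request has already been tied to a visitor key (a cookie or an IP address, say) and an article identifier.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    visitor: str  # hypothetical visitor key, e.g. a cookie value or IP address
    article: str  # hypothetical article identifier, e.g. a post slug
    day: str      # ISO date of the request, e.g. "2005-01-28"


def return_rates(hits):
    """Map each article to (unique visitors, visitors seen on more than one day)."""
    # article -> visitor -> set of days on which that visitor requested it
    days_seen = defaultdict(lambda: defaultdict(set))
    for hit in hits:
        days_seen[hit.article][hit.visitor].add(hit.day)
    return {
        article: (
            len(visitors),
            sum(1 for days in visitors.values() if len(days) > 1),
        )
        for article, visitors in days_seen.items()
    }
```

An article with many unique visitors and few returning ones is a candidate answer to question 4; a high returning count points at question 3. Treating "came back on a different day" as a return visit is my simplification, and a real tool would probably want a configurable time window.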

I’m now getting pretty close to initial release. I’m more or less feature-complete, with a tool that can answer those questions reasonably well. I don’t have much time-series data display yet, but that may happen this weekend (I’ve got some serious waiting room time tomorrow, which means no Net-access procrastinating unless I decide to read a book; I’m currently reading L4yercake, which I highly recommend).

Maybe I’ll even post some sample pages if the mood strikes me.

It’s the Tools, Stupid!

Monday, January 24th, 2005

There’s an article over at Many-To-Many, “The Innovator’s Lemma,” which caught my attention this morning.

There, Jay Fienberg wrote in a comment (emphasis mine):

I would argue that, ironically, the usefulness of the tagging systems in Flickr, del.icio.us, and Technorati is that these systems remove the “same freedom to classifying” already available on the web, and constrain tagging within a more traditionally controlled system.

Hmm…this sounds to me like maybe the problem isn’t that people can’t tag or otherwise describe their data accurately enough, but rather that for the Average User, the tools have been largely non-existent until now.

Even if Flickr, del.icio.us, et al. all eventually collapse under the weight of their own chaos, they will be remembered in much the same way that Lotus 1-2-3 is still remembered as the Original Spreadsheet–it wasn’t the first (that was VisiCalc), but it was the first to bring an accessible accounting tool and, as a result, awareness of spreadsheets to the masses.

Gopher may have been the first attempt at an Internet-scale hierarchical classification system, but these days, outside of fora like this no one has ever heard of it. The population of the Internet was only about 1% of what it is now when Gopher effectively died. Flickr, del.icio.us and even Wikipedia will have the honor of being remembered as the “first” in the public’s eyes, not because they were actually the first but because they were the first to offer it as a tool which was grok-able to the Average User.

The Average User doesn’t want a good tool–he probably lacks the sophistication to know if his tool is good or bad. What the Average User wants is the same tool as his friends are using. That way, when one of his friends figures out something clever, he can leverage that discovery with a minimum of effort.

So now that the tools are out of the box, the Kayak Metaphor is the only one that’s left (unless I want to come up with some lame metaphor about a car with a stuck accelerator, no brakes, and broken tie rods, and the only working feature being the Hi-Beam switch for the headlights, but I think that one’s been used too much already).

I tend to laugh at people who fret about people being able to Do As They Please with technology for two reasons. First, because I used to be one of them; and second, because if people didn’t keep creating new problems, no one would pay us to solve them.

Book-ists, Web-ists and Abhorrence of Vacuum

Tuesday, January 11th, 2005

So I’m reading this relatively-spirited debate between Clay Shirky and Louis Rosenfeld about formal versus informal naming schemes (“Controlled Vocabularies” versus “Folksonomies”) as applied to social networking and feel like I’m having a flashback to ten years ago when I was a graduate student working on my Masters of Library Science.

The faculty and students quickly split into two factions, whom I always thought of as the “Book-ists” and the “Web-ists.” The Book-ists were committed to the idea that information should be stored in Books, which in turn should be stored in libraries. If you wanted access to one of those books, you should want it badly enough that you would come to a library, search through either a card catalog or a ludicrously slow mainframe terminal (which you probably had to wait in line to use), then go find the book on the shelves, hoping all the while that no one else had been just as motivated as you to acquire this bit of knowledge and beaten you to it, either checking out or stealing the book that you wanted. If it sounds like an inferior way to go about getting access to information, it’s because it was. Thus, the Book-ists effectively removed themselves from the debate over how to apply the librarian’s view of Information Management to the Web.

Meanwhile, over in the Business School, I found myself doing some TA’ing to make ends meet, teaching Internet technologies, including the Web, to MBA students. They couldn’t have cared less how the information was organized at a macro level. They just wanted two things:
1) To get Their Stuff on the Web
2) To make some money in the process

Meanwhile, the Web-ists back at the library school were busy arguing about what approach everyone should be taking to classifying their data. I remember debates over how best to apply the META tags in document headers, or simply insisting that the Web was too chaotic to ever replace nice, tidy information tools like Gopher. The crew over at the business school, however, not knowing what they didn’t know, were hiring programmers and graphic artists (re-christened as “Web Developers”) and cooking up the dot-com boom as fast as they could type Business Plans.

So now the World Wide Web was coming out of nowhere, with Homepages popping up all over the place, and anybody could just stick in a link to another page with no consideration of whether it made sense, whether that Homepage would be there tomorrow, or what the heck was going on. There was no way to find a homepage except to either take a guess (IBM? Maybe they’re at “www.ibm.com”) or to ask someone you knew if they had seen a particular page.

That was the vacuum. So about that same time, a couple of guys at Stanford began obsessively bookmarking pretty much any and every site they encountered using a roughly hierarchical set of bookmarks. They then published their bookmarks on their own web server as Yet Another Hierarchical Officious Oracle. Thus was born the Search Engine, and with it the End Of Relevance for the Web-ists’ debates over how best to catalog Web sites.

Suddenly, the definition of “good” and “bad” things to put in META tags shifted away from any attempt at rational definition and was instead defined by one simple question: Will this increase or decrease my pages’ rankings in search engine results? An entire industry sprang up devoted to software and/or services to help get sites indexed by search engines, hopefully in a way which improved their ranking in searches for a particular term. The Information Management experts had missed their opportunity yet again.

Now, I learn that the debate over whether only trained experts can successfully catalog data is still alive and well. Personally, I think the whole argument is moot. As the past ten years have now shown, so long as people get halfway close to right in describing their metadata, the Search Engines will do the rest. To try to argue that only experts can successfully catalog data requires you to pretend that the past ten years never existed.

Blogging and the Javascript Paradox

Tuesday, January 11th, 2005

The Javascript Paradox is something I used to joke about back when I was heavily involved in Web development. At the time, there were a tremendous number of graphic designers who’d gotten a copy of DreamWeaver or learned just enough HTML to be dangerous and thereby been able to double or triple their salaries by christening themselves “Web developers.”

I usually encountered these people because they were in the midst of learning The Hard Way that creating attractive Web sites and creating useful Web sites were very different skillsets. So while my web efforts usually looked pretty bad, they tended to provide lots of functionality; theirs, on the other hand, looked great but didn’t actually do anything.

Usually, these beautiful sites would have some bits of Javascript coolness on them that provided some bit of dynamic functionality such as graphics that changed when the mouse moved over them, creating the appearance of interactivity, and I would always ask the Web designer if he or she had written that bit of functionality themselves. “No, I copied it from this other Web site,” was pretty much the universal response (at least until DreamWeaver began to automate the creation of things like image rollovers, at which time they seemed to go out-of-vogue. Go figure.).

As best I could tell, no one ever wrote Javascript–it all was copied and pasted from other Web sites, and this was what I termed the Javascript Paradox: If everyone copies their Javascript from other places, then where does it come from? It doesn’t spontaneously spring into being, yet I could never personally find anyone who actually wrote any non-trivial Javascript. Now I know that there are sites where Javascript developers post their work, but once again, those sites seem to mostly contain re-implementations or adaptations of previous efforts, so I’m back to where I started and the Paradox still holds true.

More and more, blogs mostly comment on and link to other blogs, and I admit that I’m about as guilty as most. In some cases, this is quite defensible, if the purpose is to conduct a written dialogue or even just to provide some running commentary about that dialogue. I think that the recent discussion between Cory Doctorow and Chris Anderson regarding market forces and Digital “Rights” Management (DRM) is an excellent example of the cross-linked discussion creating a dynamic which I think probably wouldn’t exist in another format.

Nevertheless, I realized that last week, all I was doing was adding my own snarky comments about postings on Boing-Boing, but not really making any real contribution to the discussions. This got me to thinking…are blogs really nothing more than a prose version of the Javascript Paradox? So I took a look, and what I decided is that Blogs seem to fall into one of three categories:

  1. News Aggregators: Sites that pull together news items in a related subject area. Slashdot is a great example of an almost-pure News Aggregator.
  2. Content Generators: The rarest but also the most useful blogs. These are people Adding Value by creating thoughts where none previously existed. Pure Content Generators are extremely rare since much of the time, some News Aggregation is necessary for the new content to make sense.
  3. Me-Too Sites: These are sites that aspire to be News Aggregators, but never find any news that hasn’t already been identified by a major News Aggregator. I’m constantly amazed at (and guilty of) this tendency. These are sites that comment on other blogs, but never generate anything that someone else might want to comment on. This is my fear–that as I become more and more steeped in blogs as my initial point-of-contact for information, I will forget that there are other ways to find things that interest me.

I mean, really, how many blog entries do I really need to see pointing to another blog entry about Time Magazine decreeing 2004 the “Year of the Blogger?” You’re now simultaneously important and invisible. This is just like my life in the Information Security business; you’ll either get used to it or get out.

Thus, I am setting a public standard for myself: If I don’t add value by posting, then I won’t. Will this hurt my posting frequency? Probably. I have a lot of ideas, but not a lot of time to type them up. That’s what RSS is for–to efficiently notify people when new content appears on infrequently-updated blogs.

How do I count thee? Let me try the ways…

Wednesday, January 5th, 2005

Over on Boing-Boing, they’ve brought back their traffic stats and introduced them, along with A few notes.

I was interested to see that they also use awstats as the basis for their traffic logging. While my stats aren’t nearly as impressive as theirs (No, I don’t have 200,000-300,000 readers, which is probably just as well since the bandwidth bill would bankrupt me), I do have enough traffic that my stats can be interesting.

My great frustration, though, is with the “flat” statistics that tools like awstats generate. I’ve got a number of log analysis tools all running and am finding that they produce pretty similar reports, but that none of them allow me to do the sort of reporting I’m interested in without some fairly absurd configuration and/or scripting gymnastics.

I’m a big fan of drill-down analysis–the ability to filter my traffic on a potentially arbitrary number of variables. For example, I might be interested in seeing if there is a correlation between browser type and traffic to a given page. This is simply not possible with “flat” tools.
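As a rough illustration of what I mean by drill-down, here is a small Python sketch. It assumes each log entry has already been parsed into a dictionary. The field names (`page`, `browser`, `referrer`) and the `drill_down` helper are hypothetical, not part of awstats or any other tool.

```python
from collections import Counter


def drill_down(records, group_by, filters=None):
    """Count records per `group_by` value, keeping only rows that match every filter.

    `records` is an iterable of dicts (parsed log entries); `filters` maps
    field names to the values they must equal. Both layouts are assumptions.
    """
    filters = filters or {}
    counts = Counter()
    for record in records:
        if all(record.get(field) == wanted for field, wanted in filters.items()):
            counts[record.get(group_by)] += 1
    return counts


# Hypothetical parsed log entries, just to show the call shape.
sample = [
    {"page": "/stats", "browser": "Firefox", "referrer": "boingboing.net"},
    {"page": "/stats", "browser": "Safari", "referrer": "boingboing.net"},
    {"page": "/lenses", "browser": "Firefox", "referrer": "direct"},
]
by_browser = drill_down(sample, "browser", {"page": "/stats"})
```

Because the filters are just a dictionary, you can stack as many conditions as you like and then pivot on any remaining field, which is exactly what the "flat" reports won't let me do.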

I’ve done some pretty heavy-duty work on near-realtime log processing and analysis in the past (Think archiving millions of maillog entries/day or thousands of network connections/minute). Maybe it’s time to think about “scratching the itch” and writing something that does what I’m looking for.

My question to The World, then, becomes, “What sort of web traffic data analysis would you love to see but don’t?”