Archive for the 'internet' Category

Where’s my data?

Jim Benson once again wrote something very insightful - I am detecting a pattern here. Free Services Are Unaccountable Stewards. If you rely on others to safeguard your data and make sure it is accessible to you when you need it, there is always an SLA involved. A Service Level Agreement - think about it, and look it up for the services that you use. Sometimes it’s just implied (usually with systems that don’t require a login like Google), in other cases there’s some legal language that defines the responsibilities and promises - or lack thereof. For example look at the Terms of Service over at Wordpress.Com. Here’s one of my favorite excerpts:

…in no event will Automattic, its suppliers or its licensors be liable to you or any other party for any direct, indirect, special, consequential or exemplary damages, regardless of the basis or nature of the claim, resulting from any use of the Website, or the contents thereof or of any hyperlinked website including without limitation any lost profits, business interruption, loss of data or otherwise, even if Automattic, its suppliers or its licensors were expressly advised of the possibility of such damages.

Automattic are the nice folks who make Wordpress.Com available for free. Don’t get me wrong, there’s nothing wrong with this language - it’s a free service. What I am pointing out is the risk that you take with your data here. Same goes for Jim’s Gmail example. Or del.icio.us. These are all free and therefore you’re mostly on your own.

Personally I am rather uncomfortable with that. My Gmail account is almost exclusively used for throw away email (registrations at web sites and other things likely to just attract spam). My main email address (hohndel.org) gets you to a server running in my office (server is a loose description - it’s a Mac Mini). My Wordpress based blogs run on that same server. I control where the data lives. I control the backup schedule. And yes, if power is out or my DSL link is down then my servers are down, too. That’s the price I pay. But at least my data is safe. Let me rephrase that. At least I control how safe my data is.

So many bots, so little time

The number of bots that are crawling my server is getting out of hand. A quick survey of the log files showed that about two thirds of all requests are coming from bots. Many are genuine (the nice folks at Bloglines or the billionaires at Google). But a lot are at least suspicious if not known to be evil.
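A survey like this can be roughed out in a few lines of Python. This is only a sketch, not the method I actually used: it assumes Apache's "combined" log format (where the user agent is the last quoted field), and both the `BOT_MARKERS` substrings and the sample log lines are made up for illustration.

```python
import re

# Hypothetical substrings that mark a user agent as a bot (an assumption,
# not a complete or authoritative list).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

# In Apache's "combined" log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def is_bot(log_line):
    """Return True if the line's user agent contains one of the bot markers."""
    match = UA_PATTERN.search(log_line)
    if match is None:
        return False
    agent = match.group(1).lower()
    return any(marker in agent for marker in BOT_MARKERS)


def bot_share(log_lines):
    """Fraction of non-empty log lines (0.0 to 1.0) that come from bots."""
    lines = [line for line in log_lines if line.strip()]
    if not lines:
        return 0.0
    return sum(is_bot(line) for line in lines) / len(lines)


# Made-up sample lines: two bots and one ordinary browser.
SAMPLE_LOG = [
    '1.2.3.4 - - [10/Oct/2006:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '5.6.7.8 - - [10/Oct/2006:13:55:40 -0700] "GET /images/a.jpg HTTP/1.1" 200 5120 "-" '
    '"ExampleHarvestBot/1.0"',
    '9.9.9.9 - - [10/Oct/2006:13:56:02 -0700] "GET /blog/ HTTP/1.1" 200 9876 '
    '"http://example.com/" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X) Safari/419.3"',
]

print(f"bot share: {bot_share(SAMPLE_LOG):.0%}")
```

Matching on substrings is crude (it misses bots that fake a browser user agent), but it is enough to get a first impression of how much of the traffic is not human.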

Googling for the bot name (if given in the HTTP_USER_AGENT part of the request) gets you to many discussion threads listing many of the crawlers that you don’t want to visit your site (email harvesters, image harvesters, spam bots, etc) and many who are of unknown purpose (which in this day and age means that most likely you want to block them). Very interesting is this three part thread over at WebmasterWorld which discusses a few of the bots and more importantly good ways to get rid of them, especially those that ignore your robots.txt (and there are many other similar threads elsewhere).

I followed the consensus and decided to be a little more aggressive - a lengthy list of bots simply gets a Forbidden response from the Apache server. mod_rewrite is your friend.
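The actual blocking happens in the Apache configuration, which I am not reproducing here. The decision that such rules encode is simple, though, and can be sketched in Python. The names in `BLOCKED_AGENTS` are placeholders rather than my real list, and `status_for` is a hypothetical helper, not part of Apache or mod_rewrite.

```python
import re

# Hypothetical blocklist; these names are placeholders, not a real list.
BLOCKED_AGENTS = re.compile(r"EmailHarvester|ImageGrabber|SpamBot", re.IGNORECASE)


def status_for(user_agent):
    """Return 403 (Forbidden) for blocklisted user agents, 200 otherwise."""
    if BLOCKED_AGENTS.search(user_agent or ""):
        return 403
    return 200


print(status_for("Mozilla/4.0 (compatible; EmailHarvester 1.2)"))
print(status_for("Mozilla/5.0 (Macintosh; U; PPC Mac OS X) Safari/419.3"))
```

The useful property of blocking on the user agent is that it does not rely on the bot cooperating, which is exactly what you need for crawlers that ignore robots.txt.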

Since I started blocking most of the bots I have noticed two good side effects: on the one hand less clutter in the log files, on the other hand less traffic, which means better response times for the people actually looking at my blogs (I had one bot pulling about 50MB worth of images over and over again from the site).

Email clients

I have used so many of them. The original Berkeley mail. Then elm, pine, vm (under Xemacs) and finally mutt. Those are all text console based and (at the risk of getting myself flamed here) are sorted in order of usefulness - with mutt clearly superior to the rest. They work exceptionally well if you don’t get a lot of HTML emails and if you don’t expect seamless integration of pictures, rich text documents and other attachments. Which, btw, until only a few years ago, meant they worked very well with the vast majority of email.

I also was exposed to the frightening class of gui-based email programs. The distressing Lotus Notes (which back then didn’t even speak the most basic Internet email protocols correctly - allegedly that’s fixed now). The utterly frightening Outlook Express. The omnipresent Outlook (which is not terrible as far as gui-based email programs go, but has all of their shortcomings that I’ll get to in a moment). Right now I use Entourage for work email - which in many ways is nicer than Outlook (for example, it runs on OS X and is reasonably well integrated into that which gives it a nice touch compared to Outlook), but in other ways worse (as it competes with Outlook, is from the same small software company in Redmond, WA, and still isn’t able to fully integrate with that same company’s Exchange server - how ridiculous is that? Entourage doesn’t understand MAPI and instead uses WebDAV to talk to Exchange - which simply takes a lot of potential features away).

And it’s sad to say, there’s a group of programs that’s even worse - the open source gui email programs (like Evolution, Thunderbird or Kmail). Why am I so negative? Well, they compete with Outlook and they don’t come even close. None of them can really integrate with the Exchange calendar (Evolution tries to but fails badly). None of them has a gui that’s even close to what Outlook or Entourage have to offer. They are slow (try using them with a 250MB mailbox under Exchange) and are simply hard to use - even allowing for the fact that gui-clients in general are bad for email…

Here, I said it again… so why do I dislike gui-clients so much when it comes to email? Simple. If you are dealing with a lot of email (and who isn’t, given the spam pandemic) then the number one task of an email client is to allow you to quickly sort, view and discard email based on a variety of criteria. Mail thread boring? Delete all emails in it. Mail author annoying you? Gone are his emails. Which other emails have I received from this person? Which emails have a subject that contains the word “blog”?

Sure, you can do all of these with the gui programs. But that requires you to touch the mouse. Bzzzzt. Disqualified. If I get to an inbox with 400 new messages since yesterday evening (not unusual) I don’t have the time to keep moving from keyboard to mouse and back.

But let’s say for the sake of argument that there was a gui client that had a decent keyboard interface. That still leaves you with the problem that it will try to render all the stuff that people send you. Which is fine for the 5% of your email that you actually want to read in detail. And for the rest it is at best a waste (and with Outlook on Windows, often quite dangerous).

“But it’s so easy to use the gui clients!”, I hear you say. Yep, for the occasional or newbie user. But once you have spent some time with your email client (and again, this whole posting assumes that you get a serious amount of email - so you’ll be there soon enough) then everything that made the gui clients so easy to use at first now makes them even more annoying.

Yes, for people who love to send pictures or other embedded objects around to others, mutt is not as pretty. And the learning curve is steep. But I think it’s worth it. I use it every day for all my email at hohndel.org and just love it. Even though I read those emails on Macs these days, which means I’d have access to Mail.app - one of the better gui-clients out there. But a good text based mailer like mutt beats Mail.app for large volumes of email, any time.

Migrating from Blosxom to WordPress

So I decided to move from Blosxom to WordPress, first for my personal blog and then for my tech blog. And since I had about 550 postings and around 40 or so comments in my personal blog I needed a way to migrate my data. Googling didn’t find anything even remotely useful (the “import via RSS” suggestions simply lost too much formatting - things looked terrible, given how many pictures I have). Instead I figured I’d write a perl script that would do the hard work; pull all the postings and comments from Blosxom and import them into WordPress. Looking at the structure of the existing import scripts (and the fact that I know far less php than perl) I decided not to integrate this into WordPress but instead to insert the data directly into the mysql database. That should be fun. And amazingly it took not nearly as long as I feared!

Now I want to share what I learned with the rest of you, but the more I look at the script that I wrote, the more I realize that it is based on so many assumptions that it might be almost useless to anyone else. But then again, maybe it can help someone in a similar situation as a starting point. Writing it certainly helped me understand why WordPress doesn’t have an import function for Blosxom.

Here’s the fundamental idea of what I did

  • install WordPress on the target system. One assumption made in the script is that you can access the mysql database from a system that has the Blosxom files accessible in its file system.
  • set up the new blog. Depending on your needs you may have to find (or write) a theme that is similar to your Blosxom theme. In my case (the personal blog, not this one) the formatting of many of the postings was based on this being a fixed width theme of a certain width with certain classes defined in the CSS, certain margins set around different HTML objects, etc. So I started from something reasonably similar and then more or less wrote my own theme.
  • delete the default posting and comment, make any other changes you want (blogroll, etc) and set up all your categories (important - the script will fail if it finds a category that doesn’t exist).
  • back up your database
  • I mean it. Use the wp-backup plugin. Or do it manually in mysql. But back it up. Really. I restored this backup quite a few times while working around bugs in the script, typos in the blog postings, etc.
  • download the blosxomtowp.pl script.
  • read the script. Edit the variables at the top of the script. Look through the assumptions made. Here are the ones I’m aware of, but you really might want to read through the script and compare with your file system layout, posting structure, etc.
    • it assumes that you have shell access to the machine that blosxom runs on and that you can connect to the WP mysql database from that machine
    • it assumes that you use directories under the main blosxom blog directory for your category hierarchy - just as with using the “categorytree” plugin
    • it assumes that you use the “meta” and “metadate” plugin to set the date on your postings (but it’s easy to change this to use the file time stamp instead - I just haven’t done that)
    • it assumes that you are using the “feedback” plugin for comments (but I think “writeback” and some others have similar file layouts and formats)
    • it assumes that you have already created /all/ categories that you have in blosxom in your WP database
    • it assumes the database table layout in WP-2.0.5

    Figure out what else you want to preserve (assuming you have different plugins than I had). Figure out what you can live without.

  • you did back up the wp database, right?
  • go to the main directory of your blosxom tree and run the script on one posting
    …/path/blosxomtowp.pl misc/aposting.blog (note that I used “.blog” as suffix - for most people that will be “.txt”).
  • check your blog in a web browser. Did the posting show up? Does everything look right?
  • start debugging. wp-phpmyadmin was a huge help for me to see what went wrong in the mysql database
  • once this works for a few postings you can slurp all of it in (don’t forget to restore the backup, first, so you don’t get duplicate postings):
    find . -name \*.blog | xargs …/path/blosxomtowp.pl
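To give an idea of the first step the script has to perform, here is a small sketch of how a Blosxom entry breaks down. It is written in Python rather than perl and it is not the blosxomtowp.pl code. It relies only on the basic Blosxom conventions (first line is the title, the rest is the body, the directory is the category). The function name and the dictionary keys are my own choices for illustration; handling of dates, comments and the actual database insert is left out.

```python
from pathlib import PurePosixPath


def parse_entry(relative_path, text):
    """Split a Blosxom entry into category, slug, title and body.

    relative_path is the path below the main Blosxom directory,
    for example "misc/aposting.blog"; text is the content of that file.
    """
    path = PurePosixPath(relative_path)
    # The first line of the file is the title, everything after it the body.
    title, _, body = text.partition("\n")
    # Entries in the top-level directory have no category.
    category = "" if str(path.parent) == "." else str(path.parent)
    return {
        "category": category,
        "slug": path.stem,  # file name without the ".blog" / ".txt" suffix
        "title": title.strip(),
        "body": body.strip(),
    }


entry = parse_entry("misc/aposting.blog", "A posting\n<p>Hello from Blosxom.</p>\n")
print(entry["category"], "|", entry["title"])
```

The category string is what has to match a category that already exists in WordPress, which is why the script fails when it finds one that you have not created yet.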

I’m sure I’m forgetting a lot of things here. Please comment if you have additions, improvements, suggestions. The script is under the GPL, I’d be happy to accept fixes from anyone, but especially from people who actually are better at writing perl than I am (that’s not a high hurdle) and who can help me clean up the code.

Why bother?

Video sharing is the big thing. YouTube (even got mentioned on NPR this morning), Google Video, or Veoh all generate lots of uploads and downloads and in general enough interest that people are willing to throw venture money at them.

The question that I still wonder about is simple. Once you remove all the illegal content (even Bill Gates watches it… can’t link to the original article in the Wall Street Journal (paid subscription only… how 90s), but many other blogs are quoting the important part), what’s left that people will actually be willing to spend time watching?

Personal content tends to be of widely varying quality. And the median is nowhere near “good”, if you get my drift. It’s like blogs (which no one reads, just like this one). Only worse as it is actually much harder to make a good video than to write a good blog entry.

So what’s the solution? In my mind it’s true on-demand (sorry Comcast) delivery of existing TV content over the internet. Veoh is said to be in talks with network television (and hopefully with Comedy Central, as what I really want is “The Daily Show” and “The Colbert Report”) to distribute their content.

That can create that base traffic and viewership that it takes to make a site interesting to advertisers. And that can finance the infrastructure that allows people to share the cute videos of their dog.

Tracking where you are…

Many people are concerned about the government tracking their whereabouts… others (like me) have started playing with a cool little tool that does just that - and even makes it accessible to anyone who’s curious and maps it on a Google-Maps background.

Plazes uses the MAC address of the router you are connected to in order to determine your location. They have a database of thousands of locations that they recognize, and as a user you contribute to this database every time you get to a location that they don’t have (i.e., where they don’t recognize the MAC address of the router), which you then get to enter / describe.
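The idea is simple enough to sketch in a few lines of Python. This is only an illustration of the lookup-and-contribute scheme described above, not how Plazes actually implements it: the dictionary stands in for their database, and the MAC addresses and place names are made up.

```python
# Hypothetical in-memory stand-in for the location database.
known_places = {"00:11:22:33:44:55": "Office"}


def normalize(mac):
    """Bring a MAC address into one canonical form (lower case, colons)."""
    return mac.strip().lower().replace("-", ":")


def locate(mac, describe=None):
    """Return the place name for a router MAC address, or None if unknown.

    If the router is unknown and the user supplies a description,
    the new place is added to the database first.
    """
    key = normalize(mac)
    if key not in known_places and describe is not None:
        known_places[key] = describe
    return known_places.get(key)


print(locate("00-11-22-33-44-55"))                         # known router
print(locate("AA:BB:CC:DD:EE:FF"))                         # unknown router
print(locate("AA:BB:CC:DD:EE:FF", describe="Coffee shop"))  # user contributes it
```

Every user who describes an unknown router makes the database a little more useful for the next person who connects there.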

There are several things you can do with this, one is the trace of my travel included here

The next one is interactive (but some versions of Firefox AdblockPlus don’t like it) - click on “show recent” ;-)

An even cuter view (with the Google Maps that I mentioned above) can be seen on their site as where is Dirk.

Social networks everywhere…

One of the cornerstones of the so-called “Web-2.0” is the social network. People meet online and share things. Photos (flickr), recommendations (LinkedIn), tags (del.icio.us), anything (box.net).

It’s an incredibly powerful concept - but with so many sites and so many communities I am beginning to wonder about the richness of each of these communities. It takes a lot of people, at least some of whom share some interests with a user, before such a community will really be useful to that user. And with so many communities trying to get our interest… I guess there are plenty of people out there on the net (more than a billion I hear), so I shouldn’t be concerned… but then I wonder whether I actually manage to benefit from all the communities that I am a member of.

If only I had more time…

Net neutrality in danger…

It is fascinating what businesses that can see their own business model failing will do in order to get a free ride on someone else’s business model.

The telcos have received something like 200 Billion dollars from taxpayers and customers to build out the broadband network (not the lame one we have in the US right now, where my 6Mbps DSL link is one of the fastest things around, but the one they promised Congress to build ten years ago that connects most households with 45Mbps glass fiber links). And we, the customers, pay a lot (if compared to other nations like France, Korea, or Germany) for those rather narrow “broadband” links.

But apparently all this isn’t enough. The telcos now want to charge content providers for use of “the cloud” as well. So not only does the end customer pay his or her ISP and telco, and the service provider pay their ISP and telco, they also want to charge for the passage rights in the middle - literally changing the Internet into a set of toll roads.

No wonder that Google is rumored to be building its own “Internet”.

But if you look at it, this is just another continuation of a trend of companies trying to kill progress that threatens their business. Whether it’s Hollywood trying to prevent the VCR, the RIAA and MPAA trying to prevent a digital business model for media delivery, or the telcos trying to prevent growth of online businesses that threaten their voice quasi-monopoly - looking for ways to derail a competitor and disruptor is nothing new.

But the community that has formed around these online businesses and online use models must stand up and fight against attempts to derail innovation. If the telcos succeed, that will be the end of the Internet as we know it today. And that Internet has brought us so many valuable innovations (like open source, for example) that we clearly want to keep it free to all, as it is today.

Google’s subpoena is a sign of some of the downside of their server centric approach…

A lot has been written (like at the Mercury News or at Forbes) and said (for example at NPR) about the subpoena that Google refuses to follow, which asks them to hand over data on searches to the US government. Most articles wonder if the US government is overstepping its rights by demanding these data, but the Washington Post, among others, points out another angle of this story that I find equally interesting (and a little less scary for a non-citizen, living in the US, to discuss in his public blog (or for that matter, in any private phone call, as the NSA scandal tells us)):

Suppose we follow Google’s vision of web based computing, where all your data resides on Google’s servers and you use a browser to get access to it. Whether it’s Gmail or Google Analytics, things that are inherently yours, and at least some of which might be private, are suddenly not only out of your direct control but, more importantly, conveniently pooled in one place with similar data of many others for “easy” access by interested parties. Be it the government, be it hackers, be it a profit oriented enterprise looking for new ways to woo advertisers.

Yes, not having your data locally solves some issues. You don’t have to worry about backups (assuming they do a good job at the service provider - oh, btw, since you don’t pay for the service and don’t really have a contract for it - do you really understand how diligent they are about backups? - and have you noticed that for some of the services you cannot make a local backup of the data, your data, even if you were so inclined?). You can access your data from almost any computer almost anywhere, and the services are actually quite functional, well designed and sleek (I’m bothered to admit that I am using Google Analytics myself for my personal blog and likely soon for this one as well - it’s functional and provides you lots of insight into your audience; much more than the freely available local analysis packages that I have found).

Still, I’d really prefer to have my data local. To control my own backup schedule. To not have to trust a corporation to keep my private and personal information just that, private and personal. I’m always amused when Microsoft Money asks me if I want to store my data on Microsoft’s website (well, it’s not really asking - it’s actually trying to force me to do that, as some functionality is restricted if you don’t want to hand all of your personal financial data, including credit card numbers and investment activities, to Microsoft…)

Call me old fashioned, but I like the illusion that my data is actually mine.
