/dev/neal: 2008

Wednesday, October 15, 2008

The Forgotten Half of Computer Literacy

Gruber today noted that the New York Times is adding an API to their campaign finance data (bravo!). He also pointed to an article from 2006 titled "The programmer as journalist: a Q&A with Adrian Holovaty".

The interview is pretty concise and lays out how and why journalism can better be accomplished with help from software. Holovaty breaks down the journalistic process of collecting, filtering and disseminating information and how computer automation can be applied to each step, leading to better journalism. Near the end, he argues that all journalists should have at least some experience with programming, if for no other reason than just to know what is possible.

His points are persuasive, but they're also applicable to almost any field - not just journalism. I believe that everyone should at least know how to write software in the same way that everyone should be able to write in general.

Literacy describes the ability to read and write. When we talk about computer literacy, we usually think of the ability to use software - which I would liken to the "reading" part. We often ignore the "writing" part - the ability to make software. But writing software is incredibly powerful. If you write a page-turning novel, it inspires the imagination of everyone who reads it. If you write an informative article, all its readers are enlightened. Likewise, great software entertains and/or empowers every person who makes use of it.

The analogy holds up pretty well. No - it holds up too well - it's not an analogy at all! Writing software is writing. It might have an apparently arcane grammar, but making software is merely writing down instructions that a computer can understand. And the language is not more complicated than English. Have you seen English? It's the most bastardized, crazy language on the planet! The languages we use to talk to computers are incredibly simple - necessarily so. Computers aren't going to pick the correct meaning of a word out of a dozen possibilities based on context. They're not going to detect our sarcasm in step two of the instructions we give them. They need to be told exactly what to do in the simplest way possible, and programming languages are the way we do that. Consequently, programming languages have only gotten simpler over the years. Strong typing is going the way of the dodo. How often do you need to know what a pointer is any more? All the complicated things are getting hidden away in libraries or abstracted by the VM. There has never been a easier time to become a programmer than today.

I don't believe that everyone has to become a "programmer" any more than everyone has to become a "writer". But everyone should be able to write; a sentence, a paragraph, an essay, a script that their computer executes.

Tuesday, October 14, 2008

Exercising Git

At work I use SVN for source control, but I've been itching to try out Git. I also recently stumbled across Project Euler, a series of math problems intended to be solved by writing small programs. I think you see where this is going. So - as of a few minutes ago - my Project Euler solutions (such as they are) are up on my github account.

I am a complete Git newb, so I followed some instructions on setting up Git and a new github repository. Since I can see all the code on github, I guess it's working.

My Project Euler solutions have been very fun so far. You're allowed to implement your solution in whatever language you want. I chose Ruby because it is the most expressive language I've found.

My solutions involving factoring primes have been a mixed bag. I think I did ok finding the largest prime factor of 600851475143 (problem 3), but my solution to problem 10 (sum all the primes below two million) was an ugly brute force. I really should have known a better algorithm (I won't say it by name here, but it was referenced in one of the Dark Tower books).

Some of the problems have been quite easy, but I'm only up to problem 17. I've peeked ahead and it looks like they get pretty tough. Sometimes it feels like using Ruby is cheating - for instance, problem 16 is "sum all the digits of 2^1000". Pity the poor C++ programmer!

The tricks I have up my sleeve are mainly inject, memoization and recursion. Recursion is especially helpful in simplifying problems. Memoization can be invaluable when you're performing expensive math. Inject is just plain fun - who wants to write all that boring loop code?

One reason that people might shy away from Ruby in Project Euler is the perceived performance benefit of more low-level languages. That is a complete red herring. When I was writing assembly language in high school, I read Michael Abrash's "Graphics Programming Black Book". One thing I learned was that extremely optimized assembly language will only net you a 2x to 10x improvement over compiled code. "Isn't that a lot?" you say? No - it is nothing compared to the gains you can usually get from improving your algorithms - often orders of magnitude. Using an expressive high-level language like Ruby allows you to concentrate on the high-level algorithms. I was especially proud of my solution to problem 15. I first broke the problem down with recursion. When that ran for more than a minute I realized I was solving the same problem more than once so I memoized it. Now it runs in less than a second even on my super-slow EeePC. Of course, there are much more elegant solutions out there - my solution that I was so proud of was far from the best. It just goes to show how far a good algorithm can take you.

This turned out to be a rambling post - sorry about that. In conclusion, Git and github are great, Project Euler is fun and algorithms matter. Good night!

Thursday, August 28, 2008

Banish XSS Forever!

The two biggest classes of web site vulnerabilities are SQL injection and Cross-Site Scripting (XSS). The ever-vigilant programmer can reduce the occurence of these problems, but proper architecture can eliminate them completely.

The SQL Problem Has Been Solved

I couldn't stop myself from referencing the XKCD SQL injection example. But for coders using prepared statements or parameterized SQL, SQL injection is already a bad memory. The problem of unsanitized data getting into your SQL can be avoided by following one rule: Never put any data into SQL. Instead of this:

$sql = "SELECT id FROM usertable WHERE username = '".$uname."';";
$result = pg_query($sql);

You do this:

$sql = 'SELECT id FROM usertable WHERE username = $1';
$result = pg_query_params($sql, array($uname));

Now when a script kiddie pastes some exploit code into my account sign-up page, he can't drop my tables or read all my password hashes. Little Bobby Tables can grow up to live a rich life without incurring the wrath of any more schools.

The XSS Problem is the Same Problem

SQL injection happens when unescaped data is interpreted by your SQL parser. XSS happens when unescaped data is interpreted by the client web browser. It's really the same problem, and the solution is pretty much the same.

I said "pretty much" because in the case of SQL, all the main relational databases have added APIs to allow "data" to be sent seperately from the "query". I don't see browser makers doing anything analogous. But you don't need your database to support parameterized queries natively - you can use a wrapper library that does it for you. Likewise, web site developers can adopt web frameworks that enforce the seperation of markup and text on the server side.

DOM View to the Rescue?

Stay with me here. The "traditional" way to build web pages programmatically was top-to-bottom. First you send your http headers, your DTD, your opening <html> and <head> tags, etc. until you finally send the closing </html> tag at the end. We're treating the web page as a string of text. Just because it is text on the wire doesn't mean we need to treat it that way. We should treat it as a tree, just like the browser on the other end does after it parses all that HTML.

Anyone who writes JavaScript is familiar with the Document Object Model (DOM) that is the web browser's internal representation of the web page. The page is a tree, with the <html> element at the root, and every element or chunk of text as a node. I propose that we write web frameworks that build web pages as trees - never having the programmer output any markup.

This system would start out with a basic empty web page.

html
|-head
| \-title
\-body

Set the title (please don't mind the syntax of my psuedo-code):

head.title.appendtext("Hello World")

html
|-head
| \-title
|   \-textnode value="Hello World"
\-body

Add a footer

body.appenddiv(class="footer").appendtext("© 2008 Me")

html
|-head
| \-title
|   \-textnode value="Hello World"
\-body
  \-div class="footer"
    \-textnode value="© 2008 Me"

Add our main content:

maindiv = body.prependdiv(class="maincontent")
maindiv.appendh1.appendtext("Hello World!")

html
|-head
| \-title
|   \-textnode value="Hello World"
\-body
  |-div class="maincontent"
  | \-h1
  |   \-textnode value="Hello World!"
  \-div class="footer"
    \-textnode value="© 2008 Me"

And so on. When you're done, your framework generates the HTML or XHTML or what-have-you from the tree and spits it out to the client. No markup is ever generated by hand and text can only go into text nodes or attribute values. The markup generator will automatically make the "©" for the copyright symbol and knows how to escape quotes embedded in attribute values as well.

This is the solution I propose - not the sisyphean task of "remember to escape your strings". Maybe I'll build it.

Stack Overflow

I have been having so much fun with the beta of Stack Overflow - a new programmer help site by Jeff Atwood and Joel Spolsky.

If you're interested in the beta, you can get in here (scroll down a bit). Hopefully it will be live to the world soon. I expect scaling issues (not that it's had any yet, but I expect massive popularity).

Wednesday, July 30, 2008

In Praise of jQuery

At work I'm starting to weave jQuery into the interface to our job database. Until now my Javascript needs have been limited to a menu at the top of the screen. Now I am trying to create a time entry interface, and it is complicated.

Each employee needs to be able to say "I worked on task alpha from 8:00 to 10:10, beta from 10:10 to 12:00 and gamma from 13:00 to 17:00". It needs to be easy to use, give them a list of the tasks that they are working on to choose from and make sure that their times don't overlap. This is all to say that some more advanced scripting is called for. I hacked around with vanilla Javascript for a while, but soon decided to check out one of the free libraries available.

Enter jQuery, and I could not be happier. The core feature of jQuery is that you can select elements in your document by xpath or CSS-style selectors. This alone is well worth the price of entry. For instance, originally I had to do a "document.getElementById("timebar")", use ".getElementsByTagName("div")" on that, and then check each element's class as I looped through them to remove a class. Now I just write: "$("#timebar .selentry").removeClass("selentry")". That's an if statement, a loop and a lot of long DOM method names I didn't need.

jQuery gives you all sorts of additional nice things - most of which I haven't tried yet. It includes:

Easy creation and insertion of new elements into your document tree.
Add, modify and remove attributes.
Modify styles on an element without wiping out unrelated styles.
Easily add event handlers with cross-browser problems smoothed out for you.
Iterators.
Handle ajax requests easily.
Plugins to handle even more things. (I was able to easily drop the date picker into my existing date fields.)

I resisted using a Javascript library for a long time. I thought they were just for fancy ajax and slide-in effects. I also worried that they would force all my scripts to work inside their framework. With jQuery at least, my fears were unfounded. Even if you just use it to find elements in your document tree it will be extremely helpful.

Tuesday, March 25, 2008

Bug Zero

Bug zero is that the program does not exist. Fix that bug first.

The Legacy of Microsoft Office

I'm a little bit late to this particular dance, but I feel like I have to say something about it.

Last month, Microsoft released the specs for its old binary Microsoft Office file formats. Lots of people thought this would usher in a new age of competing office suites. It will not. Microsoft Office binary file formats are monumentally, hopelessly complex.

Joel (on Software) Spolsky is a veteran of the Excel team, and he wrote a wonderfully insightful post about why this is so. He writes:

A normal programmer would conclude that Office's binary file formats:
are deliberately obfuscated
are the product of a demented Borg mind
were created by insanely bad programmers
and are impossible to read or create correctly.
You'd be wrong on all four counts.

The crux of Joel's point is that the formats were: designed around hardware constraints two decades ago; grown to account for every new feature since; all while maintaining 100% backward compatibility.

Joel is absolutely right about how the file formats got that way, of course. That doesn't change the fact that they are absolutely terrible file formats today. The startling thing is that Microsoft's new XML-based file formats are really just as bad. Microsoft has gone through a lot of trouble to translate all the data structures into new XML equivalents, but they still have all of the cruft of the old formats. This is not a system that can continue indefinitely.

All software with a long enough life faces this demon eventually. Backwards compatibility is a feature. It is an asset to software users who have old data. As with any feature, it is a trade-off, and comes at a price. That price in Microsoft Office passed the "too high" mark for me a long time ago. Likewise with Windows. I am more willing than most users to move to different systems, but everyone has a limit.

The greater lesson is that the cost of backwards compatibility should be carefully considered. In the end, extremely backwards-compatible software is just... extremely backwards software.

Monday, March 3, 2008

A Dramatic Reversal

Microsoft's plan with IE8 was to make it use IE7's less-correct rendering by default, unless developers asked for "IE8 mode" explicitly. This was bad. They have now reversed their position. Money quote:

Now, IE8 will show pages requesting “Standards” mode in IE8’s Standards mode. Developers who want their pages shown using IE8’s “IE7 Standards mode” will need to request that explicitly

Wednesday, February 13, 2008

Maintainting Heirarchies in SQL

I've refined my thinking a little about the optimal way to keep track of tasks in a relational database. Here is the newer version:

CREATE TABLE task (
 id BIGSERIAL PRIMARY KEY,
 parentid BIGINT REFERENCES task (id),
 taskname varchar(50) NOT NULL
 CONSTRAINT name_not_empty
 CHECK (taskname <> '')
);
CREATE TABLE leaftask (
 id BIGINT PRIMARY KEY REFERENCES task (id),
 weight INT4 NOT NULL DEFAULT 1000,
 starttime TIMESTAMPTZ,
 endtime TIMESTAMPTZ,
 CONSTRAINT startbeforefinish
 CHECK ((starttime < endtime) or (endtime IS NULL))
);

I want my tasks to follow certain rules:

All sub-tasks of a given parent task should add up to 100% of the parent task.
A task can never have exactly one sub-task because - according to the rule above - the subtask would by synonymous with its parent.
All tasks that have no descendants - and none that don't - should have a corresponding "leaftask" record.
All tasks must have a final root that has no parent - i.e. no looping back on your descendants.

I want to build structures into the database that will prevent invalid states (as defined by the above rules) from occurring. Today I am going to tackle the problem of a task descending from itself. To prevent that case, I need what in PostgreSQL is called a trigger. I'm getting ahead of myself though. First I need a way to list the descendants of a given task. Since I want it to happen all in the database, I need a stored function.

Stored functions in PostgreSQL can be written in many languages. I'm going to write mine in PL/pgSQL, a sort of SQL-related procedural language, because it is included by default with PostgreSQL. (PL/pgSQL is intentionally similar to Oracle's PL/SQL, if you're an Oracle person.)

CREATE OR REPLACE FUNCTION taskdescendants(parenttaskid bigint, OUT decid bigint)
 RETURNS SETOF bigint AS $fun$
BEGIN
 FOR decid IN SELECT id FROM task WHERE parentid = parenttaskid LOOP
  RETURN NEXT;
  FOR decid IN SELECT * FROM taskdescendants(decid) LOOP
   RETURN NEXT;
  END LOOP;
 END LOOP;
 RETURN;
END;
$fun$ LANGUAGE plpgsql;

This function will return the ids (each as a separate row) of any descendants a given task might have. I can use it in my SQL statements as if it were a table. For example, this will select all tasks descending from task 57:

SELECT decid, taskname FROM taskdescendants(57) INNER JOIN task ON decid = id;

That's all well and good for a generic function, but trigger functions have slightly different requirements. This will check that the new parentid doesn't show up in the list of descendants:

CREATE OR REPLACE FUNCTION taskrecursioncheck() RETURNS trigger AS $fun$
BEGIN
 IF NEW.parentid IN (SELECT decid FROM taskdescendants(NEW.id)) THEN
  RAISE EXCEPTION 'task cannot descend from one of its own descendants';
 END IF;
 IF NEW.parentid = NEW.id THEN
  RAISE EXCEPTION 'task cannot descend from itself';
 END IF;
 RETURN NEW;
END;
$fun$ LANGUAGE plpgsql;

CREATE TRIGGER parent_trap BEFORE UPDATE ON task
 FOR EACH ROW EXECUTE PROCEDURE taskrecursioncheck();

There - now it raises an error on invalid relationships and I got a handy function that my front end can call. Later I will have to figure out some way to ensure my other rules are followed.

/dev/neal