Sunday, July 10, 2011

Automata for the People

Digital communication protocols are built up in layers. In the familiar case of the Web, HTML content is delivered on top of HTTP, which in turn is carried by IP, which is handled by TCP, which rides atop some hardware protocol like Ethernet, cable, or DSL. This protocol lasagne is called a network stack.

I designed my Telnet NVT around the idea that Telnet should be treated like a transport layer, and content should be handled separately in a higher layer. I'm glad I wrote it this way, because when I tried to work with the content I was reading from my NVT, I found it full of control codes for another protocol: VT220 terminal emulation.

VT220 terminals are everywhere. The Linux command-line telnet, the popular cross-platform PuTTY client, and apparently most other "Telnet" clients actually use VT220 control codes for things like the arrow and delete keys. (Even the Linux system console is a VT220 emulator.) In their original forms, the Telnet and VT220 protocols were different ways of accomplishing similar things. But now, it seems, they go hand in hand... and VT220 does most of the work.

To move ahead with my project, I only need to recognize VT220 control codes so I can strip them out and throw them away. I don't even care what control codes I'm removing. My first idea was to use regular expressions to match them. But the catch is that their lengths aren't fixed like Telnet commands, so I would never know if the sequence of characters I'm examining doesn't match because it's not complete yet, or because it's not a VT220 control code. Java's regular expression library doesn't allow me to test for partial matches, so my program could waste a lot of time trying to match character sequences up to some arbitrary length, only to have them turn out to not be VT220 control sequences at all.

Finite automata to the rescue! Automata are the foundation of regular expression matching. By working with automata directly, I'll be able to tell if a sequence of characters is the beginning of one or more matches, and quickly pass it up to the next layer if it isn't. I need to study it a bit more before I make a decision, but I think I'm going to use the BRICS automata library for this task.

Hopefully I won't run into any more surprises like this, and by next week I'll have something you can see.

No comments:

Post a Comment