CGI Demystified

Abstract

CGI is a means of developing dynamic content for the Web. It was widely used at the birth of the World Wide Web in the mid-1990s. Since then, better technologies have been created, easing the development of more complex and scalable web applications. CGI, however, is still used for HTML forms and simple dynamic content. This article explains what CGI is through a C++ implementation. (January 2004).

In the early days of the Web, each HTTP server offered its own mechanism for server-side executable support. This is the functionality that allows a server to call upon an executable program residing on the server to assist in fulfilling an HTTP request. In order to provide a standard interface so that developers could write general server-side programs for any HTTP server, the National Center for Supercomputing Applications (NCSA) developed version 1.0 of the CGI. The current version is CGI/1.1.

What is CGI?

The Common Gateway Interface (CGI) [1] is a standard, not a specification. Contrary to what many people believe, CGI is not a programming language either. It defines how input information is passed from user-agents, such as web browsers, to executable programs on the server, and how the latter send data back. Consequently, programs conforming to the CGI standard can be written in any programming language, as long as the server can execute them. They may be written in compiled languages (CGI programs) such as C, C++ or Pascal, or in scripting languages (CGI scripts) such as Perl, Python or shell scripts. In the remainder of this document, the term gateway program is used to refer to CGI executables, whether they are compiled programs or scripts.

Passing Parameters

One way to pass parameters to a program executed from the command line is using command line options and arguments. However with CGI, command line options and arguments cannot be sent directly to the program. Instead, parameters are passed through environment variables. Table 1 lists the environment variables defined by CGI that the server sets when executing a gateway program.

Additionally, the HTTP request header fields [3] are placed into environment variables whose names consist of the prefix HTTP_ followed by the header name in upper case. Any hyphen (-) in the header field name is changed to an underscore (_) in the environment variable name. The server may exclude any header that it has already processed, such as Content-Type and Content-Length.
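The naming rule for header-derived variables can be sketched in a few lines of C++. The function name headerToEnvName is a hypothetical helper introduced only for illustration; it is not part of CGI or of any library.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: maps an HTTP request header name to the name of the
// environment variable a CGI server would use for it.
// Rule: prefix "HTTP_", letters in upper case, every '-' replaced by '_'.
std::string headerToEnvName(const std::string& header) {
    std::string name = "HTTP_";
    for (unsigned char c : header) {
        name += (c == '-') ? '_' : static_cast<char>(std::toupper(c));
    }
    return name;
}
```

For instance, headerToEnvName("User-Agent") yields HTTP_USER_AGENT.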

Let's examine these environment variables through an example. Suppose that I use a Mozilla browser on a PC and execute a fictitious CGI script on my server via the link below:

http://www.nwsummit.com/cgi-bin/script.cgi/some/dir?param1=val1&param2=val2

The name of the PC is myPC and its IP address on the local network is 10.0.0.127. The Example column in the table below shows the value of each environment variable for this request.

**Table 1** - CGI environment variables.
Environment Variable	Description	Example
Server specific environment variables
SERVER_SOFTWARE	Name and version of the server running the gateway.	Apache/1.3.20 (Unix)
SERVER_NAME	The server's hostname, DNS alias or IP as it would appear in self-referencing URLs.	www.nwsummit.com
GATEWAY_INTERFACE	CGI revision run by the server.	CGI/1.1
Request specific environment variables
SERVER_PROTOCOL	Name and revision of the protocol used to send the request.	HTTP/1.1
SERVER_PORT	Port number to which the request was sent.	80
REQUEST_METHOD	Method with which the request was made. For HTTP, the methods are GET, POST, PUT etc.	GET
PATH_INFO	Extra path information as sent by the client. This is the part after the gateway program's name and before the question mark '?' in the URL. It is usually sent as extra context-specific information to the gateway programs. For instance, one could use it to transmit file locations.	/some/dir
PATH_TRANSLATED	The actual path corresponding to PATH_INFO. The server translates the virtual path relative to the document root provided in PATH_INFO to the actual physical path.	/document-root/some/dir
SCRIPT_NAME	Virtual path (as appears in the URL) to the gateway program being executed. It is usually used for self-referencing.	/cgi-bin/script.cgi
QUERY_STRING	Information following the question mark '?' in the URL. It contains the parameters sent by the user-agent.	param1=val1&param2=val2
REMOTE_HOST	Name of host from which the request was issued.	myPC
REMOTE_ADDR	IP address of the remote host making the request.	10.0.0.127
AUTH_TYPE	If the server supports user authentication and the gateway program being executed is protected, this variable contains the authentication method used by the server to validate the user.	N/A
REMOTE_USER	If the server supports user authentication and the gateway program being executed is protected, this variable contains the username of the authenticated user.	N/A
REMOTE_IDENT	If the server supports RFC 931 identification [2], this variable contains the remote username retrieved from the server.	N/A
CONTENT_TYPE	For requests that have attached information, such as HTTP POST, this variable specifies the content type of the data.	N/A
CONTENT_LENGTH	For requests that have attached information, this variable specifies the length of the data in bytes.	N/A
HTTP header related environment variables
HTTP_COOKIE	If the server has set a cookie, its value is returned to the server via the Cookie header. The value of the Cookie header, and thus of this environment variable, is of the form: name1=string1; name2=string2 ...	N/A
HTTP_REFERER	The Referer request header specifies the address (URI) of the resource from which the request URI was obtained; in other words, the referrer (the header field name is misspelled!).	http://www.nwsummit.com/tech/cgi.html
HTTP_USER_AGENT	Name and version of the user agent (e.g. browser) originating the request.	Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007
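As a minimal sketch of how a gateway program might read the variables in Table 1, the snippet below wraps std::getenv. The helper names cgiEnv and selfPath are assumptions made for this example. Variables that the server did not set come back as a null pointer from std::getenv, so the wrapper substitutes a fallback value.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: value of a CGI environment variable, or `fallback`
// when the server did not set it (std::getenv returns a null pointer then).
std::string cgiEnv(const char* name, const std::string& fallback = "") {
    const char* value = std::getenv(name);
    return value ? std::string(value) : fallback;
}

// Hypothetical helper: path part of a self-referencing URL, built from the
// virtual path of the program followed by the extra path information.
std::string selfPath() {
    return cgiEnv("SCRIPT_NAME") + cgiEnv("PATH_INFO");
}
```

With the example request above, selfPath() would return /cgi-bin/script.cgi/some/dir, and cgiEnv("REMOTE_USER", "N/A") would return N/A because the script is not protected.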

Parsing Input

For the GET method, input data are placed in the QUERY_STRING environment variable. For the POST and PUT methods, gateway programs must read input data from their standard input (stdin). The length of the data to read is specified by CONTENT_LENGTH and the type of the data is specified by CONTENT_TYPE.
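The choice of input source can be sketched as follows. This is an illustration under stated assumptions, not the implementation discussed later in this article: the function name readRawInput, its parameters and the 1 MB cap kMaxBody are all hypothetical. The values of REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH are passed in as arguments so that the function can be exercised without a web server.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the raw (still URL-encoded) request input.
// method, queryString and contentLength mirror the environment variables
// REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH; `in` stands for stdin.
std::string readRawInput(const std::string& method,
                         const std::string& queryString,
                         const std::string& contentLength,
                         std::istream& in) {
    if (method == "GET") {
        return queryString;               // GET: data travel in the URL
    }
    if (method == "POST" || method == "PUT") {
        // strtoul yields 0 for an empty or non-numeric CONTENT_LENGTH.
        unsigned long length = std::strtoul(contentLength.c_str(), nullptr, 10);
        const unsigned long kMaxBody = 1024UL * 1024UL;  // arbitrary cap (assumption)
        if (length > kMaxBody) {
            length = kMaxBody;
        }
        std::string body(length, '\0');
        in.read(&body[0], static_cast<std::streamsize>(length));
        // Keep only the bytes that were actually available.
        body.resize(static_cast<std::size_t>(in.gcount()));
        return body;
    }
    return std::string();                 // other methods: no input handled here
}
```

In a real gateway program the first three arguments would come from std::getenv and the last one would be std::cin.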

When information is sent from HTML forms, the input data are sent as parameter/value pairs separated by an ampersand '&', each pair being of the form name=value. The value part of each pair is URL encoded, i.e. spaces are replaced with pluses and some other characters such as '=', '&' and '%' are encoded as their hexadecimal value preceded by '%'. For instance, if an input element is named "title" and its value is "Tom & Jerry", then its parameter/value pair is URL encoded as "title=Tom+%26+Jerry".
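To make the encoding rule concrete, here is a small sketch of what a user-agent does to a form value before sending it. The function formEncode is a hypothetical name, and the set of characters left unencoded (letters, digits, '-', '_', '.' and '*') is an assumption based on common browser behaviour for form submissions. The ampersand has ASCII code 0x26, so it becomes %26.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: URL-encodes one form value.
// Spaces become '+', letters, digits and a few safe characters pass through,
// and everything else becomes '%' followed by two upper-case hex digits.
std::string formEncode(const std::string& text) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '*') {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            out += '%';
            out += hex[c >> 4];      // high nibble
            out += hex[c & 0x0F];    // low nibble
        }
    }
    return out;
}
```

Calling formEncode("Tom & Jerry") gives Tom+%26+Jerry.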

Thus parsing form data consists in parsing the parameter/value pairs from either the QUERY_STRING environment variable or the standard input stream, depending on the method employed by the user-agent to send data.

Returning Data

To return data to the user-agent that originated the request, gateway programs must send their output to the standard output (stdout). The output is interpreted by the server and then sent back to the user-agent. The output must begin with a header section followed by the content returned (if any). The header consists of text lines in the same format as HTTP headers [3], terminated by a blank line (a line with only a linefeed or CR/LF). Any headers that are not server directives are sent directly back to the client. CGI/1.1 defines three server directives:

Content-type: The MIME type of the document returned.
Location: This directive is used to redirect the user-agent to the URI specified as value of the directive. The gateway program is not returning data, but rather a reference to another document or resource.
Status: This directive is equivalent to the status line in the HTTP response header [3]. Its value must be one of the defined HTTP status codes.
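The output format described above can be sketched with two small functions. The names writeCgiResponse and writeCgiRedirect are hypothetical, and the functions take an output stream as a parameter so they can be tested; a gateway program would pass std::cout.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes a minimal CGI response, i.e. a Content-type
// directive, the blank line that ends the header section, then the content.
void writeCgiResponse(std::ostream& out,
                      const std::string& contentType,
                      const std::string& body) {
    out << "Content-type: " << contentType << "\n";
    out << "\n";                 // blank line terminates the header section
    out << body;
}

// Hypothetical helper: writes a redirection. Only the Location directive is
// sent; no content follows the blank line.
void writeCgiRedirect(std::ostream& out, const std::string& uri) {
    out << "Location: " << uri << "\n\n";
}
```

The server fills in the remaining parts of the HTTP response, such as the status line, before forwarding the output to the user-agent.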

Since their output is interpreted by the server, gateway programs do not have to send a full HTTP header in their response. However, there is an overhead in the server parsing the output. To avoid such overhead and to talk directly to the user-agent, gateway programs must have their name start with nph-, in which case, they are responsible for sending a valid HTTP response to the user-agent.
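By contrast, a non-parsed-header (nph-) program has to produce the whole HTTP response itself, starting with the status line. The sketch below is only an illustration; the function name writeNphResponse and the fixed text/html content type are assumptions.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes a complete HTTP response, as an nph- program
// must, since the server forwards its output without interpreting it.
void writeNphResponse(std::ostream& out, const std::string& body) {
    out << "HTTP/1.0 200 OK\r\n"                       // status line
        << "Content-Type: text/html\r\n"
        << "Content-Length: " << body.size() << "\r\n"
        << "\r\n"                                      // end of headers
        << body;
}
```

Note the CR/LF line endings, which HTTP requires and which the server would otherwise have taken care of.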

A C++ Implementation

I wrote this object-oriented implementation of CGI back in 1996. The class interface (listing 1) is quite simple. It only supports the HTTP GET and POST methods, and only the first value of a multi-value parameter can be returned.

The structure _varval defines a variable/value pair. The two supported methods are defined by the enumeration cgi_type. The class only has a few methods:

method() returns the method used by the incoming HTTP request.
exist() returns the number of request parameters, and thus indicates whether there are incoming request parameters.
exist(char*) returns 1 if the specified parameter exists; 0 otherwise.
value(char*) returns the value of the specified parameter; NULL if it does not exist.
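As a rough illustration of the interface described above (not the author's actual listing), the sketch below reproduces the four methods with the stated return conventions. The type names CgiMethod, VarVal and CgiRequest are hypothetical stand-ins, and the constructor takes already-parsed pairs, which is an assumption made to keep the example short.

```cpp
#include <string>
#include <vector>

enum CgiMethod { CGI_GET, CGI_POST };   // hypothetical stand-in for cgi_type

struct VarVal {                         // hypothetical stand-in for _varval
    std::string var;
    std::string val;
};

class CgiRequest {                      // hypothetical class name
public:
    CgiRequest(CgiMethod method, const std::vector<VarVal>& pairs)
        : method_(method), pairs_(pairs) {}

    // Method used by the incoming HTTP request.
    CgiMethod method() const { return method_; }

    // Number of request parameters; non-zero means there is input.
    int exist() const { return static_cast<int>(pairs_.size()); }

    // 1 if the named parameter exists, 0 otherwise.
    int exist(const char* name) const { return value(name) ? 1 : 0; }

    // Value of the first parameter with this name, or a null pointer.
    const char* value(const char* name) const {
        for (const VarVal& p : pairs_) {
            if (p.var == name) {
                return p.val.c_str();
            }
        }
        return nullptr;
    }

private:
    CgiMethod method_;
    std::vector<VarVal> pairs_;
};
```

Because value() stops at the first match, a parameter that appears several times in the request only exposes its first value, which matches the limitation mentioned earlier.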

Listing 2 shows the actual implementation of the class.

In order to parse the input, we need to be able to decode URL encoded input data. Decoding consists in transforming escaped characters from the input back to their ASCII form. An escaped character is of the form "%HH", where HH is a 2-character string representing the hexadecimal value of the original ASCII character. Decoding is done here through the helper function x2c (lines 12-20), which takes a string representation of a hexadecimal value (i.e. the HH part of the escaped character) and returns the corresponding ASCII character (i.e. the original character).
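A helper of this kind can be sketched as follows. This is an assumed reconstruction of the idea, not the code from the listing: the auxiliary function hexValue is hypothetical, and treating a non-hexadecimal character as 0 is a simplification.

```cpp
#include <cctype>

// Hypothetical helper: numeric value of one hexadecimal digit.
// Non-hexadecimal characters yield 0 (a simplification for this sketch).
int hexValue(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (std::isdigit(u)) {
        return u - '0';
    }
    if (std::isxdigit(u)) {
        return std::toupper(u) - 'A' + 10;
    }
    return 0;
}

// Converts the two characters "HH" of an escaped "%HH" sequence into the
// character they encode, e.g. "26" becomes '&'.
char x2c(const char* hh) {
    return static_cast<char>(hexValue(hh[0]) * 16 + hexValue(hh[1]));
}
```

The caller is expected to pass a pointer to at least two characters, i.e. the part that follows the '%' sign.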

The meat of the implementation is the parsing of the input data (lines 50-96). The parsing is done in two linear reading passes over the input data. The first pass counts the number of variable-value pairs so that the proper amount of memory can be allocated to store them (lines 56-59). The count is based on the number of '&' separators between variable-value pairs.
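The counting pass can be illustrated with a short function. The name countPairs is hypothetical, and returning 0 for empty input is an assumption of this sketch: n separators delimit n + 1 pairs, provided there is any input at all.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of variable-value pairs in URL-encoded input,
// computed as the number of '&' separators plus one.
std::size_t countPairs(const std::string& input) {
    if (input.empty()) {
        return 0;
    }
    std::size_t count = 1;
    for (char c : input) {
        if (c == '&') {
            ++count;
        }
    }
    return count;
}
```

Counting raw '&' characters is safe because an ampersand inside a value is always sent in its escaped form.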

The second reading pass extracts the variable-value pairs and stores them in the previously allocated array of varval's (lines 62-96). As characters are read from the input, they are copied to the temporary buffer tmp. When URL encoded characters are encountered, they are decoded before being copied to the temporary buffer (lines 66-72). When a '=' or '&' character is encountered, the content of the temporary buffer is saved respectively as a request variable or value (lines 73-85), as all characters before '=' constitute the variable and all characters after '=' constitute the variable's value.
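The extraction pass can be sketched in modern C++ as shown below. This is an illustration of the technique rather than the listing itself: it swaps the preallocated array for a std::vector (so no counting pass is needed), and the names parsePairs and Pair are hypothetical. Escaped sequences are decoded as they are read, so a decoded '&' or '=' lands in the temporary buffer and is never mistaken for a separator.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> Pair;   // (variable, value)

// Hypothetical helper: splits URL-encoded input into decoded pairs.
std::vector<Pair> parsePairs(const std::string& input) {
    std::vector<Pair> pairs;
    std::string tmp;        // temporary buffer for the token being read
    std::string var;        // variable name once '=' has been seen
    bool inValue = false;   // true while reading the value part

    // Stores the current pair, if any, and resets the state.
    auto flush = [&]() {
        if (inValue) {
            pairs.push_back(Pair(var, tmp));
        } else if (!tmp.empty()) {
            pairs.push_back(Pair(tmp, std::string()));  // name without value
        }
        tmp.clear();
        var.clear();
        inValue = false;
    };

    for (std::size_t i = 0; i < input.size(); ++i) {
        const char c = input[i];
        if (c == '%' && i + 2 < input.size() &&
            std::isxdigit(static_cast<unsigned char>(input[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(input[i + 2]))) {
            // Escaped character: decode the two hex digits after '%'.
            const std::string hex = input.substr(i + 1, 2);
            tmp += static_cast<char>(std::strtol(hex.c_str(), nullptr, 16));
            i += 2;
        } else if (c == '+') {
            tmp += ' ';                 // '+' encodes a space
        } else if (c == '=' && !inValue) {
            var = tmp;                  // everything before '=' is the variable
            tmp.clear();
            inValue = true;
        } else if (c == '&') {
            flush();                    // end of the current pair
        } else {
            tmp += c;
        }
    }
    flush();                            // last pair has no trailing '&'
    return pairs;
}
```

For the input title=Tom+%26+Jerry&param2=val2, the function returns two pairs: (title, Tom & Jerry) and (param2, val2).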

With the parsing function implemented, the constructor (lines 22-48) consists of determining the request method, getting the input from the right source depending on the method, then calling parseInput to extract the variable-value pairs.

Conclusion

As with anything, one of the keys to successful development is using the right tool for the right project. Even though we now have more sophisticated technologies than CGI, such as J2EE or .NET, it does not make sense to use them to develop a quick HTML form for sending emails or collecting feedback. This explains why a great number of CGI scripts are still in use.

One needs to remember, though, that once put on the Web, any gateway program is available for the whole world to execute, including malicious people. Therefore, extra care should be taken regarding security. The Security FAQ [4] from the World Wide Web Consortium covers CGI security extensively.

The CGI implementation presented herein shows how request parameters can be retrieved and parsed. Although it is quite efficient, it lacks features such as the retrieval of multi-value request parameters.

References

[1] CGI: "Common Gateway Interface", NCSA, 1995.
[2] RFC 931: "Authentication server", M. StJohns, January 1985.
[3] HTTP: "Hypertext Transfer Protocol - HTTP/1.1", R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach and T. Berners-Lee, June 1999.
[4] WWW Security FAQ: CGI Scripts: "The World Wide Web Security FAQ", Lincoln Stein & John Stewart, February 2003.

Last updated: 2004-09-14 18:35:40 -0700