
Extending the Integrated Environment


Now that the essential structure of this data entry scheme is in place, I'm ready to start actually pulling the records into the database. Or am I? Let's think about this for a minute. The script as it exists now gets a number of records from the user, writes them to a file, and then loops back to do the same thing all over again. Obviously, with this structure we're not going to have more than one user entering at the same time ... I'd have to make the script pause or they'd be stomping all over that output file before the records were actually in the database. Even if I were to do something really clunky and have users run slightly different versions of the script that specify different destination files, there is no guarantee that the records would be saved before another iteration of the script run by any given user came back to the point of writing the file. I could have the script actually insert the records, but as I've discussed before and will again shortly, I've decided that I don't want to place the overhead of waiting for that to happen on the execution of the script. (Put another way, I don't want the user to have to wait for the addition to the database to finish.) What I should do is figure out some coherent scheme for writing a unique file each time around.


To do that, it is clear that the script is going to have to dynamically generate the filename under which the records will be saved each time around. The wrinkle that this adds to the script lies in the operation of taint checking. If you recall, taint checking attempts, as much as possible, to make sure that values not generated by the script cannot be used by a malicious user. The process of taint checking essentially involves putting a level of perl insulation on top of the submitted value so it can't be used to execute a command on the server. I may get further into the actual process of taint checking at some point, but for now what we want is a perl mechanism to specify how things should be written in a fashion that co-exists with taint checking.


The scheme ultimately used to assign the filename is the third I've tried; I'll go through the first two just as an illustration of the path development can take. My first shot involved reading the IP address in the REMOTE_HOST environment variable and using that as the basis for the filename. Since this comes from the environment, and is external to the script, I ran into difficulty with taint checks when I attempted to create the filehandle (the hook to the specified external file that perl uses). This difficulty was ultimately resolved through the use of the FileHandle::Deluxe module, but as I worked through the problem it occurred to me that, since it is possible to spoof an IP address in an http request, it would be possible to store a command in that variable. While that possibility might be relatively remote, most security vulnerabilities are based on problems far more obscure. Given that, I determined to adopt another approach.


My second approach involved simply using a counter scalar that would have been progressively incremented as new users accessed the script. This was and is a workable solution that I had trouble implementing for reasons I ultimately found to be attributable to problems with the Ralphzilla system. In the process of trying to resolve that problem I derived a third approach that I felt was likely to operate more quickly than the counter approach, and the following discussion revolves around that one. I would suggest that sometimes things work that way. As you grapple with difficulties in implementing some type of structure you find yourself approaching it from different angles, and it is in such contexts that you derive more sophisticated and elegant approaches that you feed back into your overall way of doing things. I know this sounds silver-lining optimistic, but the times when the straightforward solution you had envisioned does not work as expected are also the times when you are forced into looking at things from different angles. If you can refrain from letting anxiety dictate your approach in such circumstances, you'll find that these are the times when your overall productivity makes the most dramatic improvements. It just doesn't feel like it at the time <grin>.



#!/usr/bin/perl -wT
##in essence, this script creates a scorekeeping
##interface for the baseball database

BEGIN {
# Set the DISPLAY variable to the name of the local machine
# where the debugger window and web browser appear.
$ENV{DISPLAY} = "mymachine:0" ;
}


##set up modules and pragmas
use strict;
use CGI qw (:all);
use FileHandle::Deluxe qw (:all);
##instantiate a new CGI object and retrieve key parameters ...
##if there is no action parameter, execute the get_session subroutine
my $forms=new CGI;

my $item;
my ($action,$session_id,$session_file,$pass);
$session_file='/home/www/sessions';
if (! $forms->param('action')) {
get_session();
}
elsif ($forms->param('action')) {
$action=$forms->param('action');
$session_id=$forms->param('session_id');
$pass=$forms->param('pass');
}

Now obviously the structure of the beginning of the script must be modified to incorporate the use of some sort of session identifier. From the perspective of this script and the way it operates I could just as well have generated a separate session identifier each time a screen of data is entered, but in most circumstances it would be appropriate for the entire entry session to be regarded as a single entity. So my approach takes that as a design requirement.




sub get_session {
$action='start';
$pass=1;
my ($session, $line);
$session=FileHandle::Deluxe->new($session_file,append=>1);
$session->autoflush(1);
my $session_test=0;

while (! $session_test) {
$session_id=int(rand(100000)*10000000);
if ($session !~ /^\s$session_id/) {$session_test=1;}
}

my $out_line=$session_id."\n";
print $session $out_line;
$session->close;

}

In any event, an early line in the previous incarnation of the script assigned the value "start" to the scalar $action if the CGI parameter "action" was undefined. In this script that condition triggers the execution of the get_session subroutine, which is of course new to this implementation. As I began to work on this version I realized that while it is desirable to maintain the state of the session throughout the entry process, it is also desirable to write the records from each pass into a file with a unique name, because that gives me the most flexibility in how I can go about entering the records stored in these ascii files into the database. The ramifications of that will become apparent shortly; right now I just want to explain why the second line of the subroutine initializes the scalar $pass with the value "1". It gives me a way to distinguish any given entry screen from the screen before it and the screen after it. After initializing the scalars $session and $line, the script uses the FileHandle::Deluxe module to create a filehandle object on the file that holds session ids, with append rights, which means that, if it is written to, the referenced file will be created if it does not already exist. As the scalars holding the filename and the data to be written to this file are generated within the script, they are not subject to subversion. Therefore, the script could use the standard form of filehandle creation without complaint from taint checking, but since I use the module elsewhere in the script I use it here for consistency. After creating the handle I disable buffering on the filehandle by setting the autoflush attribute to 1. What this means is that data will be written to disk as the print statement is executed, rather than being buffered in memory and only written periodically or when the filehandle is closed. If you think about this a little the reasoning will be apparent.
The session id is used to maintain the state of an individual browser session. As each client starts a session it will execute this subroutine to generate a unique id for that session. If buffering is not turned off, it is possible that the same id will be generated twice, and if the first is not yet written the subroutine will not catch that. As you'll see shortly, the possibility that this will occur is extraordinarily unlikely, but that is the way many software bugs get created ... they come from event structures that are very unlikely.


After creating the filehandle object, the script initializes the scalar $session_test to the value "0". This scalar controls the small while loop that actually generates the value to be assigned to the $session_id scalar. Within that loop, which executes while the value of the $session_test scalar is "0", a value is assigned to the $session_id scalar by calling the perl rand() function with an argument of 100000. This generates a random fractional number between 0 and 100000, providing the basis of what is stored in the $session_id scalar. As we'll see in a bit, the records are written into a file whose name is constructed primarily from that scalar. Since the generated random number is of the form "99999.9999999", the resultant file was originally named something like "recs99999.9999999.txt". Now a file named like that is perfectly legal in the linux world, as indeed it has been in the windows world since long filenames became supported. However, as I got deeply into the script that reads these files and inserts the records, which I'll be talking about in a little bit, I had some difficulty with the portion of the script that reads in the file names and processes them. Thinking that perhaps something I was using was having difficulty with a file name that included two periods, I multiplied the generated number by ten million and converted the result into an integer (albeit a pretty big one) to use as the session_id. I have left it in this form despite the fact that I ultimately found the culprit in that context to lie elsewhere.
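
The arithmetic described above can be tried in isolation. This is only a sketch: the helper name make_session_id is made up for illustration and does not appear in the script, which performs the calculation inline.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the expression used in get_session.
sub make_session_id {
    # rand(100000) returns a fractional value from 0 up to (but not
    # including) 100000. Scaling by ten million and truncating gives a
    # whole number below 10**12, so the filename has no extra period.
    return int(rand(100000) * 10000000);
}

my $id = make_session_id();
print "recs$id.txt\n";
```

Each run prints a different name, since nothing here seeds or records the generated value.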



I have related that story primarily to suggest that such steps are perfectly appropriate to take when trying to diagnose a problem. One of the first things I look for when I'm trying to resolve a difficulty is any element of the construct that differs from the norm in some regard. It is not at all unusual to discover that an element of the software environment upon which an application rests has difficulty with a specific technique employed in the application. Furthermore, while that difficulty may be resolved by a later version of whatever package had that problem, that later version may be dependent on a software library that creates conflicts with some other component of your system. To generate an example pertinent to Ralphzilla, it's as if an upgrade to fix a problem required a library that did not respond appropriately to requests sent to it by a component of the mosix suite. In the Windows world, applications running on versions prior to Windows 2000 frequently create problems for each other by installing custom versions of common system libraries that don't implement a given function in the manner other applications expect it to be implemented. (Windows XP and, to a lesser extent, Windows 2000, resolve that issue by implementing mechanisms in which multiple copies of those files can co-exist, but that's another discussion.) This sort of thing is absolutely possible in the software world, and perhaps even more so in the open-source part of that world, which relies on cooperation between entities that are "loosely-coupled". While open-source advocates have an extensive set of arguments suggesting that their model of development will tend to evolve more dynamically, it is nonetheless true that for any given permutation of components there is a potential that a specific version of a component will have some level of trouble with something you are trying to implement. Just keep that in mind.
It is quite possible that when you run into a problem, it is not due to a flaw in your work. Break the construct you've created into smaller components, and make sure that those elements are functioning appropriately, that what is returned from a given function call is what you expect to have returned. You may have to develop an alternate expression for what you are trying to do.


After generating the $session_id scalar, a pattern match is executed within an if statement, determining whether that specific $session_id has previously been used and stored in the file referenced by the $session filehandle object. The specific statement used says, in effect, to look through that entire file for an occurrence of the string held in $session_id, and to look for that string on a single line. If the pattern is not there, store "1" to the $session_test scalar, which means that it is okay to continue. Perl's pattern matching and regular expression capabilities are a powerful feature of the language, and more than a few books have been written specifically on that, so I won't attempt an extensive treatment of the subject here. Take a look at the links page; I'll find some good references and post them there. Very briefly, the pattern match operator in perl is "=~", which effectively translates into "contains the pattern following". The negation of that, "does not contain the following pattern", is expressed as "!~", and is what I use here. You may note that I originally set the scalar $session_test to represent a condition that would not let the execution of the script out of the loop, and require that the pattern match not find a $session_id before it can continue. I could have set that value to "1", and had a positive result to the pattern match set $session_test to "0"; the logic of either statement is equivalent to the other in a mirror image fashion. I generally prefer to construct key conditional tests in this manner, because in more complex pattern matches there may be ways in which an expression framed in a way that is just slightly wrong could return a false positive result. It is also easier to debug a context in which execution is never let out of a loop than it is one in which it is always let out of that loop. You may never recognize the latter ... I guarantee you will always recognize the first.<grin>
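
The match operators compare a string against a pattern, so testing a file's contents means reading its lines first. What follows is a minimal sketch of one way to do that with perl's standard open rather than FileHandle::Deluxe; the helper name id_in_use and the throwaway demonstration file are assumptions for illustration, not part of the script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if $id already sits alone on a line of $file.
sub id_in_use {
    my ($file, $id) = @_;
    open(my $in, '<', $file) or return 0;   # no file yet means no ids yet
    while (my $line = <$in>) {
        # \Q...\E quotes the id; ^ and $ anchor it to the whole line
        if ($line =~ /^\Q$id\E$/) {
            close $in;
            return 1;
        }
    }
    close $in;
    return 0;
}

# Demonstration against a throwaway file holding two ids.
my ($demo_fh, $demo_file) = tempfile(UNLINK => 1);
print $demo_fh "12345\n67890\n";
close $demo_fh;

print "12345 is in use\n" if id_in_use($demo_file, 12345);
print "2345 is free\n"    if !id_in_use($demo_file, 2345);
```

Anchoring both ends of the pattern matters: without it, the id 2345 would appear to be taken because it is a substring of 12345.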


Obviously, this pattern match and the file holding used session ids are only relevant as long as at least one other user is accessing the system, but there is no ready way to determine when a given session id is no longer active, beyond constructing some sort of system for recording the time of the last exchange associated with that session id and making some assumptions about the appropriate time-out value for a given session. Given the large number of potential session ids, it would be far less clunky simply to schedule a cron job (an automatic job run by the linux system) to delete this file at a time when no one would be using the system, perhaps at 4am each morning. A cron job like this could ultimately have a wide range of system clean-up functions to perform; I wouldn't be surprised if I revisit the concept sometime down the road. In the absence of that, you could simply delete the file manually before the system comes into use for a given game.
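
A clean-up job of that sort could be as small as the sketch below. The routine name clear_sessions, the script location in the crontab comment, and the throwaway file used for the demonstration are all placeholders; in real use the argument would be the session file named at the top of the script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical clean-up routine. A crontab entry along the lines of
#   0 4 * * * /usr/bin/perl /path/to/clean_sessions.pl
# (the script path is a placeholder) would run it at 4am each day.
sub clear_sessions {
    my ($file) = @_;
    return 1 unless -e $file;        # nothing to remove
    return unlink($file) ? 1 : 0;    # 1 on success, 0 on failure
}

# Demonstration on a throwaway file rather than the live session list.
my ($tmp_fh, $tmp_name) = tempfile();
close $tmp_fh;
print "session list removed\n" if clear_sessions($tmp_name) && !-e $tmp_name;
```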


In the unlikely event that a given session id has been used before, a new id is generated and tested against the contents of the $session filehandle. Once the script has determined that the generated session_id has not been previously used, assigned the value "1" to the $session_test scalar, and allowed execution out of the loop, it constructs a scalar holding the accepted $session_id value with a newline character concatenated to the end of it. This scalar is printed to the $session filehandle object, becoming the next line in the file, and the $session filehandle object is closed. You should recognize that it is generally considered to be bad form to use a temporary scalar to do things like simply holding the line to be written when I could have constructed that line just as well within the print statement. I do that here just for the sake of explicitness; you should feel free to do away with it in your own implementation.


Back in the main body of the script, we now have scalars holding a unique session id, an indication that this is the first pass through the script, and the value "start" in the scalar $action. Just as in the previous incarnation of the script, that value in $action will send the script into the sel_form subroutine. While in this circumstance I could as well have used the value of $pass to do that, I want to retain as much of the structure from the previous incarnation as possible, and using $action represents a cleaner implementation of the main body event handler.


In this incarnation, the sel_form and the get_form subroutines are very much the same as they were in the previous version, the only real difference being that each has lines printing hidden fields holding the values of $session_id and $pass back to the client. The store subroutine, however, has been substantially modified.
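
For readers who have not seen hidden fields before, the markup involved looks roughly like what this sketch produces. The script itself presumably emits the fields through CGI.pm; the helper hidden_field below is a hand-rolled stand-in, invented for illustration, so that the example runs without that module.

```perl
use strict;
use warnings;

# Hypothetical helper: build one hidden form field, escaping the
# characters that could otherwise break out of the attribute value.
sub hidden_field {
    my ($name, $value) = @_;
    for ($name, $value) {
        s/&/&amp;/g;
        s/</&lt;/g;
        s/>/&gt;/g;
        s/"/&quot;/g;
    }
    return qq{<input type="hidden" name="$name" value="$value">};
}

# The two values that must travel with every screen of the session.
print hidden_field('session_id', '123456789012'), "\n";
print hidden_field('pass', 1), "\n";
```

When the form is submitted, both values come back as ordinary CGI parameters, which is how the main body of the script picks them up again.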

sub store 		{
		$forms->delete('action');
		$forms->delete('pass');
		my $file='/home/www/save/recs'.$session_id.$pass.'.txt';
		my $recs=new FileHandle::Deluxe($file,append=>1,safe_dirs=>['/home/www/save'],
									lock=>LOCK_EX) or die "can't open file";

In this version the subroutine begins with explicit deletion of the CGI parameters "action" and "pass", because new values stored in both will of course be key to the state of the session after this subroutine is executed. (Actually, the "action" parameter was present in and deleted in the previous version of the script ... as I was debugging a difficulty in the execution of this script I developed the habit of deleting parameters that I was going to change at the very beginning of the subroutine in which I change them. The deletion is very visible there, so I can be sure the deletion was made.)


At this point I construct a scalar to hold the name of the output file, and use that scalar to create a filehandle object on a file of the specified name. Recall that there were two primary considerations that drove this iteration of the script:

1: Since it is possible that the application will be used in a multiple-user context, the file saved by each user should have a unique name. It was this consideration that led to the incorporation of the session id to uniquely identify any given entry session.

2: Given that I decided to maintain a single id for any given session, but didn't want to assume that the file would be gone by the time a new file was ready to be written, I needed a way to give the file a slightly different name each time around while maintaining the ability to group the files from any individual session. (That grouping has no bearing on this iteration, but I wouldn't be surprised to see it surface as a feature somewhere down the road.) This led to the development of the $pass counter.


The construction of the $file scalar demonstrates just how these considerations are implemented. The first portion, the quoted text string, represents the path to the file's location and the first four letters of the filename, "recs". Starting the filename with "recs" serves no real purpose beyond flagging the contents as records. Concatenated to that string are, in order, the $session_id, the $pass, and the string ".txt". If the $session_id were "007" (any real id would, of course, be far longer) the contents of the $file scalar on the first pass in this session would end in "recs0071.txt", and on the next time through "recs0072.txt", and so forth.
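
The same concatenation can be run on its own to see the names it yields. The helper name record_file is made up for this sketch; the script builds the string inline inside the store subroutine.

```perl
use strict;
use warnings;

# Hypothetical helper reproducing the concatenation from store.
sub record_file {
    my ($session_id, $pass) = @_;
    return '/home/www/save/recs' . $session_id . $pass . '.txt';
}

print record_file('007', 1), "\n";   # first pass of the session
print record_file('007', 2), "\n";   # second pass, same session
```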


In the next line the $recs filehandle is created as a FileHandle::Deluxe object. This is precisely the kind of context for which taint-checking was devised, because the screen is accepting input from the user. The FileHandle::Deluxe object creation is specifying, in essence, how the stream of data written to that object should be handled. In this case, I am saying that the object should be associated with the file specified in the $file scalar, that it should be opened in append mode (i.e., with read, write, and create permissions), that the directory "/home/www/save" should be considered safe, and that the file should be opened with an exclusive file lock. In general, what is being said here (among other things) is that the specified file, being written to a directory considered as safe, should be considered as safe from the standpoint of taint-checks. The exclusive lock that I place on the file has a different purpose, that I'll get to as I discuss the script that actually stores the records.
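
The script relies on the module's safe_dirs option here. For comparison only, the sketch below shows the other common way of satisfying taint checks, which is to pass the submitted value through a pattern and keep just the captured part. This is a different technique from the one the script uses, and the helper name untaint_digits is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value if it consists only of digits,
# otherwise undef. Under -T, text taken from a regex capture is treated
# as untainted, so the pattern should be as strict as possible.
sub untaint_digits {
    my ($value) = @_;
    return undef unless defined $value;
    return $value =~ /^(\d+)\z/ ? $1 : undef;
}

for my $candidate ('0071', '../../etc/passwd') {
    my $clean = untaint_digits($candidate);
    print defined $clean ? "accepted $clean\n" : "rejected input\n";
}
```

A session id that is nothing but digits cannot carry a path separator or a shell metacharacter into the filename, which is the point of being strict about the pattern.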


It is worth noting that taint-checking provides a good basis for securing a system from malicious commands submitted under the guise of posted data, but it is really only the first line of defense. For example, writing to a specified repository is fine, but if the normal system paths (/usr/bin, etc.) are visible in the environment the web server is still at risk from a piece of code that manages to avoid taint checks. Further steps to secure the system should involve removing standard search paths from the environment space visible to the script and processing the submitted data before it is written to the filehandle object. Since I'm still working on the assumption that the application is being developed on an internal network and thus in an environment that is at least quasi-trusted, I'm not going to implement those considerations in this implementation of the script. They will, however, be coming soon.
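
As a preview of the first of those steps, narrowing the environment usually amounts to a couple of lines near the top of a script. This is a sketch under assumptions: the helper name restrict_environment is invented here, and the list of variables follows the common perl security advice rather than anything in the script.

```perl
use strict;
use warnings;

# Hypothetical helper that narrows what the script's environment exposes.
sub restrict_environment {
    $ENV{PATH} = '';                            # no search path at all
    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};   # variables a shell may act on
    return;
}

restrict_environment();
print "search path cleared\n" if $ENV{PATH} eq '';
```

With an empty search path, any external program the script runs has to be named by its full path, which removes one way for an unexpected command to be found.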


Following filehandle object creation, the subroutine proceeds much as it did before. You may notice that I've changed the format of the output record from the comma-delimited form used previously to one in which the data elements are delimited by the string "ZzZ".

my $line=$ecj."ZzZ".$rcj."ZzZ".$pcj."ZzZ".$ercj."ZzZ".$etj."ZzZ\n";
As I was debugging the script that inserts the records it occurred to me that there are a number of record entry contexts in which it would be perfectly appropriate for the user to enter a comma into a string of text, and it is always possible for that to be done inadvertently. If I were to continue to assume a comma as a field delimiter, the parsing operation that read the line would take the portion of the string before the comma and erroneously assume that the portion after the comma belonged to the next data element, skewing what is stored in the entire record. I set about changing what is used to separate elements to different characters, only to quickly run afoul of the fact that most of the funny characters that I wanted to use as delimiters are also special characters for perl, so I'd have to escape that functionality when I expressed the delimiter in the split() function, and that would make the command far less legible for no real functional purpose. So I just cast about for a string that would have virtually no chance of occurring "in nature", and hit upon "ZzZ". Many others would have worked just as well. You may also notice that shortly after writing the lines to the output file I increment the $pass scalar, reflecting the fact that the entry of one screen of records is complete.
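
A quick round trip shows why the change helps. The helpers pack_record and unpack_record and the sample values are made up for this sketch; the script builds its line from the individual scalars shown above, and the reading side belongs to the insertion script discussed in the next section.

```perl
use strict;
use warnings;

my $delim = 'ZzZ';

# Hypothetical helper: join the fields and end the line as store does,
# with a trailing delimiter followed by a newline.
sub pack_record {
    return join($delim, @_) . $delim . "\n";
}

# Hypothetical helper: split a stored line back into its fields.
sub unpack_record {
    my ($line) = @_;
    chomp $line;
    # split discards the empty field left by the trailing delimiter
    return split /\Q$delim\E/, $line;
}

my $line   = pack_record('Smith, J.', 3, 1, 0, 2);
my @fields = unpack_record($line);
print scalar(@fields), " fields; first is '$fields[0]'\n";
```

The comma inside the first field survives intact, which is exactly what a comma-delimited line could not guarantee.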


Ok, so the script is writing a uniquely-named file into the storage directory with each pass. How do the records get into the database? That's the subject of the next section.




Inserting the Records