Document Home

Designing the Baseball Database

It's time to launch into creating the database schema for our baseball example. I'm not going to worry too much about the formal terminology of this process, but I will include links to more detailed discussions on the links page. This is really a very intuitive process once you get the hang of it. You might be better off following this example and then check out the more detailed treatments if you want, but the choice is your own. A word to the wise, if you ever do this for someone else you should probably produce at least a basic set of documents formalizing your understanding of that structure and get a sign-off on it. Even if you have an understanding with the people for whom you are producing something, this is a good idea if for no other reason than it is good practice, and it avoids the look of disappointment that can result when someone has inflated expectations of what you are going to do. Remember, this might be a learning experience for them as well. A lot of people look on computers as magic, and don't realize that the development process is truly evolutionary. As I've said before, sometimes you have to give people something before they realize that there are things they haven't told you.


OK, let's start working through this. Obviously, in baseball we have teams, and one of the things we said that we were interested in is constructing some sort of report to send to each team's manager. In the more formal language of the development process, this would be called a system requirement. Therefore, we're going to have some sort of record with the fax number and the manager's name. This might lead us to a structure something like this:

TEAM
team character 30
fax_num character 10
manager_fname character 15
manager_lname character 15


Now the way that a relational database works is through the use of a key field that allows you to construct a composite record of rows from different databases. So let's put one of those in:

TEAM
team_id character 3
team character 30
fax_num character 10
manager_fname character 15
manager_lname character 15


Teams play other teams in games. So let's set up a game table (ar, ar....seriously folks, I've got a million of 'em)

GAME

game_id char 3
date date
home_team character 2
away_team character 2
home_ump character 30
first_ump character 30
second_ump character 30
third_ump character 30


This gives us tables for the teams and the games. but not for players. We'll take care of that right away.

PLAYER

player_id character 5
player_lname character 15
player_fname character 15


Now that we have the players, we have to have some way to associate them with a team and a game.

PLAYER_GAME

player_id character 5
game_id character 5
team_id character 3
postion character 15
at_bats integer 2
singles integer 1
doubles integer 1
triples integer 1
home_runs integer 1
walks integer 1
on_by_error integer 1
errors_committed integer 1



Now on some levels it is perfectly acceptable to leave your data in a structure like this. There are some types of queries that are easier to make against such a design. If you look at it closely, though, you'll realize that there are a number of levels of detail collapsed in this structure. For example, there is no way to tell what happened in a given inning, there is no provision for substitution or players moving between positions, and if we wanted to find every game umpired by Mothra we'd have to search four seperate columns. There is not even any way to change managers without losing the name of the previous manager.


I'll jump right to the final structure for the database without going through all of the intermediate steps. You can probably see that in each iteration above we were, in effect, refining the granularity of the individual record. What we are really doing is going to the finest level of detail we can reasonably expect to maintain, leaving the task of storing a retrieving the data to the computer. After all, that's what it is for. The irony about this is that a fine levels of detail are generally easier to program. Remember algorithms? Since the structure of our data is very fine, all of the hard stuff will be handled in the background. If you design the database appropriately, the code you write will be lean, clean, and much easier to maintain. Hopefully this will be obvious as we get into later sections. This is a table-driven system, so the information about the system is maintained by tables in the system.



The funny pitchfork-like connectors represent a one-to-many relationship. That is, for each row in the participant table there are potentially multiple rows in the participant_game table. (There are canned packages with nice representations of this diagram element. Unfortunately, I don't have any of them. You don't like my artwork?) A major benefit of a system like this is that it is readily customizable by the users ... the level of the information maintained is determined by the specificity of the data in the various tables. By adding the appropriate rows to the apropriate tables, one could easily record individual pitches, or even shifts in defensive alignments. Although we've not really factored it into this specific table layout it's also possible to store in a system table information on how individual data elements should be processed. As you continue with this example you should keep that in mind, because such information is the basis of much inter-site communication on the web.


In this structure the table around which the application revolves (the participant_game table) is the one which maintains the discrete events associated with the subject area, the ball game. (You should recognize that this statement is purely to help you visualize the structure of the database. There is nothing inherent in the nature of a relational database system that requires anything like a central database forming a hub for the database structure. It is, however, a fairly common and very efficient structure for certain types of data, and many large systems are aggregates of many subsystems of similar structure.) The fields (or columns) in table participant_game are shown at right.. The first, key, is simply a unique identifier for any given row in this table, and as such is known as the primary key for the table. (While it is possible to have primary keys that allow duplicate values, this is considered to be very bad form. I doubt that we will get sufficiently deep into relational theory here to show you why this is the case. You can either take my word for it at this point, or research the issue yourself. Both are very wise courses of action. (grin)). In a context like this, it is perfectly acceptable to use an auto-incrementing field to automatically generate a unique identifier, but I personally prefer to create unique keys by concantenating (sticking together) several of the identifying columns in the table. My rationale for this is simple ... sometimes you find yourself in a situation in which the records in a table are not stored in the order in which you need to use them, and when that happens you are going to have to append a condition to the statement that retrieves the records sprcifying the order in which you want them returned. If the clause by which the records are retrieved has several conditions it is going to be slower to execute than one constructed of just one condition. It's that simple. Therefore I try to anticipate the default order in which I might want to retrieve the records. For example, if I wanted to look at what one player did by inning I can just make sure the database has an index on the key and select for that participant_id. (This will be more clear as we get into later sections.) The net effect of that might be a small difference, depending on the manner in which the database engine implements the condition, but the gain is real. Within the constraints imposed by your hardware, the performance of any application is determined precisely in this fashion. In this instance, the value stored in the column key is constructed by concantenating the values in the columns game_id,inning, and inning_order.


You will probably notice that six of the ten columns in this table are coded. In the context of this table these columns represent what are known as foreign keys, allowing a relationship (a join) to be created with another table. Most of the tables in the system are just look-up tables, holding descriptions of the coded value. Why do this? Beyond the savings of space associated with storing a 2 byte code instead of a 30 byte description several hundred thousand times, retrieving records based on that two byte code is much more efficient than retrieving them based on a 30 byte description, even if it is always spelled consistently.


Yoy may haved noticed in the paragraph above that I went from talking about players to talking about participants. If you also noticed that the manager's name is not longer included in the team table you probably picked up on what I'm going to do. In this structure, you would include individuals like managers and umpires in the participant table, with the role they are assume in a given game indicated by the role_code in the relevant records in the participant game table. If you think about it a little bit you'll be able to see the pwer of this approach. Virtually nothing about it is hard-wired to the characteristics of the Mega-Monsters Major League. Since players are only associated with teams in the participant_game table, participants can move from team to team between games, or even change status from player to umpire to manager in three different games. As a result the system would be well-suited to an organization that runs leagues with several age classes. You may notice that I put columns home_team and away_team in the game table rather than completely normalizing the design with a game_team structure. In essence, since there would be only two records per game in such a table, normalizing to that extent would return no benefit. The point at which you stop normalizing generally seems to be a matter of personal preference. I tend to take it further than many, my own perspective is that a normalized design has sufficient advantages in flexibility to override whatever advantages a less-normalized design has for summarization and analysis. My feeling is that I can always create scripts to create less-normalized structures to facilitate such activities, but one creates clunky, inefficient program structures when ou use a poorly-normalized database to store your transactions. I can cope with the first by leaving a p75 executing scripts all the time. The price for the second lies in applications that are slow to execute and difficult to maintain.


Other elements of the database design will become more clear as we implement the application around it. The script bbdb_create will create the tables in your database. Take a look at it now, you'll get more details on its operation as we begin the section on HTML Documents and Simple CGI Scripts. (As a take-home exercise, can you spot the table created in that script that does not exist in the database structure illustration? Hurry now, the clock's ticking.)