Genealogy and Computer Databases

by Gregory J Winters
June 2007


If some of you have been following the recent ups and downs of our website in regard to computer technology, then you might have a greater appreciation for what I'm going to set down in this space.  However, this discussion is meant for anyone who is currently dealing with genealogy software and systems or may be considering taking it up in the future.

This article will not be a genealogy 'how-to' session.  That has been taken care of many times over by the Site of All Genealogy Websites - Cyndi's List.  I encourage anyone who is considering taking up family history research to first visit this site.

Instead, I wish to speak about a subject that very little has been written about on the Internet (or elsewhere):  managing computer databases of genealogy research materials.  I have had to learn most of this the hard way and would like to pass along my experiences and conclusions so that others might not have to slog through the same bumpy roadway.

It is important to begin by explaining the concept of a database.  A database consists of a number of elements brought together in a system to assist a user with storing and retrieving data - it's that simple.  Databases consist of records which are the information units meaningful to the user, fields which store each datapoint of a record, tables which organize and store data in simple lists, indexes which provide unique identifiers to each record, and datatypes which control the sort of data which can be entered into the fields.

A simple example of a database is the common address book.  Here, a 'record' is one of the entries in the book, usually filed under a Last Name or Company Name.  An example of a 'field' would be a Last Name or a City or a Telephone Number.  You can see that there are many individual fields which make up one record.  The information which the fields contain is called a 'datapoint' because although it is essential to the overall record, it is not meaningful by itself.  (In other words, having a Name without contact information is just as non-useful as having a Telephone Number with no Name associated with it.)

The address book could be a little more complex than just a dictionary-style listing.  It could have elements assigned to some of the entries such as 'Personal' or 'Business' to designate the types of contacts, saving browsing time.  You could make holiday card lists by assigning Categories to various entries in the address book.  This data would be stored in 'tables' - a table of contact types.  (The address book itself is a table.)  The tables are indexed by some sort of unique identifier which will not allow duplicates of a record to be created.  (For example, it would be a mess if you had a friend's telephone numbers entered in two places, then could not remember which was the new one and which was the disconnected one.)  The indexes force you either to edit an existing record or to create a genuinely new one.  Finally, datatype controls prevent meaningless data from being entered and displayed in inappropriate fields.  For example, a Zip Code field for the U.S. consists solely of numbers, no letters.  A datatype control would not allow the user to enter letters in the field.
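To make these elements concrete, here is a minimal sketch of the address book using Python's built-in sqlite3 module.  The table name, the column names, and the sample contacts are my own assumptions for illustration; they do not come from any particular address book product.

```python
import sqlite3

# In-memory database; all table and column names here are illustrative.
book = sqlite3.connect(":memory:")
book.execute("""
    CREATE TABLE contacts (
        contact_id INTEGER PRIMARY KEY,   -- unique identifier for each record
        last_name  TEXT NOT NULL,
        first_name TEXT NOT NULL,
        phone      TEXT,
        zip_code   TEXT CHECK (length(zip_code) = 5
                               AND zip_code NOT GLOB '*[^0-9]*'),
        category   TEXT CHECK (category IN ('Personal', 'Business')),
        UNIQUE (last_name, first_name)    -- one record per person
    )
""")


def add_contact(last_name, first_name, phone, zip_code, category):
    """Try to add a record; return True if the database accepted it."""
    try:
        book.execute(
            "INSERT INTO contacts"
            " (last_name, first_name, phone, zip_code, category)"
            " VALUES (?, ?, ?, ?, ?)",
            (last_name, first_name, phone, zip_code, category),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised for a duplicate record or for data that fails a CHECK rule.
        return False


accepted = add_contact("Smith", "Ann", "555-0100", "60601", "Personal")
duplicate = add_contact("Smith", "Ann", "555-0199", "60601", "Personal")
bad_zip = add_contact("Jones", "Bob", "555-0142", "6O6O1", "Business")

print(accepted, duplicate, bad_zip)  # True False False
```

The second entry is refused because a record for that person already exists, so the telephone number would have to be changed by editing the existing record.  The third is refused because its Zip Code contains the letter O rather than the digit zero.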

This is but the briefest overview of the elements of a database, but it will suffice for our subject at hand.

Once a database is established, the user must have a way of communicating with it:  how does the user enter and retrieve data?  This software is known as the application or the user application interface (UI).  The application provides the windows and buttons and other screen objects which are linked to the database to allow the user to perform tasks with the data.  For example, if you've ever used the Windows Explorer, you have used an application interface.  The files you see are stored on your hard disk drive in a Microsoft file system known as NTFS (FAT32 for you folks with older Windows systems).  The Explorer displays these files in the form of folders or directories and provides simple tools for display, storage, and retrieval.

NOTE:  It is important to contrast the term database with another which is frequently (but incorrectly) substituted for it - relational database.  The former is concerned only with organized storage and classification characteristics (such as numbers, indexes, etc.).  Relational databases, however, include tools which enable the user to ask questions of the data based upon 'connections' (relations) which have been created between the tables - the basic storage units of a database.  This type of database will be considered beyond the scope of this discussion.

Back to the genealogy...

If you've read the opening section on this page about research, you are aware that to substantiate your research you must endeavor to cite the entries in your database.  Anyone can create a simple family tree, but how does someone know that the information is valid?  Without going into the basics of source citation, suffice it to say that the best a computer can do with physical evidence is reproduce images of it.  The artifacts themselves still exist in file cabinets and churches and courthouses and the like.  The trick is to use the computer to help you provide the references to the sources as well as help you organize the sources themselves.

Most of us have begun by utilizing some sort of popular family tree making software.  These software programs are database applications - they provide both the interface objects and behind-the-scenes storage of the data.  Most feature charts and reports and browsing tools, as well.  However, these applications are almost completely self-contained, that is, they work with what they are given.

Artifacts, on the other hand, are outside of the realm of the software and the computer entirely.  Yet, they must be managed somehow and in reference to the entries in the family tree software.  (This is something that seems to be lost on the developers of the software!)  The purpose of this article is to provide some tips on how to manage your images and other 'external' files with the computer even if that task must be outside of the family tree software you are using.

Tip #1:  Forget the Windows Explorer!  The Windows file system is set up to organize datafiles by file name.  It uses a simple alphanumeric approach which is fine for simple lists, but awful for anything more complex.  Let me provide an example.  Let's say you have an image file of your mother, so you name it MyMom.jpg and store it in some type of designated folder which also has a meaningful name, like PeoplePhotos.  Next, you have an image of your mom and dad, so now the image is entitled MyMomandDad.jpg.  Let's jump to something far more complex to make the point.

Let's say you have a scan of an obituary.  At first consideration, one might think that this is a no-brainer.  An obituary almost always pertains to the death of only one person, so you might create a folder entitled Obituaries and name the file AuntSoandSo.jpg and be done with it.  However, image files rarely serve just one purpose, and this is where the complexity begins.

I've used obituaries as sources for far more than simply death records of the person featured in the article.  As most of you are aware, good obits contain scads of good information about not only the subject, but his/her family members, as well.  I have obits in our Narrative citing births, deaths of others, marriages, offspring, locations, and a number of other datapoints about other people.  Now the question is far more problematic:  what to do about filing and managing that image file?

Let's say that the obit created ten separate sourcing events for you.  Well, do you create ten separate copies of the image file and store them in separate folders?  Are you creating birth, death, marriage, children, parents, events, military, baptisms, religious affiliation, and the myriad of other folders for each and every person you have in your family tree database and copying these files to every related folder?  This is a grossly inefficient way to manage your cited image files and other sources, yet that is precisely what is required when using the Windows Explorer to manage your files.

The next step would be to create generic folders in the Explorer then provide references to the files in your family tree software.  This also fails because of the multiplicity of uses that the image file may have.  Keep in mind that the software is based upon the records of individual members of your tree and cannot relate the common references to one another.  The software will manage the references, but not the files themselves.

This leads us to Tip #2:  purchase an inexpensive file management database.  After I did this, my task was dramatically simplified.  It required a few additional hard lessons, however.  First of all, because you are now using two databases, it is imperative that you establish a common ID system between them.  I had made the mistake of allowing Family Tree Maker to automatically assign Reference (ID) Numbers to my records then manage them dynamically.  This meant that the numbers of the records could change as FTM periodically 'cleaned up' the list.  What this did to my external images and artifacts database was break the links.

Make sure that your family tree software IDs are HARD CODED, that is, once they are established by either you or the software they do not change.  If you delete someone from your family tree application, then that number simply becomes available for a new entry - it is not assigned to an existing record.  Don't worry about the numbers themselves - they mean nothing in relation to one another.  They simply identify records as unique.
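The behavior described above can be sketched in a few lines of Python.  This is only an illustration under my own assumptions:  the class name PersonRegistry, its methods, and the sample names are hypothetical and are not how Family Tree Maker or any other product works internally.

```python
class PersonRegistry:
    """Hands out record IDs that never change once assigned."""

    def __init__(self):
        self._people = {}      # record ID -> name
        self._released = []    # IDs freed by deletions
        self._next_id = 1

    def add(self, name):
        # Reuse a released ID if one exists; otherwise take the next new one.
        if self._released:
            person_id = self._released.pop(0)
        else:
            person_id = self._next_id
            self._next_id += 1
        self._people[person_id] = name
        return person_id

    def delete(self, person_id):
        # Remaining records keep their IDs; nothing is renumbered.
        del self._people[person_id]
        self._released.append(person_id)

    def name_of(self, person_id):
        return self._people[person_id]


registry = PersonRegistry()
anna = registry.add("Anna")      # 1
bert = registry.add("Bert")      # 2
clara = registry.add("Clara")    # 3

# A link kept in the separate image database (file name -> record ID).
image_links = {"00017.jpg": clara}

registry.delete(bert)
dora = registry.add("Dora")      # takes the released ID, 2

print(registry.name_of(image_links["00017.jpg"]))  # Clara
```

Because deleting a record never renumbers the others, the link from 00017.jpg still resolves to the same person.  One caution with this approach:  before a released number is reused, any external links that pointed at the deleted record should be cleared, or they will silently point at the new entry.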

Once the ID system in your family tree software has been established, you can now set that identical system up in your file management software.  I have found that ACDSee Pro v8.0 best suits my needs.  The application interface is wonderful and the customer support is excellent.  (Real people answering the phones!)  In order to establish the virtual link between the two systems, I have to go inside ACDSee and create what are known in the database world as volumes, called simply 'folders' in the application.  A volume differs from a folder in that it is a raw storage unit not intended to be directly accessed.  In other words, it's like having a file cabinet that you don't open yourself;  instead, you instruct a secretary who is knowledgeable about the system to do it for you.

Since ACDSee is now managing the files, it is no longer necessary to invent file names and concepts in order to be able to store and retrieve data.  Myself, I have divided up the image files into three primary sets of volumes:  cemetery photos (because of the sheer size of the archive), images I own the masters of, and images I do not own the masters of.  It's that simple.  Since I do not anticipate ever getting to the point of having to manage more than 99,999 files, I have a file naming scheme that consists of five digits (starting at 00001.jpg) and simply goes up from there.
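A numbering scheme like this is easy to automate.  The following is a small sketch, not part of ACDSee;  the function name next_file_name and its parameters are my own invention, and it assumes every existing file already follows the numeric naming pattern.

```python
def next_file_name(existing_names, extension=".jpg", digits=5):
    """Return the next zero-padded sequential file name."""
    # Take the numeric part of each existing name, e.g. "00041.jpg" -> 41.
    used = [int(name.split(".")[0]) for name in existing_names]
    number = max(used, default=0) + 1
    if number >= 10 ** digits:
        raise ValueError("numbering scheme exhausted")
    return f"{number:0{digits}d}{extension}"


print(next_file_name([]))                          # 00001.jpg
print(next_file_name(["00001.jpg", "00002.jpg"]))  # 00003.jpg
```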

With the file manager, the name of the file is no longer important - what the file contains is.  Thus, I assign file 00001.jpg to its appropriate folder, then use the file manager application interface to assign all sorts of attributes to the file.  Attributes are facts and datapoints to assist the user in retrieving the file as easily as possible.  Unlike the folder and file name system, managing files by attributes is easy since there is no limit on the amount and type of data that can be assigned to each file.

For example, I have created sorting categories so that ACDSee can display meaningful groupings of images - those which might be contained in more than one folder.  I have categories such as 'Cemeteries,' 'Web Photos,' etc., which assist me in creating temporary sets of images allowing me to see what work I have done and where the image files are referenced.  I have assigned captions to each file which contain lengthy descriptions of the contents and characteristics of the images.  Some of these terms I have assigned to the Keywords tool which simplifies searching.
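The idea of retrieving files by attributes rather than by name can be sketched as follows.  This is a generic illustration and not ACDSee's actual data model;  the ImageRecord class, the folder labels, the captions, and the keywords are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    file_name: str                                  # carries no meaning itself
    folder: str                                     # one of the volume sets
    caption: str = ""
    categories: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


catalog = [
    ImageRecord("00001.jpg", "Cemetery Photos",
                caption="Headstone, family plot, north section",
                categories={"Cemeteries"},
                keywords={"headstone", "granite"}),
    ImageRecord("00002.jpg", "Masters Owned",
                caption="Scanned obituary clipping",
                categories={"Web Photos"},
                keywords={"obituary", "newspaper"}),
    ImageRecord("00003.jpg", "Masters Not Owned",
                caption="Cemetery gate, photo supplied by a cousin",
                categories={"Cemeteries", "Web Photos"},
                keywords={"gate"}),
]


def in_category(records, category):
    """File names of every image assigned to the category, in any folder."""
    return [r.file_name for r in records if category in r.categories]


def with_keyword(records, keyword):
    """File names of every image tagged with the keyword."""
    return [r.file_name for r in records if keyword in r.keywords]


print(in_category(catalog, "Cemeteries"))   # ['00001.jpg', '00003.jpg']
print(with_keyword(catalog, "obituary"))    # ['00002.jpg']
```

Note that the 'Cemeteries' grouping pulls images from two different folders, which is something a folder-and-file-name system cannot do without duplicating files.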

Furthermore (unlike Windows Explorer with its Microsoft proprietary structure), I can assign metatags to the files in formats which are accepted throughout the database and imaging industries.  Metatags (aka metadata) are details stored in binary form in a file's header - outside of the jurisdiction of any one software application.  This way, I can share files with important data connected to them with other folks who may not have the ACDSee application, and they will still be able to see that data.

Once the file manager database is established, then it's back to the family tree software to create the references.  Now, the task is easy:  you are free to create as many references to the object as you wish, without worry that you also have to manage that file within your family tree application.  If you are citing a birth record, for example, all you have to point to is something like 02345.jpg, and then if you have to actually work with the citation, your file managing application can call it up in a snap.  In the References section of the family tree software, you will create a Master Reference (complete with the required bibliography information) for your archive.  That becomes your Master Source for the record, and the individual files which are part of that archive become the citations.  It's that easy!
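The arrangement of one Master Source with many citations, several of which may point at the same image file, might be sketched like this using Python's built-in sqlite3 module.  The table layout, the person ID numbers, and the sample facts are assumptions for illustration only;  your family tree software will store its references in its own way.

```python
import sqlite3

tree_db = sqlite3.connect(":memory:")
tree_db.executescript("""
    CREATE TABLE master_sources (
        source_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    CREATE TABLE citations (
        citation_id INTEGER PRIMARY KEY,
        source_id   INTEGER NOT NULL REFERENCES master_sources(source_id),
        person_id   INTEGER NOT NULL,   -- stable ID from the family tree software
        fact        TEXT NOT NULL,
        file_name   TEXT NOT NULL       -- name assigned in the file manager
    );
""")

tree_db.execute("INSERT INTO master_sources VALUES (1, 'Family image archive')")
tree_db.executemany(
    "INSERT INTO citations (source_id, person_id, fact, file_name)"
    " VALUES (?, ?, ?, ?)",
    [
        (1, 101, "death", "02345.jpg"),     # the subject of an obituary
        (1, 101, "birth", "02345.jpg"),     # same image, different fact
        (1, 102, "marriage", "02345.jpg"),  # same image, different person
        (1, 103, "birth", "02346.jpg"),
    ],
)


def facts_citing(file_name):
    """List every (person_id, fact) pair that cites the given image file."""
    rows = tree_db.execute(
        "SELECT person_id, fact FROM citations"
        " WHERE file_name = ? ORDER BY citation_id",
        (file_name,),
    )
    return rows.fetchall()


print(facts_citing("02345.jpg"))
# [(101, 'death'), (101, 'birth'), (102, 'marriage')]
```

A single stored image serves three citations here, with no copies of the file and no folder per fact or per person.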

There are techniques that I have learned as to the finer points of managing external files, but I don't wish to go into them here.  Feel free to contact me, however, and I will do my best to assist you.