In the past, I have worked with databases in order to query information and with data analysis tools in order to explore data relationships. I want to start collecting my own data of interest so that I can explore analysis in unique ways. This is my first attempt to collect and store data for such future projects. I want to start off with something relatively straightforward, so I am choosing a readily available data set to work with. I am going to be taking name data collected by the US Social Security Administration and adding it to a database.
To begin with, the data I am accessing is available at https://www.ssa.gov/oact/babynames/limits.html. I am only interested in the National Data at the moment, but may also grab the state-specific data later.
The data comes as a zip file containing a text file for every year of data since 1880. There is also a README supplied describing the format of each text file.
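According to the README, each line of a year's file holds a name, a sex code, and an occurrence count separated by commas. The sample line below is illustrative, not taken from the actual data:

```python
# Illustrative line in the README's name,sex,count format (values made up here)
line = "Mary,F,7065"
name, sex, count = line.split(',')
count = int(count)
```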
The goal of this project is to convert the data from a series of text files into a database. I want to be able to write Python, PHP, and SQL queries against the data and may attempt to write an API that can be used from a page on this site.
I am using SQLite to store this data as a database.
- Import python packages
```python
import sqlite3
import os
```
- Create a database connection
```python
# Connect to or create the database
conn = sqlite3.connect('names4.db')
c = conn.cursor()
```
- Create a table
```python
# Create table
c.execute('''CREATE TABLE IF NOT EXISTS namestats (
             id INTEGER PRIMARY KEY,
             name TEXT,
             sex TEXT,
             occurrence INTEGER,
             year INTEGER)''')
```
- Read from the table to determine the next ID
```python
# Determine the next available ID
c.execute('''SELECT id FROM namestats''')
ids = c.fetchall()
idlist = []
for item in ids:
    idlist.append(item[0])  # fetchall() returns one-element tuples
idlist.sort()
if len(idlist) == 0:
    lastid = 0
else:
    lastid = idlist[-1] + 1
```
- Read the files in a given folder and create a list of the file names
```python
# Note the r to denote the path as a raw string
files = os.listdir(r"C:\Users\file\path")
filelist = []
for item in files:
    if item[-3:] == 'txt':
        filelist.append(item)
filelist.sort()
```
- Open, read, and add the contents of each file to a list of rows. The inner for loop builds a row from the parsed column data. The file contents provide the name, sex, and occurrence information, while the filename provides the year. The index and year are combined with the other data to create each final row.
```python
rowinfo = []
for file in filelist:
    entries = tuple(open(r"C:\Users\file\path" + "\\" + file, 'r'))
    filename = file[3:7]  # e.g. 'yob1880.txt' -> '1880'
    for index, item in enumerate(entries, start=lastid):
        name, sex, occ = item.split(',')
        rowinfo.append([index, name, sex, int(occ.rstrip()), int(filename)])
    lastid = index + 1
```
- Add the collected rows to the table
```python
c.executemany("INSERT INTO namestats VALUES (?,?,?,?,?)", rowinfo)
```
- Commit and Close the database connection
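A minimal sketch of this final step, reusing the `conn` connection opened earlier:

```python
import sqlite3

conn = sqlite3.connect('names4.db')  # same connection as in the earlier steps
# ... table creation and inserts happen here ...
conn.commit()  # write any pending INSERTs to disk
conn.close()   # release the database file
```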
- SQLite was a good choice to test the script functionality. It is lightweight and doesn’t require an installed service to run.
- SQLite was not a good choice for the real database storage. The number of entries ended up being more than 9 million, which made the table too large to work with. When attempting to read the table back with Python, the connector was never actually able to return any results. In the future, I will only utilize the SQLite functionality to test the parsing and formatting of my data collection on a subset of the intended data.
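One pattern that avoids a single enormous `fetchall()` is to pull rows in batches with `fetchmany`. The sketch below is self-contained, using an in-memory database with made-up sample rows; against the real `namestats` table the pattern would be identical:

```python
import sqlite3

# In-memory database with a few sample rows (values are made up for the sketch)
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''CREATE TABLE namestats (id INTEGER PRIMARY KEY, name TEXT,
             sex TEXT, occurrence INTEGER, year INTEGER)''')
c.executemany('INSERT INTO namestats VALUES (?,?,?,?,?)',
              [(i, 'Name%d' % i, 'F', i * 10, 1880) for i in range(25)])

total = 0
c.execute('SELECT occurrence FROM namestats')
while True:
    batch = c.fetchmany(10)  # stream rows in chunks instead of one big fetchall()
    if not batch:
        break
    total += sum(row[0] for row in batch)
conn.close()
```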
- After some thought, I realized that appending to a single, growing list was a drain on resources, so I will revise the for loop to write each file's results to the database before building the list for the next file. This will allow for iterative writing to the database rather than one large "executemany" clause.
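The revised loop might look like the sketch below. To keep it self-contained, the file contents are stood in by a hypothetical dictionary rather than read from disk, and an in-memory database stands in for names4.db; the per-file batching is the point:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for names4.db
c = conn.cursor()
c.execute('''CREATE TABLE namestats (id INTEGER PRIMARY KEY, name TEXT,
             sex TEXT, occurrence INTEGER, year INTEGER)''')

# Hypothetical per-file lines standing in for the parsed yobYYYY.txt contents
files = {'yob1880.txt': ['John,M,9655\n', 'Mary,F,7065\n'],
         'yob1881.txt': ['John,M,8769\n']}

lastid = 0
for fname in sorted(files):
    year = int(fname[3:7])
    rowinfo = []  # fresh batch for each file
    for index, item in enumerate(files[fname], start=lastid):
        name, sex, occ = item.split(',')
        rowinfo.append([index, name, sex, int(occ.rstrip()), year])
    lastid = index + 1
    c.executemany('INSERT INTO namestats VALUES (?,?,?,?,?)', rowinfo)
    conn.commit()  # write each file's rows before moving to the next
```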
Ultimately, this was successful; however, the result was not useful for my particular scenario. The SQLite approach is handy for quick testing, but I am interested in doing something with the data in a web-ready format. Since I want to move this database to my hosted web server, I need to get the data into a MySQL format. I did not find a straightforward way to convert between the formats, so I am going to rewrite the script to store the data in MySQL in a future post.