Let me explain how you can save a lot of developer hours by boiling those spreadsheets down and leaving only what’s useful for the project.
As an example, we’ll use Marv’s Hog Jowl Emporium, a completely real business that wasn’t made up on the spot. Marv Jr. is looking to modernize his family business by upgrading its Frontpage site with parallax scrolling backgrounds and slick data visualizations.
The Emporium uses Microsoft, so its data will come from a series of spreadsheets. This all looks great in Excel, but it’ll probably cause confusion when he sends it over to the agency.
So, at this point we have Marv’s data, but there are a few problems and ambiguities:
- The data is in proprietary binary format.
- The spreadsheet has 100,000 rows, 1200 of which are used.
- It uses lots of styling, frozen rows, formulas, images, etc.
- It has mysterious/unexplained/unintelligible columns.
- Its columns include data in different formats.
Let’s run through each, and I’ll explain some ways to make this more usable.
File type (xslb vs xslx)
The screenshot doesn’t really depict it, but the spreadsheet was saved as a xlsb file. Xlsb is a format that’s hard to work with for non-Excel users. Since web developers aren’t spreadsheet experts, they’ll try to open your data with Libre Office, Google Sheets, or some other free app that doesn’t read Microsoft’s proprietary xlsb format.
A great first step for Marv would be to save his spreadsheet as an xlsx, or csv file. This way, whoever gets the data will be able to work with it in pretty much any spreadsheet editor and any PHP developer can figure it out. Nice!
Empty rows
Now that the spreadsheet is in xlsx format, the file is gigantic. This is because a few years back, Marv’s analyst typo’d while entering a formula, which brought the spreadsheet up to 100,000 rows, and only about 1200 contain data.
Even though the file format is now readable, xlsx uses a lot more data than xlsb, so Google Sheets will likely freeze while attempting to import the file. So Marv (or his analyst) should definitely delete those 98,000 empty rows.
If the data actually does take up that many rows, Marv would send the file over in CSV format.
Styling, formulas, etc
Since neither of these apply once the data makes its way into mysql, neither will affect the process of getting it there, with a couple of exceptions:
- Different spreadsheet apps sometimes have different syntax for their formulas. So, if you have complicated formulas, it’s probably best to send a version of the spreadsheets with the values instead of formulas.
- Same as above if you are using Power Query or something like it to pull in data from some external site or API.
- Any notes or comments that exist outside of the tabular rows of data should be communicated via some other means, ideally in writing if they need to be part of the data’s presentation on your site. This is especially true for annotations added in Excel, since most other apps have trouble showing them.
Unexplained Columns
Any columns that don’t directly relate to what is being communicated with the visualization should be removed, if possible. For instance, in Marv’s spreadsheet do we really need J-14 and J-14a?
Let’s say those columns have to do with some jowl-processing machine, how far out of spec it is in micrometers, and whether it has received its required weekly maintenance.
Probably great to know for some specific person working at the Emporium, but it’s just confusing for outsiders. Best to remove it or give the devs a heads up that they can safely ignore those columns.
Format uniformity
Try to stick to one format in each column, especially if the column contains numerical values or dates. For example, check out Marv’s “signoff” column. Whatever method was used to enter these dates resulted in a column that will need to be fixed, so it’s a really good idea to get those outliers (“thursday”, “june 14th”) into the same format as all the rest of the signoff dates.
Tip: Generally it will be much more cost effective to clean your data in-house than to have an agency dev do it.
I admit the end result is pretty bland. Moral of the story here is that spreadsheets are to provide data to humans (for the most part) and databases are for computer consumption.
There’s no data visualization project that doesn’t involve at least a little bit of cleanup.
But following these guidelines should clear up a lot of ambiguity right off the bat, which is always going to save time for those involved.