Import a CSV into MongoDB and keep it in sync

Goal

I’m working on a store and business locator service. Instead of building a user interface to add addresses to a database, I prefer to keep track of all the data in an Excel/Google Sheets spreadsheet. This article explains how I built a script to import a CSV into MongoDB and keep it in sync with that same CSV file.

I picked PHP because I have the most experience with it, and it runs on any cheap shared web hosting service. The downside of PHP is that it’s not really JSON-friendly (MongoDB queries use JSON-like notation), so you need to rewrite the MongoDB queries into object/array notation for PHP. I use the MongoDB PHP driver and a locally installed database (calling a remote Atlas instance felt too slow while developing). However, you can use the script with the free MongoDB Atlas tier too, which saves you the local installation step.

Why MongoDB?

I chose MongoDB as the database for several reasons:

  • Geo functionalities out-of-the-box.
  • Free tools with a pleasant UI, like MongoDB Compass.
  • Good documentation, tutorials, and a proven track record.
  • Easy (and free) to set up cloud service: MongoDB Atlas.
  • A document database seems a good fit for the project I’m working on.

The most common and universal spreadsheet export format is CSV (Comma-Separated Values), and there are plenty of ways to import a CSV into MongoDB. In MongoDB Compass, you can create a new collection and drop a CSV file onto it, and in the shell, you can run mongoimport.
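
For a quick one-off import, a mongoimport call could look like this (database, collection, and file names are placeholders; --headerline takes the field names from the first CSV row):

mongoimport --db mydatabase --collection mycollection --type csv --headerline --ignoreBlanks --file addresses.csv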

Requirements

  1. I want the spreadsheet to be the single source of truth (SSOT) for the address data.
    • Removing a row from the CSV should remove the corresponding document in MongoDB.
    • Removing a column from the CSV should remove that key from each MongoDB document.
    • Emptying a cell in the CSV should remove that key in MongoDB.
  2. I also want to clean up the CSV data:
    • Leave out empty fields (ignoreBlanks in mongoimport).
    • Merge the longitude and latitude fields into a single field to be used with a 2dsphere index (for fast geo lookups); see the sketch after this list.
  3. Automate it later on with a daily cronjob or another trigger.
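
To illustrate the longitude/latitude merge: the sketch below assumes hypothetical lng and lat columns in the CSV and combines them into a GeoJSON Point, the format a 2dsphere index expects (GeoJSON lists longitude first).

// assumes hypothetical CSV columns 'lng' and 'lat'
$doc['location'] = [
  'type'        => 'Point',
  'coordinates' => [ (float) $doc['lng'], (float) $doc['lat'] ],
];
unset($doc['lng'], $doc['lat']);

// create the 2dsphere index once, in the shell:
// db.mycollection.createIndex({ "location": "2dsphere" })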

Importing CSV data into MongoDB with mongoimport works perfectly. However, deleting data from MongoDB when it’s removed from the CSV file needs more processing, so I decided to write my own code. This also allowed me to learn more about MongoDB.

Steps

  • Read the CSV directly from Google Sheets and loop through the rows.
  • Check that the fields that serve as the unique index (address and city) exist; a minimal check is sketched after this list.
  • Check which fields are empty, remove them from the CSV row, and unset them in MongoDB.
  • Prepare the update/upsert query.
  • Run an aggregation pipeline to find documents that are present in MongoDB but no longer in the CSV, and prepare those for bulk deletion.
  • Bulk-write the update/delete queries to MongoDB.
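
The existence check from the second step can be a simple guard inside the row loop; a minimal sketch with the field names used throughout this article:

// skip rows that miss a field of the unique compound index;
// without address and city we can't build the upsert filter
if (empty($row['address']) || empty($row['city'])) {
  continue;
}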

Learnings

Set a default collation when creating a new collection

Adding addresses is a manual, human operation, so mistakes (double spaces, forgotten accents like é, stray capitals) can happen. We want to make sure addresses are unique and that the same address always points to the same information. For example, “Dam 1 Amsterdam” is the same as “dam   1 amsterdam”. To make sure string comparison goes right, you can set a collation in MongoDB:

Collation allows users to specify language-specific rules for string comparison, such as rules for lettercase and accent marks. (MongoDB collation documentation)

It’s best to set a default collation, so you don’t have to pass the collation parameters with each query. You can only do this when creating the collection.

Pick any locale; most languages have the same default collation parameters (like English (en) or Dutch (nl)).

Since I want lookups to ignore case and diacritics (like the acute accent in é), I picked strength: 1. I also ignore whitespace and punctuation via the alternate and maxVariable parameters.

MongoDB Compass has a built-in shell. Create (and switch to) the database with “use”, then create the collection, named “mycollection”:

use mydatabase

db.createCollection(
  "mycollection",
  { collation:
      { locale: "en", strength: 1, alternate: "shifted", maxVariable: "punct" }
  }
)

You can also create a new database and collection in one step in MongoDB Compass. Use the settings above when you select “Custom Collation.”

Even with a collation set, I prefer to clean up the data as much as possible in PHP. I lower-case all characters and collapse extra whitespace, since I need to clean the data for displaying online later anyway (I will use text-transform: capitalize; CSS on the address and city fields when displaying them).

// collapse repeated whitespace into a single space, trim, and lower-case
$row['address'] = strtolower(trim(preg_replace('/\s+/', ' ', $row['address'])));
$row['city']    = strtolower(trim(preg_replace('/\s+/', ' ', $row['city'])));

Unique compound index

MongoDB has a default unique index on the _id field. We don’t have that in our spreadsheet, which is the source of the data, so I needed to decide which fields make a row unique. For this address database, that’s the combination of the address (street + house number) and the city. There are cities with the same name, though, so we have to add a distinguishing indicator (like the province) to the city name.

Luckily MongoDB supports unique compound indexes:

If you use the unique constraint on a compound index, then MongoDB will enforce uniqueness on the combination of the index key values. (MongoDB unique indexes documentation)

In the mongo shell, you can run the operation below. You can also create the index in the “Indexes” tab in Compass.

db.mycollection.createIndex(
  { "address": 1, "city": 1 }, 
  { unique: true, name: "direction" } 
)

Empty fields

Updating or inserting based on CSV rows works perfectly with:

$manager = new MongoDB\Driver\Manager('mongodb://localhost:27017');
$mongo_bulk_write = new MongoDB\Driver\BulkWrite;

foreach ($csv as $doc) {
  $filter = [ 'address' => $doc['address'], 'city' => $doc['city'] ];
  $mongo_bulk_write->update(
    $filter,
    ['$set' => $doc],
    ['upsert' => true]
  );
}

$manager->executeBulkWrite('mydatabase.mycollection', $mongo_bulk_write);

This doesn’t deal with empty fields, though. To save space, I don’t include empty CSV cells in the document. So if the “address2” column or cell has no data, I leave out the “address2” key-value pair entirely (since the field is empty anyway).

A problem arises when a cell value is removed from the CSV: I explicitly have to unset those fields, otherwise stale keys are never removed from MongoDB on update.

// remove empty columns from the row and collect them for $unset
$unset = [];

foreach ($row as $key => $val) {
  if (empty($val)) {
    unset($row[$key]);
    // the value in an $unset document is ignored; any value works
    $unset[$key] = 1;
  }
}

// MongoDB rejects an empty $unset document,
// so only add it when there is something to remove
$update = [ '$set' => $row ];
if (!empty($unset)) {
  $update['$unset'] = $unset;
}

$mongo_bulk_write->update(
  $filter,
  $update,
  ['upsert' => true]
);

Remove a document in MongoDB when a row is removed from the CSV source

Upserting (updating, or inserting when no matching document is found) works with the above setup. The challenge is to remove documents from the MongoDB collection when rows are removed from the CSV file.

The idea is to look up documents that are no longer present as rows in the CSV but still exist in the MongoDB collection. We also have to handle address and city as a compound index.

This is solved by joining city and address into one field. In the for-loop, the city and address values are merged with # as a separator (so they can be split apart again later) and stored in an array.

// in the for-loop we merge city and address with # as separator
$csv_compound[] = $doc['city'].'#'.$doc['address'];

To check this array in MongoDB, I use an aggregation pipeline.
All credits go to this answer on StackOverflow.

First, a $group stage builds a list: using $addToSet with $concat, it creates the mdb_compound array of compound keys from the documents in MongoDB, so we can compare it with the csv_compound array. The $setDifference operator then computes the difference between the two arrays. The pipeline returns an array of the self-created compound keys that are still in MongoDB but no longer in the CSV.

// build the same city#address compound keys from the MongoDB documents
$group = ['$group' => [
  '_id' => null,
  'mdb_compound' => [ '$addToSet' => [ '$concat' => ['$city', '#', '$address'] ] ]
]];
// keep only the keys that are in MongoDB but not in the CSV
$project = ['$project' => [
  'difference_mdb-csv' => [ '$setDifference' => [ '$mdb_compound', $csv_compound ] ],
  '_id' => 0
]];
$pipeline = [$group, $project];
$cursor = $collection->aggregate($pipeline);

$remove_from_mdb = json_decode(MongoDB\BSON\toJSON(MongoDB\BSON\fromPHP($cursor->toArray())), true)[0]['difference_mdb-csv'];

This list is then used to run delete operations, again with a bulk write.

foreach ($remove_from_mdb as $doc) {
  $doc = explode('#', $doc);
  $filter = [ 'address' => $doc[1], 'city' => $doc[0] ];
  $mongo_bulk_write->delete($filter);
}

$manager->executeBulkWrite('mydatabase.mycollection', $mongo_bulk_write);

Publish the CSV in Google Sheets to import it into MongoDB

You don’t need an API to use Google Sheets as input. Just go to File > Publish to the web, pick the CSV option, and copy the URL. By fetching that URL in PHP, we import the CSV into MongoDB.
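
A minimal sketch of fetching and parsing the published CSV (the URL is a placeholder for the one Google gives you; this assumes allow_url_fopen is enabled and the first row holds the column names):

$url = 'https://docs.google.com/spreadsheets/d/e/.../pub?output=csv'; // placeholder

$handle = fopen($url, 'r');
$header = fgetcsv($handle);

$csv = [];
while (($fields = fgetcsv($handle)) !== false) {
  // map the header names onto each row: ['address' => ..., 'city' => ...]
  $csv[] = array_combine($header, $fields);
}
fclose($handle);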

I found that updates to the CSV are sometimes a bit delayed after saving, so you might have to wait a moment before running the PHP script. Of course, Sheets has plenty of scripting options of its own, so you could even run the same process I explained in PHP as an Apps Script. However, I wanted to make a more general solution that I can reuse later with a web drag/drop interface.

Export CSV from Google Sheets

Show me the code

You can find the complete script on GitHub, including the MongoDB dependency.