How do you explore data ? What tools do you need to find insight and connect with other data points or systems? Do you need to output manipulated data?

I am currently working on a project that involves cataloging several hundred thousand items (products in this particular case). The supplied data files contain about 20,000 lines each making manipulation with tools like excel bulky and cumbersome. (at least on my desktop!)

For my client this data has to be analyzed, organized, manipulated and distributed. I could use a tool like Power BI to explore the data however I still would need to import the data to DB for my particular use case ( mainly because I need to do a lot of manipulation and I do not like to connect 50MB+ excel files to BI tools, they get to slow for my taste)

MongoDB ( or Hu”mongo”us DB) came out around 2009 and is a non-relational database with no defined Schema allowing you to freely search and manipulate the data. The flexible Schema means that importing my CSV data into Mongo is a breeze with GUI tools like Studio3T, as long as I define my field Types at setup correctly ( number should be integers and text should be strings etc). I also don’t have to worry about setting up correct columns to match those of my data file, I can simply have the database create a “document” for each row and add all the columns “properties” that the product has. Eventually I will need to nest product properties inside themselves for variations so this distinction from Excel is important.

In NoSQL (non-relational) database terminology we no longer have tables, columns and rows since each entry is stored as a separate entity called a document which in turn belongs to a collection. Each document in the collection then has a series of key:value pairs (or properties). These properties can be as simple as a integer or as complex as a nested Object or Array, allowing for immensely complex but easily traversed data structures.

With a MongoDB instance up and running I can traverse my data in nearly infinite ways. I can also use pure Javascript to write out scripts to rather easily manage and manipulate the data. For my purposes I need to do some complex updates and additions to the data in order to output it into an API for addition to the customer’s eCommerce platform , which I am also designing. However I won’t be going into that in this article.

There are various tools you can use to work with MongoDB, including those that offer a GUI (graphic user interface). I decided to try out one such tool this week, Studio 3T, a direct DB connection that can be used with most of the major Database as a Service providers. I am currently using the Mongo’s teams MongoDB Atlas for this project as the entry tier is free and it scales quickly and easily (oh yeah it’s built by the people who built Mongo so I assume it utilizes the database’s best practices). It is not the cheapest service as leaders in the space since companies such as mLab have larger free tiers.

Connection to Studio3T is as easy as getting your connection URI string and authorization credentials and plugging them in!

Set a unique name for the connection. Then click the From URI button to bring up your connection URI input. If you are using MongoDB Atlas your 3.6+ driver will look something like this:

mongodb+srv://adminUser:<PASSWORD>@DATABASE-10pit.mongodb.net/test?retryWrites=true

Click Ok and move over to the Authentication tab and select SCRAM-SHA-1 from the dropdown. Enter the username and password you setup when setting up Atlas (hint you should have setup the admin user and remembered to allow access from your ip). Enter the Authentication DB which is admin if you are using Atlas. You should now be able to connect to MongoDB Atlas.

Now I have a query builder and a intelliShell scripting window. In addition I still have full command line access and ability to export my results out to JSON to CSV formats. Now whether I am developing a data model for the application or just looking to output some compiled stats on the data, I can do so quickly and easily from MongoDB. Additionally, since this is now a database I can connect other BI tools such as MS Power BI or SAP BO for quick data mining and reporting.

As you can see above Studio 3T parsed and returned the full dataset in just under 1 sec. Running Excel on a similar computer lead to 30s+ delays opening and filtering data.

When I applied a query filter to look for just those document that have a “null” aperture and a “4in” width it took just about 1.2s. This is without creating an index from any data fields yet! When working with excel and trying combine filters, my computer was freezing regularly which is in and of itself a great reason to learn MongoDB!

I have been using Studio 3T for just a couple days but I have noticed the application hanging sometimes, even when it wasn’t being requested to do anything. It may just be my computer with too much running, a conflicting port or it could be the software. After I use it for a few weeks and figure out the nuances I will post an update here. Using the command line on the same dataset however was stable and quick. However learning the command line for new users may not be the easiest but offers the best performance.

Of course it’s not so easy to go from Excel to using a database if you are unfamiliar with programming but it certainly worth the effort of learning so that you can open up a whole new world of data!

What tools do you use to mine and manipulate data ? What is your use case? I love hearing about the latest and greatest so feel free to drop me a line!