Lately, I’ve found myself fielding a few requests from friends and users who have a particular problem: they’ve got themselves a data lake in Azure, and they can read and write files to it just fine. The problem, though, is that sometimes they need to take a series of files and mash them all together, or as the cool kids call it: concatenate them. And when it comes to third-party tools and methods that can do the trick, you’re spoiled for choice: Azure Data Factory, Spark via Databricks, or even PowerShell.
Case in point: I was working with someone who had tens of thousands of CSV files that they needed to merge together into one big file, but they were already out in their Azure storage account. That doesn’t sound so bad, does it? Well, consider this:
Oh and one other requirement? It has to be fast; if it’s going to take hours, then it’s not going to work for them. They want an efficient process that is as fast as possible, which (usually, but not always) means “as cheap as possible on cloud resources.”
cat got your tongue?

Linux, as always, has solved this problem for us. The cat command is a miracle of computing that allows you to quickly and effectively get the contents of files and stream them to one target file. Best of all, it’s fast. For the sake of testing, I can take 20,000 512KB files (roughly 10GB in total) and combine them into one:
time cat evdata* > combined.csv
And the results are pretty good; my baseline for this activity takes roughly 1.75 minutes to merge the files (measured via the Linux time command):
But go back to the start of this blog post: my files aren’t in my local file system, so I can’t take advantage of cat, right?
blobfuse2: Electric Boogaloo

That’s where you’re wrong, kiddo: Microsoft maintains a virtual filesystem driver called blobfuse2 that allows you to mount your Azure Storage Accounts on Linux. Using this utility, we can mount the storage account in question and then use the cat command to merge our files as if we were accessing them locally. Before we get into a demo, though, understand that you can’t beat physics: using this utility to access a cloud resource from a non-cloud resource (like a local VM) might work, but you’d be missing the point here. For this scenario, we’re going to use an Azure Virtual Machine to do the work (which also lets us take advantage of Managed Identities to access the storage).
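If you do go the Managed Identity route, the VM’s identity needs data-plane access to the storage account before blobfuse2 can authenticate with it. Here’s a minimal sketch using the Azure CLI; the resource group and VM names are placeholders I made up, and the scope just points at the storage account from my config:

# Enable a system-assigned identity on the VM and capture its principal ID
principalId=$(az vm identity assign \
  --resource-group my-rg --name my-merge-vm \
  --query systemAssignedIdentity --output tsv)

# Grant that identity data access to the storage account
az role assignment create \
  --assignee "$principalId" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/dublindatasolutions"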
So how do you do this? Just follow these five easy steps:

1. Provision an Azure Virtual Machine running Ubuntu 20.04. blobfuse2 does do well with more memory and compute, so depending on your data sizes and volume, you can see better performance with a bigger VM. I don’t have super prescriptive guidance here; test it yourself and see.

2. Install blobfuse2 from the Microsoft package repository:

sudo wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
sudo apt-get update
sudo apt-get install libfuse3-dev fuse3
sudo apt-get install blobfuse2

3. Create a local directory for blobfuse2 to use as its file cache:

mkdir .azcache

4. Create a configuration file (mine is saved as baseConfig.yaml) that tells blobfuse2 how to connect to your storage account:

logging:
  type: syslog
  level: log_debug
  components:
    - libfuse
    - file_cache
    - attr_cache
    - azstorage

libfuse:
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240

file_cache:
  path: /home/drew/.azcache
  timeout-sec: 120
  max-size-mb: 4096

attr_cache:
  timeout-sec: 7200

azstorage:
  type: adls
  account-name: dublindatasolutions
  endpoint: https://dublindatasolutions.dfs.core.windows.net
  mode: msi
  container: incoming

5. Mount the storage account, which we’ll walk through next.
Taking a closer look here, what does this file actually do? It’s the file responsible for configuring how blobfuse2 will mount our storage. As I mentioned, there are a lot of additional configuration options and features in the tool, and undoubtedly there are more knobs and levers you can turn if you’re thirsty for more performance and/or control. The key parts of this file, aside from the managed identity authentication I mentioned already, are which storage account and container we’re connecting to, and where we want to put our local cache. This is a great starting point if you want to try it yourself; just update the file_cache location to your own home directory and folder, as well as your storage account details.
If everything worked, it’s time to create the mount. In your terminal window, type the following:
blobfuse2 mount ./az --config-file ./baseConfig.yaml
All we have to provide to the command is the subfolder we want to mount our storage container(s) in, and the path to the config file we saved. If everything worked, you should get dumped right back to a prompt.
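If you’d like to confirm the mount actually took before you start poking around, a quick df against the mount point from the command above will show it:

df -h ./az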
Now, let’s explore the mount directory we created, and… we’re in. Everything you see in it is what exists in the storage container I mounted from my storage account. You can also mount every container in a storage account at once with mount all, if you prefer. The files I want to merge are in the ev_data folder in this example, and I’m not sure what good a screenshot of 20,000 files would be, so that’s all you get.
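If you want a quick sanity check on the file count without scrolling forever, a one-liner does it (the ./az mount point and ev_data folder match my example above):

ls ./az/ev_data | wc -l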
Now that we’ve got our filesystem mounted, we can go ahead and run our cat command against the mount to merge the files. You may be wondering how it performs. Well, using my same baseline test as above, here’s what I see, performance-wise:
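For reference, the command is the same as before, just pointed at the mount (the paths and output file name here are my assumptions, matching the examples above):

time cat ./az/ev_data/evdata* > ./az/ev_data/combined.csv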
Almost 20 minutes to merge 10GB of files. I know that’s a lot slower than doing it locally, but like I said: we can’t beat the speed of light here, with the VM (even in Azure) going back and forth to a storage endpoint. That said, compared to other methods (setup aside), this is probably going to be the most performant option for you.
And the best part? If you factor in the time it takes to run my VM to do this test, it cost me next to nothing. Also, since the VM and storage are in the same region, there’s no egress cost, either. Rounding up to the nearest hour, this entire operation cost me a whopping $0.10.
Shortly after this post originally went live, someone asked the question: what about header rows in your CSVs? Don’t they all get merged together? Or what if there aren’t headers, but you want to add a header row?
To be clear: the original intent of this post was to highlight how cool blobfuse2 is, but it’s a valid question. The problem that spurred this whole post didn’t involve header rows at all, nor did it need any. The data exchange that used the combined files had a pre-defined schema that was applied when the data was read back into the target system, so it wasn’t a big deal (as in, they would have just skipped a header row if it was there). You might have different requirements, though, so let’s discuss some options.
Again, Linux to the rescue here: instead of cat, let’s use tail. Tail is like cat, except that it’s built to print the end of a file, so if you want to skip the header row, you just tell it to print everything except the first line. The trick is the -n parameter, which tells it what line to start at, and we’ll use +2 to tell it to start at line number 2 (since line number 1 is the header). Which is just awesome if you’re used to zero-based indexes your whole damned life.
Check out the results of cat vs. tail below in my screenshot:
You can see the first, regular cat command prints out the whole file, while the tail command prints out everything after the first line. Even though tail is normally used to read from the bottom of a file, with -n + it still outputs the file in order, starting at the line you give it.
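If you want to combine all your files while dropping each one’s header row, a small loop does the trick. This is just a minimal sketch: the ./az mount point and ev_data folder match my earlier examples, and the evdata file pattern and output name are assumptions.

# Append every file, minus its first line, to a single combined file
for f in ./az/ev_data/evdata*; do
  tail -n +2 "$f" >> combined.csv
done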
And if you don’t have a header and you want to add one? Just create a new file with one row that has your header, and then make it part of the folder you want to combine your files in. cat will process the files in whatever order the shell sorts their names, so just make sure it’s the “first” file.
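For example, something like this (the column names here are made up for illustration; the leading zeros just make sure the header file sorts ahead of the data files):

# Drop a one-line header file into the folder so it sorts first
echo "station_id,reading_time,kwh" > ./az/ev_data/000_header.csv

# Then concatenate everything, header file included
cat ./az/ev_data/* > combined.csv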
When it comes to complex problems, you don’t always need the most modern solution. Commands like cat are really good at what they do. Instead of developing a whole new data engineering process to meet your needs, sometimes it’s just best to go with what works. And thanks to tools like blobfuse2, that process gets even easier in our increasingly cloudy world. Whether this is a one-time exercise to merge some data, or something you wanted to automate on an ongoing basis, it’s a strong, efficient, and cost-effective solution to a problem you’d normally throw a lot of money at.
The biggest question you might have after reading this is: how could you operationalize this as part of a data engineering pipeline, or some other process? Since this is an OS-level command, you’d need a tool that can connect to your virtual machine and run the command on it. It also depends heavily on what kind of orchestration or data engineering tools you use; I can’t hazard too many guesses, but there are lots of ways to solve that problem.
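One possibility, assuming whatever you orchestrate with can call the Azure CLI: push the command to the VM with az vm run-command. The resource group and VM names below are placeholders, and the paths assume the mount from earlier in this post:

az vm run-command invoke \
  --resource-group my-rg \
  --name my-merge-vm \
  --command-id RunShellScript \
  --scripts "cat /home/drew/az/ev_data/evdata* > /home/drew/az/ev_data/combined.csv"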
The other thing to call out here is that maybe this isn’t something you even need to do; the person I was helping had a super-specific requirement for it. The fact is that for most data ingestion pipelines, one file instead of many isn’t really “a thing.”
And finally, on performance: Remember, in Azure, the more you spend on a VM, the better performance you get, and not just compute. Network performance is also tied to your VM class and size. Therefore, you may see better performance if you pony up a little more cash.
So, if you find yourself in a place where you need to concatenate a bunch of files in your cloud storage account and you want to do it quickly and efficiently, you may want to give this method a try.