Fast Merging of Files in Azure Data Lake Gen2

So you got yourself a filesystem, huh?

Lately, I’ve found myself fielding requests from friends and users who have a particular problem: they’ve got themselves a data lake in Azure, and they can read and write files to it just fine. The problem, though, is that sometimes they need to take a series of files and mash them all together, or as the cool kids call it: concatenate them. And when it comes to third-party tools and methods that can do the trick, you’re spoiled for choice: Azure Data Factory, Spark via Databricks, or even PowerShell.

Case in point: I was working with someone who had tens of thousands of CSV files that they needed to merge together into one big file, but they were already out in their Azure storage account. That doesn’t sound so bad, does it? Well, consider this:

  • With native tools from Microsoft, like azcopy or the Azure CLI (az storage), these kinds of commands don’t exist. Go ahead, check. I’ll wait. Yes, append blobs let you append data as you copy files up to Azure, but there’s no clear-cut command that says “take a filesystem/subdirectory/file pattern and merge the results.” And even if there were, guess what: you’d still have to touch (read) every single file and then append the results.
  • The same thing goes for Azure Data Factory; yes, you can create a dataset that references one or more files and then use a copy activity with the “merge files” copy behavior. The problem is that Data Factory has to read every single line, render it as a row, and append the rows for each file. It’s not fast by any means, and it’s not cheap either (and don’t get me started on how much I hate data flows).
  • With Spark, things get a little better: you can “chunk” the data up and read it faster, but again, you’re stuck with Spark having to fully read each file, render it as a dataframe, and then write it out. And as much as I love Spark, getting a distributed compute framework to write one single file is, uh, fun. Still, this can be a cheaper route than Azure Data Factory, but not by some crazy ratio. Plus, you have to, you know, know how to write Spark code.

Oh and one other requirement? It has to be fast; if it’s going to take hours, then it’s not going to work for them. They want an efficient process that is as fast as possible, which (usually, but not always) means “as cheap as possible on cloud resources.”

cat got your tongue?

Linux, as always, has solved this problem for us. The cat command is a miracle of computing that lets you quickly and effectively read the contents of files and stream them to one target file. Best of all, it’s fast. For the sake of testing, I can take 20,000 512KB files (roughly 10GB total) and combine them into one:

time cat evdata* > combined.csv
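(If you want to recreate a similar pile of files to test with, one quick way is to chop one big CSV into 512KB chunks with GNU split — the source file name here is just a placeholder:)

split -b 512K -d -a 5 source.csv evdata    # produces evdata00000, evdata00001, ... from a ~10GB source.csv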

And the results are pretty good; my baseline for this activity takes roughly 1.75 minutes to merge the files (measured via the Linux time command):


But go back to the start of this blog post: My files aren’t in my local file system, so I can’t take advantage of cat, right?

blobfuse2: Electric Boogaloo

That’s where you’re wrong, kiddo: Microsoft maintains a virtual filesystem driver called blobfuse2 that allows you to mount your Azure Storage Accounts to Linux. Using this utility, we can mount the storage account in question, and then use the cat command to merge our files as if we were accessing them locally. Before we get into a demo though, understand that you can’t beat physics: using this utility from a non-cloud machine (like a local VM) might work, but you’d be missing the point here. For this scenario, we’re going to use an Azure Virtual Machine to do the work (which also lets us take advantage of Managed Identities to access the storage).

So how do you do this? Just follow these five easy steps:

  1. Spin up a Linux VM in Azure. For testing purposes, I’m just using a general purpose D2s v3 series VM. blobfuse2 does benefit from more memory and compute, so depending on your data sizes and volume, you may see better performance with a bigger VM. I don’t have super prescriptive guidance here; test it yourself and see.
  2. Once your VM is up and running, enable a system-assigned managed identity. Do it through the portal, PowerShell, whatever, just do it:



    Then, in the Access control (IAM) blade of the storage account you want to mount, make sure you grant the identity of your VM “Storage Blob Data Contributor” access. Then go grab a drink, as it may take a few minutes for that change to take effect:
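    If you’d rather script this step than click through the portal, a rough Azure CLI equivalent looks like the following (the resource group, VM, subscription, and storage account names are placeholders you’d swap in):

    # enable a system-assigned managed identity on the VM
    az vm identity assign --resource-group <resource-group> --name <vm-name>

    # grant that identity access to the storage account
    az role assignment create \
      --assignee <principal-id-from-the-previous-command> \
      --role "Storage Blob Data Contributor" \
      --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"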



  3. Next, use a package manager (apt on Ubuntu, dnf on Fedora, etc.) to install blobfuse2 from the Microsoft package repository:

    sudo wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb
    sudo dpkg -i packages-microsoft-prod.deb
    sudo apt-get update
    sudo apt-get install libfuse3-dev fuse3
    sudo apt-get install blobfuse2
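    # optional: sanity-check the install (version is a blobfuse2 subcommand; double-check against your release)
    blobfuse2 version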


  4. We should also create a local cache folder for blobfuse2 to use; you’ll see this path again in our configuration file. You’ll also need a local directory to mount the filesystem in (I’m just creating both in my home folder for the sake of brevity, but you can put them wherever you see fit and have permissions):

    mkdir .azcache    # local cache folder referenced in the config file
    mkdir az          # mount point for the storage account
  5. Create your blobfuse2 configuration file. This is the most crucial (and confusing) step: you need a configuration file that the blobfuse2 daemon uses to connect and authenticate to the storage account you want to mount. There are plenty of sample files out there, but this is what worked for me. There are A LOT of other options here, from proxy settings to cache sizes to block sizes, which you may (or may not) need to adjust as you test this out, but I’ll leave that to you. Once you have it created, go ahead and save it as a .yaml file; we’ll need it when we mount the storage account.

    Note the authentication mode here: msi. That means we don’t need to mess around with any credentials in our files; we’re just granting our VM’s identity access to the storage account.
logging:
  type: syslog
  level: log_debug

components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

libfuse:
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240

file_cache:
  path: /home/drew/.azcache
  timeout-sec: 120
  max-size-mb: 4096

attr_cache:
  timeout-sec: 7200

azstorage:
  type: adls
  account-name: dublindatasolutions
  endpoint: https://dublindatasolutions.dfs.core.windows.net
  mode: msi
  container: incoming

Taking a closer look here, what does this file actually do? It’s the file responsible for configuring how blobfuse2 will mount our storage. As I mentioned, there are a lot of additional configuration options and features in the tool, and undoubtedly there are more knobs and levers you can turn if you’re thirsty for more performance and/or control. The key parts of this file, aside from the managed identity authentication I already mentioned, are the storage account and container we’re connecting to and where we want our local cache to live. This is a great starting point if you want to try it yourself; just update the file_cache path to your own home directory and folder, along with your storage account details.

If everything worked so far, it’s time to actually mount the storage. In your terminal window, type the following:

blobfuse2 mount ./az --config-file ./baseConfig.yaml

All we have to provide to the command is the subfolder we want to mount our storage container(s) in and the path to the config file we saved. If everything worked, you should get dumped right back to a prompt. Now, let’s explore the mount directory we created:


We’re in. What you see here is everything that exists in the storage container I mounted from the storage account. You can also mount all of the containers in a storage account with mount all, if you prefer. The files I want to merge are in the ev_data folder in this example, and I’m not sure what good a screenshot of 20,000 files would be, so that’s all you get.
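For reference, the mount all variant (and the cleanup when you’re done) looks roughly like this; I’m going from the blobfuse2 documentation here, so double-check the syntax against your installed version:

# mount every container in the storage account under ./az (one subdirectory per container)
blobfuse2 mount all ./az --config-file ./baseConfig.yaml

# when you're finished, unmount everything
blobfuse2 unmount all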

Now that we’ve got our filesystem mounted, we can go ahead and run our cat command against the mounted path to merge the files.
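For concreteness, the command is the same as the local baseline, just pointed at the mount; the exact paths here are my assumption based on the mount point and ev_data folder above, and writing the output back through the mount lands the combined file in the storage account (you could just as easily write it to a local path instead):

time cat ./az/ev_data/evdata* > ./az/ev_data/combined.csv

You may be wondering how it performs. Using my same baseline test as above, here’s what I see, performance-wise: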


Almost 20 minutes to merge 10GB of files. I know that’s a lot slower than doing it locally, but like I said: we can’t beat the speed of light here, with the VM (even in Azure) going back and forth to a storage endpoint. That said, setup aside, this is probably going to be the most performant of the methods we talked about.

And the best part? Factor in the time it took to run my VM for this test and, rounding up to the nearest hour, this entire operation cost me a whopping $0.10. And since the VM and storage are in the same region, there’s no egress cost, either.

“But what about header rows?”

Shortly after this post originally went live, someone asked a good question: what about header rows in your CSVs? Don’t they all get merged in together? Or what if there aren’t headers, but you want to add a header row?

To be clear: the original intent of this post was to highlight how cool blobfuse2 is, but it’s a valid question. The problem that spurred this whole post involved data that didn’t have any header rows at all, nor did it need any. The data exchange that used the combined files had a pre-defined schema that was applied when the data was read back into the target system, so it wasn’t a big deal (as in, they would have skipped a header row if it were there). You might have different requirements, though, so let’s discuss some options.

Again, Linux to the rescue here: instead of cat, let’s use tail. tail normally prints the end of a file, but the trick is the -n parameter, which tells it where to start: pass it +2 and, instead of counting lines from the end, tail prints everything starting at line number 2, which skips the header sitting on line 1. Which is just awesome if you’re used to zero-based indexes your whole damned life.

Check out the results of cat vs tail below in my screen shot:

You can see the first, regular cat command prints out the whole file, while the tail command prints everything after the first line. And with -n +2, tail still outputs the rows in their original order.
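To apply that across the whole folder of files, a minimal sketch (assuming the same evdata* naming and mount path from earlier) is a simple loop; note that this spawns one tail process per file, so expect it to run slower than a plain cat:

# skip line 1 of every file and write the remaining rows to one combined file
for f in ./az/ev_data/evdata*; do
    tail -n +2 "$f"
done > combined.csv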

And if you don’t have a header and you want to add one? Just create a new file with a single row containing your header and drop it into the folder you’re combining. The files get processed in whatever order your file pattern expands them (alphabetical name order, typically), so just make sure your header file sorts “first.”
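A minimal sketch of that header trick, assuming you combine with a wildcard like ev_data/* (the file name and column names below are placeholders):

# name the header file so it both matches your wildcard and sorts ahead of the data files
printf 'col_a,col_b,col_c\n' > ./az/ev_data/0000_header.csv
cat ./az/ev_data/* > combined.csv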

Sometimes the old methods are best

When it comes to complex problems, you don’t always need the most modern solution. Commands like cat are really good at what they do. Instead of developing a whole new data engineering process to meet your needs, sometimes it’s just best to go with what works. And thanks to tools like blobfuse2, that process gets even easier in our increasingly cloudy world. Whether this is a one-time exercise to merge some data or something you want to automate on an ongoing basis, it’s a strong, efficient, and cost-effective solution to a problem you’d normally throw a lot of money at.

The biggest question you might have after reading this is: how could you operationalize this as part of a data engineering pipeline, or some other process? Since this is an OS-level command, you’d need a tool that can connect to your virtual machine and run the command there. It also depends heavily on what kind of workflow or orchestration tools you use; I can’t hazard too many guesses, but there are lots of ways to solve that problem.
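To make that concrete, the whole job condenses to a short shell script that an orchestrator could kick off over SSH (or something like Azure’s run-command for VMs). The paths, file names, and the unmount call here are my assumptions, so treat it as a sketch rather than a finished pipeline:

#!/usr/bin/env bash
set -euo pipefail

# mount the storage account, merge the files, then clean up
blobfuse2 mount ./az --config-file ./baseConfig.yaml
cat ./az/ev_data/evdata* > ./az/ev_data/combined.csv
blobfuse2 unmount ./az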

The other thing to call out here is that maybe this isn’t something you even need to do; the person I was helping had a super-specific requirement for this. The fact is that for most data ingestion pipelines, one file instead of multiple files isn’t really “a thing.”

And finally, on performance: Remember, in Azure, the more you spend on a VM, the better performance you get, and not just compute. Network performance is also tied to your VM class and size. Therefore, you may see better performance if you pony up a little more cash.

So, if you find yourself in a place where you need to concatenate a bunch of files in your cloud storage account and you want to do it quickly and efficiently, you may want to give this method a try.
