How to mount Azure Data Lake Storage Gen2 in Linux

Sometimes you need to fit a new brick into an old wall. In my case, that meant using an incredibly old Pentaho ETL with a brand-new Azure Data Lake Storage Gen2 without changing any pipelines. The old storage was SFTP-based and mounted into the local filesystem on the ETL machine. That machine runs CentOS 6.5 with no option to upgrade (for reasons that don't really matter here). Of course, the solution below also works for newer OSes; you just need to change the repos.

Install Cloudera CDH

CDH (Cloudera's Distribution Including Apache Hadoop) is an open-source Apache Hadoop distribution provided by Cloudera Inc., a Palo Alto-based American enterprise software company. It is the most complete, tested, and widely deployed distribution of Apache Hadoop.
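On CentOS 6 the relevant pieces are the CDH yum repository and the hadoop-hdfs-fuse package, which provides the hadoop-fuse-dfs command used below. A rough sketch of the installation (the repo URL is the historical CDH 5 location for RHEL/CentOS 6 and may have moved or may now require Cloudera credentials; adjust it for your OS and CDH version):

# add the Cloudera CDH 5 repo and install the mountable-HDFS package
sudo wget https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo -O /etc/yum.repos.d/cloudera-cdh5.repo
sudo yum install hadoop-hdfs-fuse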

Configure HDFS

Edit the /etc/hadoop/conf/core-site.xml config file with the editor of your choice, filling in the {{AAD_tenant_ID}}, {{client_id}}, and {{client_secret}} variables (service principal credentials) in the template below:
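A minimal sketch of that template, assuming the standard hadoop-azure ABFS OAuth properties and client-credentials (service principal) authentication; note that the abfss:// scheme only works with a Hadoop build that includes the hadoop-azure ABFS connector:

<configuration>
  <!-- authenticate to ADLS Gen2 with a service principal (OAuth client credentials) -->
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>https://login.microsoftonline.com/{{AAD_tenant_ID}}/oauth2/token</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>{{client_id}}</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>{{client_secret}}</value>
  </property>
</configuration>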

Mount HDFS endpoint

Now you can mount your ADLS Gen2 HDFS endpoint into your filesystem by filling in the {{storage_account}}, {{container/fs}}, and {{mount_point}} variables in the command template below:

hadoop-fuse-dfs abfss://{{container/fs}}@{{storage_account}}.dfs.core.windows.net /{{mount_point}}

Note that in ABFS URIs the container (filesystem) name comes before the @, and the storage account forms the host name.

You (probably) want CentOS to mount your ADLS Gen2 HDFS endpoint on every startup. You can do that with an /etc/fstab entry similar to this:

hadoop-fuse-dfs#abfss://{{container/fs}}@{{storage_account}}.dfs.core.windows.net {{mount_point}} fuse allow_other,usetrash,rw 2 0

In my case that does not apply, because I have a few filesystems that need to be mounted inside each other, so the mount order matters. To handle that, I added the hadoop-fuse-dfs commands in the proper order to the /etc/rc.local script.
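As a sketch of that approach, with purely illustrative filesystem names and mount points, the relevant part of /etc/rc.local could look like this:

# mount the outer filesystem first, then the one nested inside it
hadoop-fuse-dfs abfss://outer-fs@{{storage_account}}.dfs.core.windows.net /mnt/outer
hadoop-fuse-dfs abfss://inner-fs@{{storage_account}}.dfs.core.windows.net /mnt/outer/inner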

Optimizing Mountable HDFS

As you can find in the CDH documentation (both recommendations are put together in the sketch after this list):

  • Cloudera recommends that you use the -obig_writes option on kernels later than 2.6.26. This option allows for better performance of writes.
  • By default, the CDH package installation creates the /etc/default/hadoop-fuse file with a maximum heap size of 128 MB. You might need to change the JVM minimum and maximum heap size for better performance. For example: export LIBHDFS_OPTS="-Xms64m -Xmx256m". Be careful not to set the minimum to a higher value than the maximum.
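Putting both recommendations together for this setup might look roughly like this (the heap values are the ones from the documentation example, and whether big_writes is accepted depends on your kernel and FUSE version, so treat it as a sketch):

# /etc/default/hadoop-fuse: raise the libhdfs JVM heap (keep -Xms below -Xmx)
export LIBHDFS_OPTS="-Xms64m -Xmx256m"

# pass big_writes as a FUSE mount option, e.g. in the /etc/fstab options column
hadoop-fuse-dfs#abfss://{{container/fs}}@{{storage_account}}.dfs.core.windows.net {{mount_point}} fuse allow_other,usetrash,rw,big_writes 2 0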