Sometimes you need to fit a new brick into an old wall. In my case, I needed to use an incredibly old Pentaho ETL with a brand-new Azure Data Lake Storage Gen2 without changing any pipeline. The old storage was SFTP-based and mounted into the local filesystem on the ETL machine. The machine runs CentOS 6.5 with no option to upgrade (for some reason; it doesn't really matter why). Of course, the solution below will also work on newer OSes; you just need to change the repos.
Install Cloudera CDH
CDH (Cloudera's Distribution Including Apache Hadoop) is an open-source Apache Hadoop distribution provided by Cloudera Inc., a Palo Alto-based American enterprise software company. It is the most complete, tested, and widely deployed distribution of Apache Hadoop.
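On CentOS 6 the simplest route is the CDH 5 yum repository. A minimal installation sketch (the repository URL and exact package set are assumptions; adjust them for your OS and CDH version, note that Cloudera's archive may now require authentication, and a supported JDK must already be installed):
# add the CDH 5 repository for RHEL/CentOS 6 (adjust the URL for your OS/CDH version)
sudo wget https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo -O /etc/yum.repos.d/cloudera-cdh5.repo
# install the Hadoop client plus the FUSE connector that ships the hadoop-fuse-dfs command
sudo yum install -y hadoop-client hadoop-hdfs-fuse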
Configure HDFS
Edit the /etc/hadoop/conf/core-site.xml config file with the editor you like, filling in the {{AAD_tenant_ID}}, {{client_id}} and {{client_secret}} variables (Service Principal credentials) in the template below:
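A minimal sketch of such a template, assuming the standard hadoop-azure OAuth 2.0 client-credentials properties for ABFS:
<configuration>
  <!-- Authenticate to ADLS Gen2 with a Service Principal (OAuth 2.0 client credentials) -->
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>https://login.microsoftonline.com/{{AAD_tenant_ID}}/oauth2/token</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>{{client_id}}</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>{{client_secret}}</value>
  </property>
</configuration>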
Mount HDFS endpoint
Now you can mount your ADLS Gen2 HDFS endpoint in your filesystem, filling in the {{storage_account}}, {{container/fs}} and {{mount_point}} variables in the command template below:
hadoop-fuse-dfs abfss://{{container/fs}}@{{storage_account}}.dfs.core.windows.net /{{mount_point}}
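For instance, with a hypothetical storage account mydatalake, a container (filesystem) named data and a mount point /mnt/adls, the command would be:
hadoop-fuse-dfs abfss://data@mydatalake.dfs.core.windows.net /mnt/adls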
You (probably) want CentOS to mount your ADLS Gen2 HDFS endpoint on every startup. You can do it using /etc/fstab:
hadoop-fuse-dfs#abfss://{{container/fs}}@{{storage_account}}.dfs.core.windows.net {{mount_point}} fuse allow_other,usetrash,rw 2 0
or similar.
In my case this does not apply, because I have a few filesystems that need to be mounted inside each other, so the order of mounting matters. To handle that, I added the hadoop-fuse-dfs commands in the proper order to the /etc/rc.local script, as sketched below.
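A sketch of that ordering with purely hypothetical account, container and mount point names (the inner mount point lives under the outer one, so the outer filesystem has to be mounted first):
# /etc/rc.local: mount the outer filesystem before the one nested inside it
hadoop-fuse-dfs abfss://outer@mydatalake.dfs.core.windows.net /mnt/outer
hadoop-fuse-dfs abfss://inner@mydatalake.dfs.core.windows.net /mnt/outer/inner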
Optimizing Mountable HDFS
As you can find in the CDH documentation:
- Cloudera recommends that you use the -obig_writes option on kernels later than 2.6.26. This option allows for better performance of writes.
- By default, the CDH package installation creates the /etc/default/hadoop-fuse file with a maximum heap size of 128 MB. You might need to change the JVM minimum and maximum heap size for better performance, for example export LIBHDFS_OPTS="-Xms64m -Xmx256m". Be careful not to set the minimum to a higher value than the maximum. A combined sketch of both tweaks follows this list.
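Putting both recommendations together, a sketch of the two changes (the heap values are just the documentation's example; tune them for your workload):
# /etc/default/hadoop-fuse: raise the libhdfs JVM heap (default maximum is 128 MB)
export LIBHDFS_OPTS="-Xms64m -Xmx256m"
# /etc/fstab: add big_writes to the mount options (the -o prefix is implied in fstab)
hadoop-fuse-dfs#abfss://{{container/fs}}@{{storage_account}}.dfs.core.windows.net {{mount_point}} fuse allow_other,usetrash,rw,big_writes 2 0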