Linking Cluster Storage Layers
For the `hms-mirror` process to work, the RIGHT cluster must be able to SEE and ACCESS data in the LEFT cluster's HDFS namespace. This is the same access/configuration required to support `distcp` in an HA environment, and it accounts for failovers.
We suggest that `distcp` operations be run from the RIGHT cluster, which usually has the higher 'hdfs' version in a migration scenario.
The RIGHT cluster's HCFS namespace requires access to the LEFT cluster's HCFS namespace. A RIGHT cluster with a higher HDFS version supports LIMITED functionality for data access in the LEFT cluster.
NOTE: This isn't designed to be a permanent solution and should only be used for testing and migration purposes.
Goal
What does it take to support HDFS visibility between these two clusters?
Can that integration be used to support the higher cluster's use of the lower cluster's HDFS layer for `distcp` AND Hive external table support?
Scenario #1
HDP 2.6.5 (Hadoop 2.7.x)
Kerberized - sharing same KDC as CDP Base Cluster
Configuration Changes
The namenode Kerberos principal MUST be changed from `nn` to `hdfs` to match the namenode principal of the CDP cluster.
Note: You may need to add/adjust the `auth_to_local` settings to match this change.
If this isn't done, `spark-shell` and `spark-submit` will fail to initialize. When changing this in Ambari on HDP, you will need to reset the HDFS `zkfc` HA zNode in Zookeeper and reinitialize `zkfc`.
From a Zookeeper client: `/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server localhost`
Initialize zkfc
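The reset can be sketched as follows; the `/hadoop-ha/<nameservice>` zNode path and the nameservice id are assumptions that must be adjusted to your environment:

```shell
# From the zkCli.sh session above, remove the stale HA state for the
# HDFS nameservice (replace <nameservice> with your nameservice id).
# Note: 'rmr' is deprecated in newer Zookeeper releases; use 'deleteall' there.
rmr /hadoop-ha/<nameservice>

# Then, as the 'hdfs' user on a namenode host, re-create the HA zNode
# (reflecting the new principal) before restarting the ZKFailoverControllers.
hdfs zkfc -formatZK
```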
core-site.xml
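The `auth_to_local` adjustment noted above lands here. A minimal sketch, assuming the realm `EXAMPLE.COM` (a placeholder) and default rules otherwise:

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <!-- Map the renamed namenode principal (hdfs/_HOST@EXAMPLE.COM) to the
       local 'hdfs' user; EXAMPLE.COM is a placeholder realm. -->
  <value>
    RULE:[2:$1@$0](hdfs@EXAMPLE.COM)s/.*/hdfs/
    DEFAULT
  </value>
</property>
```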
CDP 7.1.4 (Hadoop 3.1.x)
Kerberized, TLS Enabled
Configuration Changes
The following settings allow this (upper) cluster to negotiate and communicate with the lower environment.
Cluster Wide hdfs-site.xml Safety Valve
HDFS Service Advanced Config hdfs-site.xml
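As a sketch of what the Safety Valve entries might contain, the following declares the lower cluster's HA nameservice alongside the upper cluster's own, so HDFS clients on this cluster can resolve it. The nameservice ids `UPPER`/`LOWER`, hostnames, and port are assumptions:

```xml
<!-- Declare both nameservices; 'UPPER' is this cluster's own id,
     'LOWER' is the HDP cluster's id (placeholder names). -->
<property>
  <name>dfs.nameservices</name>
  <value>UPPER,LOWER</value>
</property>
<property>
  <name>dfs.ha.namenodes.LOWER</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.LOWER.nn1</name>
  <value>hdp-nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.LOWER.nn2</name>
  <value>hdp-nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.LOWER</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```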
Running `distcp` from the RIGHT Cluster
NOTE: Running `distcp` from the LEFT cluster isn't supported, since the HCFS client is not forward compatible.
Copy 'from' Lower Cluster
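A sketch of pulling data from the lower cluster, run on the RIGHT cluster; the nameservice ids `LOWER`/`UPPER` and the paths are placeholders for your environment:

```shell
# Run on the RIGHT (upper) cluster. The source references the lower
# cluster's nameservice as declared in the Safety Valve entries.
hadoop distcp \
  hdfs://LOWER/apps/hive/warehouse/my_db.db/my_table \
  hdfs://UPPER/warehouse/tablespace/external/hive/my_db.db/my_table
```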
Copy 'to' Lower Cluster
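The reverse direction is still run from the RIGHT cluster; only the source and target swap (same placeholder ids and paths):

```shell
# Still run on the RIGHT (upper) cluster; only the direction changes.
hadoop distcp \
  hdfs://UPPER/warehouse/tablespace/external/hive/my_db.db/my_table \
  hdfs://LOWER/apps/hive/warehouse/my_db.db/my_table
```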
Sourcing Data from Lower Cluster to Support Upper Cluster External Tables
Proxy Permissions
The lower cluster must allow the upper cluster's HiveServer2 host as a 'hive' proxy. The setting in the lower cluster's custom `core-site.xml` may limit this to that (lower) cluster's HS2 hosts. Open it up to include the upper cluster's HS2 host.
Custom core-site.xml in Lower Cluster
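A sketch of the proxyuser entries, assuming placeholder hostnames `hdp-hs2.example.com` (lower) and `cdp-hs2.example.com` (upper):

```xml
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <!-- Include BOTH the lower cluster's HS2 host and the upper
       cluster's HS2 host (placeholder hostnames). -->
  <value>hdp-hs2.example.com,cdp-hs2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
```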
Credentials from the 'upper' cluster will be projected down to the 'lower' cluster. The `hive` user in the upper cluster, when running with 'non-impersonation', will require access to the datasets in the lower cluster's HDFS.
For table creation in the 'upper' cluster's Metastore, a permissions check will be done against the lower environment's directory for the submitting user. So both the service user AND `hive` will require access to the directory location specified in the lower cluster.
When the two clusters share accounts, and the same accounts are used between environments for users and service accounts, access should be straightforward.
When a different set of accounts is used, the 'principal' from the upper cluster's 'hive' service account and the 'user' principal will be used in the lower cluster. This means additional HDFS policies in the lower cluster may be required to support this cross-environment work.