We have recently been working with Presto and configuring its Hive connector. The connection succeeded by following the steps given at prestodb.io/docs/current/connector/hive.html. As an overview of our architecture: Presto runs on its own machine (the Presto Machine) and uses the Hive connector to communicate with a Hadoop cluster running on separate machines. The Presto Machine has a hive.properties file, which tells Presto to use a Thrift connection to the Hive client, plus the hdfs-site.xml and core-site.xml files for HDFS.
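For reference, a minimal sketch of what such a hive.properties file might contain is shown here. The property names are the ones listed in the Presto Hive connector documentation, where the Thrift URI points at the Hive metastore. The host name, port, and file paths are placeholders, not values from our environment.

```
# etc/catalog/hive.properties on the Presto Machine
# Host, port, and paths below are placeholders; substitute your own.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
```

The hive.config.resources entry is how the copied core-site.xml and hdfs-site.xml files are made visible to the connector.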
Below is the architecture of our environment.
Below is the command to invoke the Presto CLI:
./presto --server XX.X.X.XX:9080 --catalog hive
No presto user exists in my Hadoop environment. Everything works as documented, and the Presto Machine/CLI can query data from the Hive database.
However, the documentation leaves one question unanswered: which Hadoop user does Presto connect as?
Presto is using HiveServer2, and the data resides in Hive. I dug further into the Ambari configuration and found that hive.server2.enable.doAs is set to “false”, which means that HiveServer2 runs MR jobs in HDFS as the “hive” user. Permissions on the HDFS files related to Hive can therefore be granted only to the “hive” user. We can call this configuration HiveServer2 access with limited HDFS access. With this default configuration, the data is visible to any system on the same network that simply runs Presto (or any other connector that uses HiveServer2).
Now let us say we would like to protect our data. The best way to protect the Hive CLI is to enable permissions on the HDFS files and folders mapped to the Hive databases and tables.
The other option for protecting our data over HiveServer2 is to use the Ranger Hive plugin. To secure the metastore, it is also recommended to turn on storage-based authorization. The configuration change is as follows: in hive-site.xml, or in Ambari -> Hive -> Config, ensure that hive.server2.enable.doAs is set to “true”.
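As a rough sketch, the corresponding hive-site.xml entries could look like the fragment below. The doAs property is the one named above. The two storage-based authorization properties are taken from the Apache Hive documentation rather than from our cluster, so treat them as an assumption and verify them against your Hive version (in Ambari they are normally set through the Hive configuration screens rather than by editing the file by hand).

```xml
<!-- hive-site.xml: run HiveServer2 queries as the connecting (original) user -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

<!-- Storage-based authorization for the metastore (names per the Apache Hive docs; verify for your version) -->
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
</property>
<property>
  <name>hive.security.metastore.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>
```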
Setting hive.server2.enable.doAs to “true” means that HiveServer2 will run MR jobs in HDFS as the original user. Make sure to restart the Hive service in Ambari after changing any configuration. In Ranger, within HDFS, create permissions for the files pertaining to Hive tables, granting appropriate permission on the file corresponding to each Hive table. Users can then access the data through HDFS commands as well. Finally, check the audit logs in Ranger: you will see audit entries in Hive and HDFS with the original user’s ID.
Happy Data security!!!