Lesson 57: Spark SQL on Hive Configuration and Hands-On Practice
Published: 2025-12-03 · Author: 千家信息网 editor
1. First, install Hive. See http://lqding.blog.51cto.com/9123978/1750967 for the installation steps.
2. Add a configuration file under Spark's conf directory so that Spark can access Hive's metastore.
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vi hive-site.xml

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://spark-master:9083</value>
    <description>Thrift uri for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
</configuration>
3. Copy the MySQL JDBC driver into Spark's lib directory.
root@spark-master:/usr/local/hive/apache-hive-1.2.1/lib# cp mysql-connector-java-5.1.36-bin.jar /usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/
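As an alternative to copying the jar, Spark can be pointed at the driver through its class-path settings. A minimal sketch in conf/spark-defaults.conf, assuming the same jar path as above (spark.driver.extraClassPath and spark.executor.extraClassPath are standard Spark properties):

# conf/spark-defaults.conf — assumes the jar was left in Spark's lib directory
spark.driver.extraClassPath   /usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/mysql-connector-java-5.1.36-bin.jar
spark.executor.extraClassPath /usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/mysql-connector-java-5.1.36-bin.jar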
4. Start the Hive metastore service.
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive --service metastore &
[1] 20518
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Starting Hive Metastore Server
....
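Before starting spark-shell, it is worth confirming that the metastore is actually listening on port 9083 (a quick sanity check; the exact output will vary):

root@spark-master:~# netstat -nltp | grep 9083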
5. Start spark-shell.
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://spark-master:7077
Create a HiveContext:
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc);
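Note that Spark 1.6 binaries built with Hive support already expose this functionality: the sqlContext that spark-shell creates at startup is itself a HiveContext in such builds, so creating hc explicitly is mainly for clarity. A quick way to verify (assuming a Hive-enabled build):

scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]   // expected: true on a -Phive build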
Execute SQL:
scala> hc.sql("show tables").collect.foreach(println)[sougou,false][t1,false]scala> hc.sql("select count(*) from sougou").collect.foreach(println)16/03/14 23:15:58 INFO parse.ParseDriver: Parsing command: select count(*) from sougou16/03/14 23:16:00 INFO parse.ParseDriver: Parse Completed16/03/14 23:16:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 474.9 KB, free 474.9 KB)16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 41.6 KB, free 516.4 KB)16/03/14 23:16:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.199.100:41635 (size: 41.6 KB, free: 517.4 MB)16/03/14 23:16:02 INFO spark.SparkContext: Created broadcast 0 from collect at :3016/03/14 23:16:03 INFO mapred.FileInputFormat: Total input paths to process : 116/03/14 23:16:03 INFO spark.SparkContext: Starting job: collect at :3016/03/14 23:16:03 INFO scheduler.DAGScheduler: Registering RDD 5 (collect at :30)16/03/14 23:16:03 INFO scheduler.DAGScheduler: Got job 0 (collect at :30) with 1 output partitions16/03/14 23:16:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at :30)16/03/14 23:16:03 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)16/03/14 23:16:04 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)16/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at :30), which has no missing parents16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 13.8 KB, free 530.2 KB)16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.9 KB, free 537.1 KB)16/03/14 23:16:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.199.100:41635 (size: 6.9 KB, free: 517.4 MB)16/03/14 23:16:04 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:100616/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at :30)16/03/14 23:16:04 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, spark-worker2, partition 0,NODE_LOCAL, 2152 bytes)16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, spark-worker1, partition 1,NODE_LOCAL, 2152 bytes)16/03/14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker2:55899 (size: 6.9 KB, free: 146.2 MB)16/03/14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker1:38231 (size: 6.9 KB, free: 146.2 MB)16/03/14 23:16:09 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker1:38231 (size: 41.6 KB, free: 146.2 MB)16/03/14 23:16:10 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker2:55899 (size: 41.6 KB, free: 146.2 MB)16/03/14 23:16:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 12015 ms on spark-worker1 (1/2)16/03/14 23:16:16 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (collect at :30) finished in 12.351 s16/03/14 23:16:16 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 12341 ms on spark-worker2 (2/2)16/03/14 23:16:16 INFO 
scheduler.DAGScheduler: looking for newly runnable stages16/03/14 23:16:16 INFO scheduler.DAGScheduler: running: Set()16/03/14 23:16:16 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)16/03/14 23:16:16 INFO scheduler.DAGScheduler: failed: Set()16/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at collect at :30), which has no missing parents16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/03/14 23:16:16 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.9 KB, free 550.1 KB)16/03/14 23:16:16 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 6.4 KB, free 556.5 KB)16/03/14 23:16:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.199.100:41635 (size: 6.4 KB, free: 517.4 MB)16/03/14 23:16:16 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:100616/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at collect at :30)16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks16/03/14 23:16:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, spark-worker1, partition 0,NODE_LOCAL, 1999 bytes)16/03/14 23:16:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on spark-worker1:38231 (size: 6.4 KB, free: 146.1 MB)16/03/14 23:16:17 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to spark-worker1:4356816/03/14 23:16:17 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158 bytes16/03/14 23:16:18 INFO scheduler.DAGScheduler: ResultStage 1 (collect at :30) finished in 1.288 s16/03/14 23:16:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1279 ms on spark-worker1 (1/1)16/03/14 23:16:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/03/14 23:16:18 INFO scheduler.DAGScheduler: Job 0 finished: collect at :30, took 14.285673 s[1000000] 跟Hive相比,速度是有所提升的。如果是复杂的语句,相比hive速度将更加的快。
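If the same table will be queried repeatedly, it can also be cached in memory first, which speeds up every scan after the first one. A minimal optional sketch using the standard cacheTable API (memory permitting):

scala> hc.cacheTable("sougou")                                          // lazy: the cache fills on the next scan
scala> hc.sql("select count(*) from sougou").collect.foreach(println)   // this scan populates the cache; later queries read from memory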
scala> hc.sql("select word,count(*) cnt from sougou group by word order by cnt desc limit 5").collect.foreach(println)....16/03/14 23:19:16 INFO scheduler.DAGScheduler: ResultStage 3 (collect at :30) finished in 11.900 s16/03/14 23:19:16 INFO scheduler.DAGScheduler: Job 1 finished: collect at :30, took 17.925094 s16/03/14 23:19:16 INFO scheduler.TaskSetManager: Finished task 195.0 in stage 3.0 (TID 200) in 696 ms on spark-worker2 (200/200)16/03/14 23:19:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool [百度,7564][baidu,3652][人体艺术,2786][馆陶县县长闫宁的父亲,2388][4399小游戏,2119] 之前使用Hive需要跑将近110s,而使用Spark SQL仅需17s