How to Run Spark Machine Learning Code in Eclipse
Published: 2025-12-03 Author: 千家信息网 editor
Last updated by 千家信息网: 2025-12-03
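This article shows how to run Spark machine learning code in Eclipse.
The code runs directly inside Eclipse. No Hadoop installation and no Spark cluster setup are needed, as long as pom.xml declares the complete set of dependencies.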
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object MLlib {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Book example: Scala").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("file:/Users/xxx/Documents/hadoopTools/scala/eclipse/Eclipse.app/Contents/MacOS/workspace/spark_ml/src/main/resources/files/spam.txt")
    val ham = sc.textFile("file:/Users/xxx/Documents/hadoopTools/scala/eclipse/Eclipse.app/Contents/MacOS/workspace/spark_ml/src/main/resources/files/ham.txt")

    // Create a HashingTF instance to map email text to vectors of 100 features.
    val tf = new HashingTF(numFeatures = 100)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)

    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
    val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

    sc.stop()
  }
}

The argument to sc.textFile is the absolute path of the file on the local machine.
setMaster("local[2]") runs Spark locally, inside the same JVM, using only two worker threads (two cores).
HashingTF builds term-frequency feature vectors from documents; here the vector dimension is set to 100.
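To make the idea behind HashingTF concrete, here is a minimal plain-Scala sketch of the hashing trick. It is an illustration only, not Spark's actual implementation: the object HashingSketch and its methods are hypothetical names, and using String.hashCode with a non-negative modulo is an assumption about how a term might be mapped to a bucket.

```scala
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures) using a non-negative modulo of its hash code.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how often each bucket is hit; the result is a dense term-frequency vector.
  def transform(words: Seq[String], numFeatures: Int): Array[Double] = {
    val vec = Array.fill(numFeatures)(0.0)
    words.foreach(w => vec(indexOf(w, numFeatures)) += 1.0)
    vec
  }
}
```

For example, HashingSketch.transform("get cheap stuff cheap".split(" ").toSeq, 100) returns a vector of length 100 whose entries sum to 4, with the bucket for "cheap" counted at least twice. With only 100 buckets, different words can collide in the same bucket, which is the price paid for a fixed-size vector that needs no vocabulary.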
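TF-IDF (term frequency-inverse document frequency) is a feature vectorization method widely used in text mining. It reflects how important a word is to a document in a corpus. Let t denote a word, d a document, and D the corpus; the document frequency DF(t, D) is then the number of documents that contain t. If importance were measured by term frequency alone, words that are repeated often but carry little information, such as "a", "the", and "of", would be overemphasized. A word that appears frequently across the whole corpus carries no information specific to any particular document. Inverse document frequency (IDF) is a numerical measure of how much information a word carries.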
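The definitions above can be made concrete with a small plain-Scala sketch. It assumes the smoothed formula given in the Spark MLlib documentation, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), with TF-IDF(t, d, D) = TF(t, d) * IDF(t, D). The object TfIdfSketch and its method names are hypothetical, and raw counts are used as TF for simplicity; the example program above uses term frequencies only and does not apply IDF.

```scala
object TfIdfSketch {
  // DF(t, D): number of documents in the corpus that contain term t.
  def docFreq(term: String, corpus: Seq[Seq[String]]): Int =
    corpus.count(_.contains(term))

  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)); the +1 terms avoid division by zero.
  def idf(term: String, corpus: Seq[Seq[String]]): Double =
    math.log((corpus.size + 1.0) / (docFreq(term, corpus) + 1.0))

  // TF-IDF(t, d, D) = TF(t, d) * IDF(t, D), where TF is the raw count of t in d.
  def tfIdf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double =
    doc.count(_ == term) * idf(term, corpus)
}
```

In a three-document corpus where "the" appears in every document, idf gives log(4 / 4) = 0, so "the" contributes nothing, while a word that appears in only one document gets log(4 / 2) > 0.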
pom.xml
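<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.yanan.spark_maven</groupId>
  <artifactId>spark1.3.1</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>spark_maven</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- A further property with the value 1.9.13 is set here; its name is not shown. -->
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/repo/</url>
    </repository>
  </repositories>
</project>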
ham.txt
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014! Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
spam.txt
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to send you money via wire transfer so please ...
Get Vi_agra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really cheap and never have to ...
Note: in the original data file, Vi_agra is written without the underscore.