{"id":330015,"date":"2022-02-22T21:00:48","date_gmt":"2022-02-22T21:00:48","guid":{"rendered":"http:\/\/savepearlharbor.com\/?p=330015"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=330015","title":{"rendered":"<span>Apache Spark<\/span>"},"content":{"rendered":"<div><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<figure class=\"full-width\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/03e\/908\/e97\/03e908e97c5ea5ad2cd93f0c24d7e639.png\" width=\"780\" height=\"439\" data-src=\"https:\/\/habrastorage.org\/getpro\/habr\/upload_files\/03e\/908\/e97\/03e908e97c5ea5ad2cd93f0c24d7e639.png\"\/><figcaption><\/figcaption><\/figure>\n<p>Hello, Habr. 
We are sharing an original article by OTUS instructor Vadim Zaigrin.<\/p>\n<h3>Apache Spark<\/h3>\n<p><a href=\"http:\/\/spark.apache.org\/\"><u>Apache Spark<\/u><\/a>\u00a0is a distributed data processing framework that has become a de facto standard for big data processing.<\/p>\n<p>Spark consists of several components, including machine learning libraries.<\/p>\n<figure class=\"full-width\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/e61\/de0\/97f\/e61de097ff4f435c96268e7fff8bc02f.png\" alt=\"Spark stack\" title=\"Spark stack\" width=\"633\" height=\"298\" data-src=\"https:\/\/habrastorage.org\/getpro\/habr\/upload_files\/e61\/de0\/97f\/e61de097ff4f435c96268e7fff8bc02f.png\"\/><figcaption>Spark stack<\/figcaption><\/figure>\n<p><strong>Spark ML<\/strong>\u00a0provides a basic set of machine learning tools:<\/p>\n<ul>\n<li>\n<p>Algorithms for classification, regression, clustering, and collaborative filtering.<\/p>\n<\/li>\n<li>\n<p>Feature engineering methods.<\/p>\n<\/li>\n<li>\n<p>Pipelines.<\/p>\n<\/li>\n<li>\n<p>Saving and loading of models and pipelines.<\/p>\n<\/li>\n<li>\n<p>Utilities: linear algebra, statistics, data handling, and so on.<\/p>\n<\/li>\n<\/ul>\n<p>Compared with other machine learning libraries such as\u00a0<a href=\"http:\/\/scikit-learn.org\/\"><u>scikit-learn<\/u><\/a>, the set of algorithms in Spark ML looks more modest, but it covers the main methods. In addition, Spark ML lets you add your own methods and implement missing algorithms.<\/p>\n<p><strong>Spark ML<\/strong>\u00a0consists of two libraries:<\/p>\n<ul>\n<li>\n<p>spark.ml \u2013 a machine learning library built on the DataFrame API;<\/p>\n<\/li>\n<li>\n<p>spark.mllib \u2013 a machine learning library built on the RDD API.<\/p>\n<\/li>\n<\/ul>\n<p>Starting with version 2.0, spark.ml is the primary library, but spark.mllib still contains data types that are used in spark.ml.<\/p>\n<p>Both variants of Spark ML are well described in the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-guide.html\"><u>documentation<\/u><\/a>, so I will not retell it here. Instead, let's look at how to work with Spark ML using a concrete example.<\/p>\n<h3>Downloading Spark<\/h3>\n<p>Spark can be run in local mode, without a cluster. This lets you get to know the API and see how it behaves in practice.<\/p>\n<p>Spark runs on the JVM. 
Therefore, to run jobs and develop applications, the JDK must be installed on your computer, the path to\u00a0<em>java<\/em>\u00a0must be in the PATH variable, and the JAVA_HOME variable must be set.<\/p>\n<p>To run Spark in local mode, do the following:<\/p>\n<ol>\n<li>\n<p>Download the Spark distribution to your computer:\u00a0<a href=\"http:\/\/spark.apache.org\/downloads.html\"><u>http:\/\/spark.apache.org\/downloads.html<\/u><\/a><\/p>\n<ul>\n<li>\n<p>From the list of versions, choose the one used at your workplace. 
If Spark is not used at your workplace and you simply want to learn it, it is better to download the latest version.<\/p>\n<\/li>\n<li>\n<p>Besides the version of Spark itself, you can choose which Hadoop libraries are bundled. Since we are going to run Spark locally, the \u201cPre-built with user-provided Apache Hadoop\u201d option does not suit us, because in that case we would also have to download and install the Hadoop libraries separately. 
Choose one of the \u201cPre-built for Apache Hadoop \u2026\u201d packages.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>Unpack the archive, for example into the\u00a0<code>\/opt\/spark<\/code> directory.<\/p>\n<\/li>\n<li>\n<p>If desired, change the default settings. They are located in the\u00a0<code>conf<\/code> directory:<\/p>\n<ul>\n<li>\n<p><em>log4j.properties<\/em>\u00a0\u2013 logging settings (for example, replace INFO with WARN);<\/p>\n<\/li>\n<li>\n<p><em>spark-defaults.conf<\/em>\u00a0\u2013 spark-submit settings (for example, increase the driver memory).<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>Set the SPARK_HOME variable.<\/p>\n<\/li>\n<\/ol>\n<h3>Running Spark<\/h3>\n<p>Before writing and compiling a program for Spark, it is a good idea to work with it interactively (in a REPL).<\/p>\n<p>There are several options for this:<\/p>\n<ul>\n<li>\n<p>spark-shell (pyspark): a console Scala\/Python REPL with Spark preconfigured. It is included in the Spark distribution. It is inconvenient for long sessions.<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/zeppelin.apache.org\/\"><u>Apache Zeppelin<\/u><\/a>: a browser-based notebook service. It supports a large number of interpreters, including Spark, Scala, and Python. 
It is convenient because, like the standard console REPL, it provides a preconfigured Spark.<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/livy.apache.org\/\"><u>Apache Livy<\/u><\/a>: a REST service for Spark. It lets you submit jobs and work interactively.<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/toree.apache.org\/\"><u>Apache Toree<\/u><\/a>: a Jupyter Notebook kernel for working with Spark.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/almond.sh\/\"><u>Almond<\/u><\/a>: a Scala kernel for Jupyter. It supports Spark.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/plugins.jetbrains.com\/plugin\/12494-big-data-tools\"><u>JetBrains Big Data Tools<\/u><\/a>: a plugin for the IntelliJ\u00a0IDEA, DataGrip, and PyCharm IDEs from JetBrains. 
It lets you work with Zeppelin notebooks directly from the IDE and provides access to Spark and Kafka monitoring, to HDFS, and so on.<\/p>\n<\/li>\n<\/ul>\n<p>Personally, I prefer to use Apache Zeppelin together with JetBrains Big Data Tools.<\/p>\n<h3>The machine learning task<\/h3>\n<p>As an example, let's take the task of predicting bank customer churn.<br \/>The task description and the dataset are available <a href=\"https:\/\/www.kaggle.com\/sakshigoyal7\/credit-card-customers\">on Kaggle<\/a>.<\/p>\n<p>The dataset covers 10,000 customers and contains features such as age, salary, marital status, credit card limit, credit card category, and so on, as well as the\u00a0<code>Attrition_Flag<\/code>\u00a0variable, which indicates churn (whether the customer has stopped using the bank's services).<\/p>\n<p>We are solving a\u00a0<strong>binary classification<\/strong> problem. 
We need to build a model that predicts which group a customer belongs to.<\/p>\n<h3>ML stages<\/h3>\n<p>What stages should an ML project consist of?<\/p>\n<p>There are several methodologies. We will use\u00a0<a href=\"https:\/\/ru.wikipedia.org\/wiki\/CRISP-DM\"><u>CRISP-DM<\/u><\/a>.<\/p>\n<h4>CRISP-DM<\/h4>\n<p><strong>CRISP-DM<\/strong>\u00a0(<em>Cross-Industry Standard Process for Data Mining<\/em>) is one of the most widely used methodologies for data mining projects.<\/p>\n<figure class=\"full-width\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/f58\/735\/87b\/f5873587bffb46c898ee72ceeb8dade9.png\" alt=\"\" title=\"\" width=\"3707\" height=\"3714\" data-src=\"https:\/\/habrastorage.org\/getpro\/habr\/upload_files\/f58\/735\/87b\/f5873587bffb46c898ee72ceeb8dade9.png\"\/><figcaption><\/figcaption><\/figure>\n<p>A data mining project that follows the CRISP-DM methodology consists of the following phases:<\/p>\n<ol>\n<li>\n<p><em>Business Understanding<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Data Understanding<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Data Preparation<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Modeling<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Evaluation<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Deployment<\/em>.<\/p>\n<\/li>\n<\/ol>\n<p>We will solve our task following these steps.<\/p>\n<h3>Business understanding<\/h3>\n<p>In our case the business goals are simple. The bank is interested in retaining its customers. 
By predicting which customers belong to the group that is likely to leave, the bank can act proactively and offer them favorable terms so that they remain its customers.<\/p>\n<h3>Data understanding<\/h3>\n<p>Let's load the dataset and take a look at it.<\/p>\n<p>The data is stored in a CSV file. 
We load it in the standard Spark way into a variable\u00a0<code>raw<\/code>\u00a0of type DataFrame:<\/p>\n<pre><code>\/\/ Read the CSV with a header row and let Spark infer the column types\nval raw = spark\n        .read\n        .option(\"header\", \"true\")\n        .option(\"inferSchema\", \"true\")\n        .csv(s\"$basePath\/data\/BankChurners.csv\")<\/code><\/pre>\n<p>The variable\u00a0<code>basePath<\/code>\u00a0contains the path to the working directory of this project.<\/p>\n<p>The dataset description says: \u201cPLEASE IGNORE THE LAST 2 COLUMNS (NAIVE BAYES CLAS\u2026)\u201d. 
The first column contains a unique customer identifier, which is of no use for building the model.<\/p>\n<p>Let's prepare the list of columns to exclude from the loaded dataset: the first column and the last two. 
We take the list of columns from the DataFrame, select the last two elements, and append the first one.<\/p>\n<pre><code>\/\/ All column names of the loaded DataFrame\nval columns: Array[String] = raw.columns\nval columnsLen: Int = columns.length\n\/\/ The last two columns plus the first one (the customer identifier)\nval colsToDrop: Array[String] = columns.slice(columnsLen - 2, columnsLen) :+ columns.head<\/code><\/pre>\n<p>The variable\u00a0<code>colsToDrop<\/code>\u00a0is an array of the names of the columns to exclude from the loaded dataset.<\/p>\n<p>To remove columns from a DataFrame, use the\u00a0<code>drop<\/code> method, which takes one or more column names as variable-length arguments. 
To pass an array as the arguments of such a method, Scala uses the\u00a0<code>array: _*<\/code> construct.<\/p>\n<pre><code>val df = raw.drop(colsToDrop: _*)<\/code><\/pre>\n<p>So the variable\u00a0<code>df<\/code>\u00a0of type DataFrame contains the original dataset without the first column and the last two. It is useful to look at the first few records of this dataset:<\/p>\n<pre><code>df.show(5, truncate = false)<\/code><\/pre>\n<pre><code>+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+\n|Attrition_Flag   
|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio| +-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+ |Existing Customer|45          |M     |3              |High School    |Married       |$60K - $80K    |Blue         |39            |5                       |1                     |3                    |12691.0     |777                |11914.0        |1.335               |1144           |42            |1.625              |0.061                | |Existing Customer|49          |F     |5              |Graduate       |Single        |Less than $40K |Blue         |44            |6                       |1                     |2                    |8256.0      |864                |7392.0         |1.541               |1291           |33            |3.714              |0.105                | |Existing Customer|51          |M     |3              |Graduate       |Married       |$80K - $120K   |Blue         |36            |4                       |1                     |0                    |3418.0      |0                  |3418.0         |2.594               |1887           |20            |2.333              |0.0                  | |Existing Customer|40          |F     |4              |High School    |Unknown       |Less than $40K |Blue         |34            |3                       |4                     |1                    |3313.0      |2517               |796.0          |1.405               |1171           |20    
        |2.333              |0.76                 | |Existing Customer|40          |M     |3              |Uneducated     |Married       |$60K - $80K    |Blue         |21            |5                       |1                     |0                    |4716.0      |0                  |4716.0         |2.175               |816            |28            |2.5                |0.0                  | +-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+ only showing top 5 rows<\/code><\/pre>\n<h4>\u041e\u043f\u0440\u0435\u0434\u0435\u043b\u044f\u0435\u043c \u0442\u0438\u043f\u044b \u043a\u043e\u043b\u043e\u043d\u043e\u043a<\/h4>\n<p>\u0414\u043b\u044f \u043f\u043e\u043d\u0438\u043c\u0430\u043d\u0438\u044f \u0434\u0430\u043d\u043d\u044b\u0445 \u043f\u043e\u043b\u0435\u0437\u043d\u043e \u0443\u0437\u043d\u0430\u0442\u044c \u043a\u043e\u0433\u043e \u0442\u0438\u043f\u0430 \u043a\u043e\u043b\u043e\u043d\u043a\u0438 \u0435\u0441\u0442\u044c \u0432 \u043d\u0430\u0431\u043e\u0440\u0435 \u0434\u0430\u043d\u043d\u044b\u0445.<\/p>\n<p>\u0427\u0430\u0449\u0435 \u0432\u0441\u0435\u0433\u043e \u0434\u043b\u044f \u0432\u044b\u0432\u043e\u0434\u0430 \u0441\u0445\u0435\u043c\u044b DataFrame \u0438\u0441\u043f\u043e\u043b\u044c\u0437\u0443\u0435\u0442\u0441\u044f \u043c\u0435\u0442\u043e\u0434\u00a0<code>printSchema<\/code>:<\/p>\n<pre><code>df.printSchema<\/code><\/pre>\n<pre><code>root  |-- Attrition_Flag: string (nullable = true)  |-- Customer_Age: integer (nullable = true)  |-- Gender: string (nullable = true)  |-- Dependent_count: integer (nullable = true)  |-- Education_Level: string (nullable = true)  |-- Marital_Status: string (nullable = true)  |-- Income_Category: string (nullable = true)  |-- 
Card_Category: string (nullable = true)\n |-- Months_on_book: integer (nullable = true)\n |-- Total_Relationship_Count: integer (nullable = true)\n |-- Months_Inactive_12_mon: integer (nullable = true)\n |-- Contacts_Count_12_mon: integer (nullable = true)\n |-- Credit_Limit: double (nullable = true)\n |-- Total_Revolving_Bal: integer (nullable = true)\n |-- Avg_Open_To_Buy: double (nullable = true)\n |-- Total_Amt_Chng_Q4_Q1: double (nullable = true)\n |-- Total_Trans_Amt: integer (nullable = true)\n |-- Total_Trans_Ct: integer (nullable = true)\n |-- Total_Ct_Chng_Q4_Q1: double (nullable = true)\n |-- Avg_Utilization_Ratio: double (nullable = true)<\/code><\/pre>\n<p>This method works well for interactive use, but for processing the result programmatically it is better to use <code>dtypes<\/code>, which returns an array of (column name, type name) pairs.<\/p>\n<p>Let's print the column names and their types in a readable form:<\/p>\n<pre><code>df.dtypes.foreach { dt => println(f\"${dt._1}%25s\\t${dt._2}\") }<\/code><\/pre>\n<pre><code>           Attrition_Flag\tStringType\n             Customer_Age\tIntegerType\n                   Gender\tStringType\n          Dependent_count\tIntegerType\n          Education_Level\tStringType\n           Marital_Status\tStringType\n          Income_Category\tStringType\n            Card_Category\tStringType\n           Months_on_book\tIntegerType\n Total_Relationship_Count\tIntegerType\n   Months_Inactive_12_mon\tIntegerType\n    Contacts_Count_12_mon\tIntegerType\n             Credit_Limit\tDoubleType\n      Total_Revolving_Bal\tIntegerType\n          Avg_Open_To_Buy\tDoubleType\n     Total_Amt_Chng_Q4_Q1\tDoubleType\n          Total_Trans_Amt\tIntegerType\n           Total_Trans_Ct\tIntegerType\n      Total_Ct_Chng_Q4_Q1\tDoubleType\n    Avg_Utilization_Ratio\tDoubleType<\/code><\/pre>\n<p>And let's see how many columns there are of each type:<\/p>\n<pre><code>df.dtypes.groupBy(_._2).mapValues(_.length).foreach(println)<\/code><\/pre>\n<pre><code>(DoubleType,5)\n(StringType,6)\n(IntegerType,9)<\/code><\/pre>\n<h4>Checking the numeric columns<\/h4>\n<p>Let's select the numeric columns and apply the <code>summary<\/code> method to them. This method computes statistics such as:<\/p>\n<ul>\n<li>\n<p>count<\/p>\n<\/li>\n<li>\n<p>mean<\/p>\n<\/li>\n<li>\n<p>stddev<\/p>\n<\/li>\n<li>\n<p>min<\/p>\n<\/li>\n<li>\n<p>max<\/p>\n<\/li>\n<li>\n<p>arbitrary approximate percentiles specified as a percentage (e.g. 
75%)<\/p>\n<\/li>\n<\/ul>\n<pre><code>val numericColumns: Array[String] = df.dtypes.filter(!_._2.equals(\"StringType\")).map(_._1)\ndf.select(numericColumns.map(col): _*).summary().show<\/code><\/pre>\n<pre><code>+-------+-----------------+------------------+------------------+------------------------+----------------------+---------------------+-----------------+-------------------+-----------------+--------------------+-----------------+-----------------+-------------------+---------------------+\n|summary|     Customer_Age|   Dependent_count|    Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|     Credit_Limit|Total_Revolving_Bal|  Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|  Total_Trans_Amt|   Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|\n+-------+-----------------+------------------+------------------+------------------------+----------------------+---------------------+-----------------+-------------------+-----------------+--------------------+-----------------+-----------------+-------------------+---------------------+\n|  count|            10127|             10127|             10127|                   10127|                 10127|                10127|            10127|              10127|            10127|               10127|            10127|            10127|              10127|                10127|\n|   mean|46.32596030413745|2.3462032191172115|35.928409203120374|      3.8125802310654686|    2.3411671768539546|   2.4553174681544387|8631.953698034848| 1162.8140614199665|7469.139636614887|  0.7599406536980376|4404.086303939963|64.85869457884863| 0.7122223758269962|   0.2748935518909845|\n| stddev|8.016814032549046|  1.29890834890379|  7.98641633087208|        1.55440786533883|    1.0106223994182844|   1.1062251426359249|9088.776650223148|  814.9873352357533|9090.685323679114|  0.2192067692307027|3397.129253557085|23.47257044923301|0.23808609133294137|  0.27569146925238736|\n|    min|               26|                 0|                13|                       1|                     0|                    0|           1438.3|                  0|              3.0|                 0.0|              510|               10|                0.0|                  0.0|\n|    25%|               41|                 1|                31|                       3|                     2|                    2|           2555.0|                357|           1322.0|               0.631|             2155|               45|              0.581|                0.022|\n|    50%|               46|                 2|                36|                       4|                     2|                    2|           4549.0|               1276|           3472.0|               0.736|             3899|               67|              0.702|                0.175|\n|    75%|               52|                 3|                40|                       5|                     3|                    3|          11067.0|               1784|           9857.0|               0.859|             4741|               81|              0.818|                0.503|\n|    max|               73|                 5|                56|                       6|                     6|                    6|          34516.0|               2517|          34516.0|               3.397|            18484|              139|              3.714|                0.999|\n+-------+-----------------+------------------+------------------+------------------------+----------------------+---------------------+-----------------+-------------------+-----------------+--------------------+-----------------+-----------------+-------------------+---------------------+<\/code><\/pre>\n<p>The output shows that the data has no missing values (every column has a count of 10127) and no obvious outliers.<\/p>\n<p>Now let's look at the values of the <code>Customer_Age<\/code> column:<\/p>\n<pre><code>df.groupBy($\"Customer_Age\").count().show(100)<\/code><\/pre>\n<p>JetBrains Big Data Tools can render the output as charts.<\/p>\n<figure class=\"full-width\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/557\/971\/b8d\/557971b8dc11fc597def0ab67a32eed1.png\" alt=\"\" title=\"\" width=\"1024\" height=\"226\" data-src=\"https:\/\/habrastorage.org\/getpro\/habr\/upload_files\/557\/971\/b8d\/557971b8dc11fc597def0ab67a32eed1.png\"\/><figcaption><\/figcaption><\/figure>\n<p>The chart shows that the values of the <code>Customer_Age<\/code> column are approximately normally distributed.<\/p>\n<h4>Target column<\/h4>\n<p>The <code>Attrition_Flag<\/code> column contains the churn indicator as a text description. 
For modeling, it has to be converted to a numeric form. So we add a new column, <code>target<\/code>, which equals 0 when <code>Attrition_Flag<\/code> is \u201cExisting Customer\u201d and 1 otherwise.<\/p>\n<pre><code>val dft = df.withColumn(\"target\", when($\"Attrition_Flag\" === \"Existing Customer\", 0).otherwise(1))<\/code><\/pre>\n<p><code>dft<\/code> is a new DataFrame with the target column <code>target<\/code>.<\/p>\n<h4>Checking class balance<\/h4>\n<p>The next step is to check whether the classes in the dataset are balanced.<\/p>\n<p>We are solving a binary classification problem, so we have two classes. Let's check the number of records in each class.<\/p>\n<pre><code>dft.groupBy(\"target\").count.show<\/code><\/pre>\n<pre><code>+------+-----+\n|target|count|\n+------+-----+\n|     1| 1627|\n|     0| 8500|\n+------+-----+<\/code><\/pre>\n<p>The classes are imbalanced: 1627 records in class 1 against 8500 in class 0. There are several ways to deal with imbalanced data. The most common are <code>undersampling<\/code>, which reduces the number of records in the larger class, and <code>oversampling<\/code>, which increases the number of records in the smaller class.<\/p>\n<p>We do not have much data, so we will use 
oversampling.<\/p>\n<h4>Oversampling<\/h4>\n<p>Let's put the records of each class into separate variables and store the number of records in each class.<\/p>\n<pre><code>val df1 = dft.filter($\"target\" === 1)\nval df0 = dft.filter($\"target\" === 0)\n\nval df1count = df1.count\nval df0count = df0.count<\/code><\/pre>\n<p>We need to increase the number of records in <em>df1<\/em> by a factor of <code>df0count \/ df1count<\/code>:<\/p>\n<pre><code>val df1Over = df1\n        .withColumn(\"dummy\", explode(lit((1 to (df0count \/ df1count).toInt).toArray)))\n        .drop(\"dummy\")<\/code><\/pre>\n<p>Let's look at this in more detail.<\/p>\n<p>The expression <code>(1 to (df0count \/ df1count).toInt).toArray<\/code> creates an array with the values from 1 to <code>(df0count \/ df1count)<\/code>. Both counts are of type Long, so this is integer division: 8500 \/ 1627 gives 5.<\/p>\n<pre><code>(1 to (df0count \/ df1count).toInt).toArray\nres77: Array[Int] = Array(1, 2, 3, 4, 
5)<\/code><\/pre>\n<p>The <code>lit<\/code> function creates a column with a constant (literal) value. We add a column named <code>dummy<\/code> whose value is the array:<\/p>\n<pre><code>df1\n        .withColumn(\"dummy\", lit((1 to (df0count \/ df1count).toInt).toArray))\n        .select(\"Attrition_Flag\", \"Customer_Age\", \"dummy\")\n        .show(10)<\/code><\/pre>\n<pre><code>+-----------------+------------+---------------+\n|   Attrition_Flag|Customer_Age|          dummy|\n+-----------------+------------+---------------+\n|Attrited Customer|          62|[1, 2, 3, 4, 5]|\n|Attrited Customer|          66|[1, 2, 3, 4, 5]|\n|Attrited Customer|          54|[1, 2, 3, 4, 5]|\n|Attrited Customer|          56|[1, 2, 3, 4, 5]|\n|Attrited Customer|          48|[1, 2, 3, 4, 5]|\n|Attrited Customer|          55|[1, 2, 3, 4, 5]|\n|Attrited Customer|          47|[1, 2, 3, 4, 5]|\n|Attrited Customer|          53|[1, 2, 3, 4, 5]|\n|Attrited Customer|          48|[1, 2, 3, 4, 5]|\n|Attrited Customer|          59|[1, 2, 3, 4, 5]|\n+-----------------+------------+---------------+\nonly showing top 10 rows<\/code><\/pre>\n<p>The <code>explode<\/code> function creates a new row for each element of the array:<\/p>\n<pre><code>df1\n        
.withColumn(\"dummy\", explode(lit((1 to (df0count \/ df1count).toInt).toArray)))\n        .select(\"Attrition_Flag\", \"Customer_Age\", \"dummy\")\n        .show(10)<\/code><\/pre>\n<pre><code>+-----------------+------------+-----+\n|   Attrition_Flag|Customer_Age|dummy|\n+-----------------+------------+-----+\n|Attrited Customer|          62|    1|\n|Attrited Customer|          62|    2|\n|Attrited Customer|          62|    3|\n|Attrited Customer|          62|    4|\n|Attrited Customer|          62|    5|\n|Attrited Customer|          66|    1|\n|Attrited Customer|          66|    2|\n|Attrited Customer|          66|    3|\n|Attrited Customer|          66|    4|\n|Attrited Customer|          66|    5|\n+-----------------+------------+-----+\nonly showing top 10 rows<\/code><\/pre>\n<p>So <em>df1Over<\/em> is a dataset containing the records of class <code>target = 1<\/code>, enlarged by a factor of <code>df0count \/ df1count<\/code>.<\/p>\n<p>Let's combine this new dataset with the records of the other class and check the class balance of the resulting dataset:<\/p>\n<pre><code>val data = df0.unionAll(df1Over)\ndata.groupBy(\"target\").count.show<\/code><\/pre>\n<pre><code>+------+-----+\n|target|count|\n+------+-----+\n|     1| 8135|\n|     0| 8500|\n
+------+-----+<\/code><\/pre>\n<p>The DataFrame <em>data<\/em> is the balanced dataset that we will work with from here on.<\/p>\n<h3>Data preparation (working with features)<\/h3>\n<p>For the data preparation stage, Spark ML provides the following groups of algorithms:<\/p>\n<ul>\n<li>\n<p><strong>Extraction<\/strong>: extracting features from \u201craw\u201d data;<\/p>\n<\/li>\n<li>\n<p><strong>Transformation<\/strong>: scaling, converting, or modifying features;<\/p>\n<\/li>\n<li>\n<p><strong>Selection<\/strong>: selecting a subset from a larger set of features;<\/p>\n<\/li>\n<li>\n<p><strong>Locality Sensitive Hashing (LSH)<\/strong>: this class of algorithms combines aspects of feature transformation with other algorithms.<\/p>\n<\/li>\n<\/ul>\n<p>They all work in a similar way:<\/p>\n<ul>\n<li>\n<p>Create a transformer object with the required parameters;<\/p>\n<\/li>\n<li>\n<p>Apply this object to the source dataset;<\/p>\n<\/li>\n<li>\n<p>Get a new dataset to continue working with.<\/p>\n<\/li>\n<\/ul>\n<p>Let's move on to working with the features of our dataset.<\/p>\n<h4>Checking correlations between numeric features<\/h4>\n<p>We should check how the numeric features correlate with one another and exclude highly correlated features.<\/p>\n<p>Let's build the list of all pairs of numeric features:<\/p>\n<pre><code>val numericColumnsPairs = numericColumns.flatMap(f1 => numericColumns.map(f2 => (f1, f2)))<\/code><\/pre>\n<p>The variable <code>numericColumns<\/code> is an array of the names of the columns with numeric value types (integer or floating point).<\/p>\n<p>The list of all pairs can also be obtained this way:<\/p>\n<pre><code>for {\n  x &lt;- numericColumns\n  y &lt;- numericColumns\n} yield (x, y)<\/code><\/pre>\n<p>In fact, these are two ways of writing the same operation: the compiler desugars the for comprehension into <code>flatMap<\/code> and <code>map<\/code> calls.<\/p>\n<p>There are two ways to compute correlation in Spark:<\/p>\n<ul>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/sql\/DataFrameStatFunctions.html\"><u>DataFrameStatFunctions<\/u><\/a>: statistical functions for DataFrames;<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/stat\/Correlation%24.html\"><u>Correlation<\/u><\/a>: the API for correlation functions in MLlib.<\/p>\n<\/li>\n<\/ul>\n<p>Let's check the correlation of our numeric features using both approaches.<\/p>\n<h5>OPTION 1: DATAFRAMESTATFUNCTIONS<\/h5>\n<p>We take the list of all pairs of numeric features, remove the pairs made of two identical names, order the names within each pair lexicographically, and keep only the unique pairs:<\/p>\n<pre><code>val pairs = numericColumnsPairs\n        .filter { p => !p._1.equals(p._2) }\n        .map { p => if (p._1 &lt; p._2) (p._1, p._2) else (p._2, p._1) }\n        .distinct<\/code><\/pre>\n<p>For each pair, we apply the statistical correlation function to the balanced dataset and keep the pairs whose correlation is greater than 0.6:<\/p>\n<pre><code>val corr = pairs\n        .map { p => (p._1, p._2, data.stat.corr(p._1, p._2)) }\n        .filter(_._3 > 0.6)<\/code><\/pre>\n<p>Let's print the result in a readable form:<\/p>\n<pre><code>corr.sortBy(_._3).reverse.foreach { c => println(f\"${c._1}%25s${c._2}%25s\\t${c._3}\") }\n          Avg_Open_To_Buy             Credit_Limit\t0.9952040726156253\n          Total_Trans_Amt           Total_Trans_Ct\t0.8053901681243808\n             Customer_Age           Months_on_book\t0.7805047706891142\n    Avg_Utilization_Ratio      Total_Revolving_Bal\t0.6946855441968229&lt;\/code>&lt;\/pre>&lt;h4 style=\"overflow-wrap: break-word; border: 0px; font-family: LeagueGothicRegular, Arial, sans-serif; font-size: 2.4rem; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline; clear: both; line-height: 2; letter-spacing: 1px; text-transform: uppercase;\">OPTION 2: CORRELATION&lt;\/h4>&lt;p style=\"overflow-wrap: break-word; border: 0px; font-family: inherit; font-size: 16px; font-style: inherit; font-weight: inherit; margin: 0px 0px 1.5em; outline: 0px; padding: 0px; vertical-align: baseline;\">To use the second approach, all numeric features have to be assembled into a single column of type Vector. 
This is done with the transformer&lt;span>\u00a0&lt;\/span>&lt;a rel=\"noreferrer noopener\" href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#vectorassembler\" target=\"_blank\" style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline; color: rgb(255, 255, 255); text-decoration: underline;\">VectorAssembler&lt;\/a>. Applying VectorAssembler to our dataset, we get a new dataset&lt;span>\u00a0&lt;\/span>&lt;em style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: italic; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">numeric&lt;\/em>&lt;span>\u00a0&lt;\/span>with a column&lt;span>\u00a0&lt;\/span>&lt;em style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: italic; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">features&lt;\/em>, which holds a vector of the numeric features.&lt;\/p>&lt;p style=\"overflow-wrap: break-word; border: 0px; font-family: inherit; font-size: 16px; font-style: inherit; font-weight: inherit; margin: 0px 0px 1.5em; outline: 0px; padding: 0px; vertical-align: baseline;\">Applying 
the&lt;span>\u00a0&lt;\/span>&lt;em style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: italic; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">corr&lt;\/em>&lt;span>\u00a0&lt;\/span>method of the&lt;span>\u00a0&lt;\/span>&lt;em style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: italic; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">Correlation&lt;\/em>&lt;span>\u00a0&lt;\/span>object to the new dataset&lt;span>\u00a0&lt;\/span>&lt;em style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: italic; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">numeric&lt;\/em>, we can obtain the correlation matrix of the numeric features:&lt;\/p>&lt;div class=\"wp-block-syntaxhighlighter-code \" style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">&lt;div style=\"border: 0px; font-family: inherit; font-size: 16px; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;\">&lt;div id=\"highlighter_258867\" class=\"syntaxhighlighter  scala\" style=\"border: 0px; font-family: inherit; font-size: 1em !important; font-style: inherit; font-weight: inherit; margin: 1em 0px !important; outline: 0px; padding: 0px; vertical-align: baseline; width: 690px; position: relative !important; overflow: auto hidden !important; background-color: white !important;\">&lt;table 
border=\"0\" cellpadding=\"0\" cellspacing=\"0\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; border-collapse: separate; border-spacing: 0px; width: 690px; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; table-layout: auto !important;\">&lt;tbody style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important;\">&lt;tr style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; 
border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important;\">&lt;td class=\"code\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; text-align: left !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; width: 645.406px; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important;\">&lt;div class=\"container\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; line-height: 
1.1em !important; overflow: visible !important; position: relative !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important;\">&lt;div class=\"line number1 index0 alt2\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">import&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 
0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">org.apache.spark.ml.feature.VectorAssembler&lt;\/code>&lt;\/div>&lt;div class=\"line number2 index1 alt1\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold 
!important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">import&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">org.apache.spark.ml.stat.Correlation&lt;\/code>&lt;\/div>&lt;div class=\"line number3 index2 alt2\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; 
background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">import&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; 
box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">org.apache.spark.ml.linalg.Matrix&lt;\/code>&lt;\/div>&lt;div class=\"line number4 index3 alt1\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">import&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; 
font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">org.apache.spark.sql.Row&lt;\/code>&lt;\/div>&lt;div class=\"line number5 index4 alt2\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">\u00a0&lt;\/div>&lt;div class=\"line number6 index5 alt1\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal 
!important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">val&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; 
float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">numericAssembler &lt;\/code>&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">=&lt;\/code> &lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; 
direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">new&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">VectorAssembler()&lt;\/code>&lt;\/div>&lt;div class=\"line number7 index6 alt2\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala spaces\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans 
Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important;\">\u00a0\u00a0&lt;\/code>&lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">.setInputCols(numericColumns)&lt;\/code>&lt;\/div>&lt;div class=\"line number8 index7 alt1\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; 
outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala spaces\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important;\">\u00a0\u00a0&lt;\/code>&lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible 
!important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">.setOutputCol(&lt;\/code>&lt;code class=\"scala string\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: blue !important;\">\"features\"&lt;\/code>&lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline 
!important; color: black !important;\">)&lt;\/code>&lt;\/div>&lt;div class=\"line number9 index8 alt2\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">\u00a0&lt;\/div>&lt;div class=\"line number10 index9 alt1\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; 
font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">val&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">numeric &lt;\/code>&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em 
!important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">=&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">numericAssembler.transform(data)&lt;\/code>&lt;\/div>&lt;div class=\"line number11 index10 alt2\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px 1em !important; vertical-align: baseline !important; border-radius: 0px !important; background: none white !important; inset: auto !important; float: none !important; height: auto !important; line-height: 1.1em !important; overflow: visible !important; position: static 
!important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; white-space: pre !important;\">&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">val&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">Row(matrix&lt;\/code>&lt;code 
class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">:&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">Matrix) &lt;\/code>&lt;code class=\"scala keyword\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em 
!important; font-style: normal !important; font-weight: bold !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: rgb(0, 102, 153) !important;\">=&lt;\/code> &lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">Correlation.corr(numeric, &lt;\/code>&lt;code class=\"scala string\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; 
font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: blue !important;\">\"features\"&lt;\/code>&lt;code class=\"scala plain\" style=\"border: 0px !important; font-family: Monaco, Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace !important; font-size: 1em !important; font-style: normal !important; font-weight: normal !important; margin: 0px !important; outline: 0px !important; padding: 0px !important; vertical-align: baseline !important; font-variant: normal; font-stretch: normal; line-height: 1.1em !important; border-radius: 0px !important; background: none !important; inset: auto !important; float: none !important; height: auto !important; overflow: visible !important; position: static !important; text-align: left !important; width: auto !important; box-sizing: content-box !important; direction: ltr !important; box-shadow: none !important; display: inline !important; color: black !important;\">).head&lt;\/code>&lt;\/div>&lt;\/div>&lt;\/td>&lt;\/tr>&lt;\/tbody>&lt;\/table>&lt;\/div>&lt;\/div>&lt;\/div><\/code><\/pre>\n<p>The variable <em>matrix<\/em> is the correlation matrix of the numeric features:<\/p>\n<pre><code>matrix: org.apache.spark.ml.linalg.Matrix =\n1.0                    -0.13575515707704905   ... (14 total)\n-0.13575515707704905   1.0                    ...\n0.780504770689084      -0.11728062823959522   ...\n-0.026525310066416643  -0.032664177863511015  ...\n0.13116552936201348    -0.0106575011505989...<\/code><\/pre>\n<p>Now we match the correlation matrix against the names of the numeric features, drop the pairs of identical names, keep the pairs with a correlation above 0.6, order the names within each pair lexicographically, and keep only the unique pairs:<\/p>\n<pre><code>val corr2 = matrix.toArray\n  .zip(numericColumnsPairs)\n  .map(cnn => (cnn._2._1, cnn._2._2, cnn._1))\n  \/\/ the diagonal (a feature paired with itself) has correlation 1.0\n  .filter(_._3 &lt; 1.0)\n  .filter(_._3 > 0.6)\n  \/\/ order the names inside each pair so that (a, b) and (b, a) coincide\n  .map { p => if (p._1 &lt; p._2) (p._1, p._2, p._3) else (p._2, p._1, p._3) }\n  .distinct<\/code><\/pre>\n<p>Print the result in a readable form:<\/p>\n<pre><code>corr2.sortBy(_._3).reverse.foreach { c => println(f\"${c._1}%25s${c._2}%25s\\t${c._3}\") }\n\n          Avg_Open_To_Buy             Credit_Limit\t0.9952040726156179\n          Total_Trans_Amt           Total_Trans_Ct\t0.8053901681243786\n             Customer_Age           Months_on_book\t0.780504770689084\n    Avg_Utilization_Ratio      Total_Revolving_Bal\t0.6946855441968222<\/code><\/pre>\n<p>As we can see, the two methods produce the same result.<\/p>\n<p>To check this, we represent the results as sets and look at their intersection:<\/p>\n<pre><code>corr.toSet.intersect(corr2.toSet)\n\nres84: scala.collection.immutable.Set[(String, String, Double)] = Set()<\/code><\/pre>\n<p>The intersection is empty. The sets hold <code>(String, String, Double)<\/code> triples, so this only says that no triple matches exactly, coefficient included. Since the feature pairs in the two printed lists are the same, the two methods differ at most in the exact <code>Double<\/code> values, and an exact set comparison is not a reliable equivalence check here.<\/p>\n<p>We collect the numeric features with low correlation into the variable <code>numericColumnsFinal<\/code>:<\/p>\n<pre><code>val numericColumnsFinal = 
numericColumns.diff(corr.map(_._2))<\/code><\/pre>\n<h3>Categorical features<\/h3>\n<p>Now let's turn to the categorical features.<\/p>\n<p><strong>A categorical feature<\/strong> is a feature whose values indicate which category an object belongs to. 
The values of a categorical feature form a set of discrete values.<\/p>\n<p>However, the vast majority of classification and regression methods are formulated in terms of metric spaces, that is, they assume that the data are represented as real-valued vectors of the same dimension.<\/p>\n<p>So, to use categorical features, they have to be encoded, i.e. converted into numeric form. Instead of a single categorical variable, several new ones are created, one per unique value of the categorical variable. The new variables take the values 1.0 and 0.0 depending on the value of the categorical variable.<\/p>\n<p>Spark ML encodes categorical variables with the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#onehotencoder\"><u>OneHotEncoder<\/u><\/a> transformer. Note that by default it drops the last category (<code>dropLast = true<\/code>), so the encoded vector has one element fewer than there are categories.<\/p>\n<p>But before applying it to features that contain strings, those features have to be indexed. The\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#stringindexer\"><u>StringIndexer<\/u><\/a> transformer is used for that.<\/p>\n<p>In our dataset, only the columns that contain strings are categorical.<\/p>\n<p>Sometimes a feature such as age is treated as categorical. But, as we have seen, in our case age is almost normally distributed. 
If we had a variable with age groups instead, it would have to be treated as categorical.<\/p>\n<h3>Indexing the string columns<\/h3>\n<p>Let's list all string columns except\u00a0<code>Attrition_Flag<\/code>, which is the target, and index them, creating new columns whose names are the original names with\u00a0<code>_Indexed<\/code> appended.<\/p>\n<pre><code>import org.apache.spark.ml.feature.StringIndexer\n\n\/\/ all string columns except the target\nval stringColumns = data\n        .dtypes\n        .filter(_._2.equals(\"StringType\"))\n        .map(_._1)\n        .filter(!_.equals(\"Attrition_Flag\"))\n\nval stringColumnsIndexed = stringColumns.map(_ + \"_Indexed\")\n\nval indexer = new StringIndexer()\n        .setInputCols(stringColumns)\n        .setOutputCols(stringColumnsIndexed)\n\nval indexed = indexer.fit(data).transform(data)<\/code><\/pre>\n<p><code>indexed<\/code> is a new dataset with the string columns indexed.<\/p>\n<h3>Encoding the categorical features<\/h3>\n<p>Now we can move on to encoding the categorical features.<\/p>\n<p>The encoded features will be stored in new columns whose names have\u00a0<code>_Coded<\/code> appended.<\/p>\n<pre><code>import org.apache.spark.ml.feature.OneHotEncoder\n\nval catColumns = stringColumnsIndexed.map(_ + \"_Coded\")\n\nval encoder = new OneHotEncoder()\n        .setInputCols(stringColumnsIndexed)\n        .setOutputCols(catColumns)\n\nval encoded = encoder.fit(indexed).transform(indexed)<\/code><\/pre>\n<p><code>encoded<\/code> is a new dataset with the encoded categorical features.<\/p>\n<h3>Assembling the features into a vector<\/h3>\n<p>After processing the categorical features, all features have to be assembled into a single vector.<\/p>\n<p>This is done with the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#vectorassembler\"><u>VectorAssembler<\/u><\/a> transformer, which we already met when computing the correlation of the numeric features in the second way.<\/p>\n<p>Let's apply it to the list of low-correlation numeric features combined with the list of encoded categorical variables.<\/p>\n<pre><code>import org.apache.spark.ml.feature.VectorAssembler\n\nval featureColumns = numericColumnsFinal ++ catColumns\n\nval assembler = new VectorAssembler()\n  .setInputCols(featureColumns)\n  .setOutputCol(\"features\")\n\nval assembled = assembler.transform(encoded)<\/code><\/pre>\n<p><code>assembled<\/code> is a dataset that contains the column\u00a0<code>features<\/code>, whose values are feature vectors.<\/p>\n<h3>Normalization<\/h3>\n<p>Let's look at the feature vector we ended up with.<\/p>\n<pre><code>assembled.select(\"features\").show(5, truncate = false)<\/code><\/pre>\n<pre><code>+--------------------------------------------------------------------------------------------------------------------+\n|features                                                                                                            |\n
+--------------------------------------------------------------------------------------------------------------------+\n|(28,[0,1,2,3,4,5,6,7,8,9,12,17,23,25],[45.0,3.0,5.0,1.0,3.0,11914.0,1.335,1144.0,1.625,0.061,1.0,1.0,1.0,1.0])      |\n|(28,[0,1,2,3,4,5,6,7,8,9,10,11,18,20,25],[49.0,5.0,6.0,1.0,2.0,7392.0,1.541,1291.0,3.714,0.105,1.0,1.0,1.0,1.0,1.0])|\n|(28,[0,1,2,3,5,6,7,8,11,17,22,25],[51.0,3.0,4.0,1.0,3418.0,2.594,1887.0,2.333,1.0,1.0,1.0,1.0])                     |\n|(28,[0,1,2,3,4,5,6,7,8,9,10,12,19,20,25],[40.0,4.0,3.0,4.0,1.0,796.0,1.405,1171.0,2.333,0.76,1.0,1.0,1.0,1.0,1.0])  |\n|(28,[0,1,2,3,5,6,7,8,14,17,23,25],[40.0,3.0,5.0,1.0,4716.0,2.175,816.0,2.5,1.0,1.0,1.0,1.0])                        |\n+--------------------------------------------------------------------------------------------------------------------+\nonly showing top 5 rows<\/code><\/pre>\n<p>The feature values differ greatly in scale.<\/p>\n<p>It is recommended to apply\u00a0<strong>standardization<\/strong>\u00a0(removing the mean and scaling to unit variance) or\u00a0<strong>normalization<\/strong>\u00a0(scaling individual samples to unit norm) to the dataset.<\/p>\n<p>Spark ML has several transformers that perform such transformations:<\/p>\n<ul>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#normalizer\"><u>Normalizer<\/u><\/a>\u00a0\u2013 normalizes each vector to unit norm;<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#standardscaler\"><u>StandardScaler<\/u><\/a>\u00a0\u2013 scales each feature to unit standard deviation and\/or zero mean;<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#robustscaler\"><u>RobustScaler<\/u><\/a>\u00a0\u2013 removes the median and scales the data according to a given quantile range;<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#minmaxscaler\"><u>MinMaxScaler<\/u><\/a>\u00a0\u2013 rescales each feature to a given range (often [0, 1]);<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#maxabsscaler\"><u>MaxAbsScaler<\/u><\/a>\u00a0\u2013 rescales each feature to the range [-1, 1] by dividing by its maximum absolute value.<\/p>\n<\/li>\n<\/ul>\n<p>Let's apply MinMaxScaler to normalize the dataset:<\/p>\n<pre><code>import org.apache.spark.ml.feature.MinMaxScaler\n\nval scaler = new MinMaxScaler()\n  .setInputCol(\"features\")\n  .setOutputCol(\"scaledFeatures\")\n\nval scaled = scaler.fit(assembled).transform(assembled)<\/code><\/pre>\n<p><code>scaled<\/code> is a dataset with the vector of normalized features in the column\u00a0<code>scaledFeatures<\/code>:<\/p>\n<pre><code>scaled.select(\"features\", \"scaledFeatures\").show(5, truncate = false)\n+--------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|features                                                                                                            |scaledFeatures                                                                                                                                                                                                                      |\n+--------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|(28,[0,1,2,3,4,5,6,7,8,9,12,17,23,25],[45.0,3.0,5.0,1.0,3.0,11914.0,1.335,1144.0,1.625,0.061,1.0,1.0,1.0,1.0])      |(28,[0,1,2,3,4,5,6,7,8,9,12,17,23,25],[0.40425531914893614,0.6000000000000001,0.8,0.16666666666666666,0.5,0.3451163329759801,0.39299381807477185,0.03527317236007566,0.43753365643511044,0.061061061061061066,1.0,1.0,1.0,1.0])     |\n
|(28,[0,1,2,3,4,5,6,7,8,9,10,11,18,20,25],[49.0,5.0,6.0,1.0,2.0,7392.0,1.541,1291.0,3.714,0.105,1.0,1.0,1.0,1.0,1.0])|(28,[0,1,2,3,4,5,6,7,8,9,10,11,18,20,25],[0.48936170212765956,1.0,1.0,0.16666666666666666,0.3333333333333333,0.21409324022831977,0.4536355607889314,0.043451652386780906,1.0,0.10510510510510511,1.0,1.0,1.0,1.0,1.0])              |\n|(28,[0,1,2,3,5,6,7,8,11,17,22,25],[51.0,3.0,4.0,1.0,3418.0,2.594,1887.0,2.333,1.0,1.0,1.0,1.0])                     |(28,[0,1,2,3,5,6,7,8,11,17,22,25],[0.5319148936170213,0.6000000000000001,0.6000000000000001,0.16666666666666666,0.09894822240894735,0.7636149543715043,0.07661065984199399,0.6281637049003771,1.0,1.0,1.0,1.0])                     |\n|(28,[0,1,2,3,4,5,6,7,8,9,10,12,19,20,25],[40.0,4.0,3.0,4.0,1.0,796.0,1.405,1171.0,2.333,0.76,1.0,1.0,1.0,1.0,1.0])  |(28,[0,1,2,3,4,5,6,7,8,9,10,12,19,20,25],[0.2978723404255319,0.8,0.4,0.6666666666666666,0.16666666666666666,0.02297684930316113,0.41360023550191344,0.036775342160899074,0.6281637049003771,0.7607607607607608,1.0,1.0,1.0,1.0,1.0])|\n|(28,[0,1,2,3,5,6,7,8,14,17,23,25],[40.0,3.0,5.0,1.0,4716.0,2.175,816.0,2.5,1.0,1.0,1.0,1.0])                        |(28,[0,1,2,3,5,6,7,8,14,17,23,25],[0.2978723404255319,0.6000000000000001,0.8,0.16666666666666666,0.13655723930113292,0.6402708272004709,0.017024591075998664,0.6731287022078623,1.0,1.0,1.0,1.0])                                   |\n+--------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\nonly showing top 5 rows<\/code><\/pre>\n<h3>Feature selection<\/h3>\n<p>The feature vector of our dataset contains 28 features. That is not very many. Nevertheless, let's look at feature selection, the procedure of picking the most important features.<\/p>\n<p><code>UnivariateFeatureSelector<\/code> is a general-purpose transformer that picks the most important features. It works with categorical or continuous features and with categorical or continuous target variables. The score function is chosen based on the type of the features and of the target variable:<\/p>\n<table>\n<thead>\n<tr><th>featureType<\/th><th>labelType<\/th><th>score function<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>categorical<\/td><td>categorical<\/td><td>chi-squared (chi2)<\/td><\/tr>\n<tr><td>continuous<\/td><td>categorical<\/td><td>ANOVATest (f_classif)<\/td><\/tr>\n<tr><td>continuous<\/td><td>continuous<\/td><td>F-value (f_regression)<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>The following selection modes are supported:<\/p>\n<ul>\n<li>\n<p><code>numTopFeatures<\/code>\u00a0\u2013 selects a fixed number of features;<\/p>\n<\/li>\n<li>\n<p><code>percentile<\/code>\u00a0\u2013 selects by percentile, i.e. keeps the given top fraction of the features;<\/p>\n<\/li>\n<li>\n<p><code>fpr<\/code>\u00a0\u2013 selects the features whose p-value is below the threshold;<\/p>\n<\/li>\n<li>\n<p><code>fdr<\/code>\u00a0\u2013 uses the Benjamini-Hochberg procedure to select the features whose false discovery rate is below the threshold;<\/p>\n<\/li>\n<li>\n<p><code>fwe<\/code>\u00a0\u2013 selects the features whose p-value is below the threshold, with the threshold scaled by\u00a0<code>1\/numFeatures<\/code>.<\/p>\n<\/li>\n<\/ul>\n<p>Let's apply\u00a0<code>UnivariateFeatureSelector<\/code> in percentile mode with a threshold of 0.75:<\/p>\n<pre><code>import org.apache.spark.ml.feature.UnivariateFeatureSelector\n\nval selector = new UnivariateFeatureSelector()\n  .setFeatureType(\"continuous\")\n  .setLabelType(\"categorical\")\n  .setSelectionMode(\"percentile\")\n  .setSelectionThreshold(0.75)\n  .setFeaturesCol(\"scaledFeatures\")\n  .setLabelCol(\"target\")\n  .setOutputCol(\"selectedFeatures\")\n\nval dataF = selector.fit(scaled).transform(scaled)<\/code><\/pre>\n<p><code>dataF<\/code> is a dataset with the vector of selected features in the column\u00a0<code>selectedFeatures<\/code>:<\/p>\n<pre><code>dataF.select(\"scaledFeatures\", \"selectedFeatures\").show(5, truncate = false)\n
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |scaledFeatures                                                                                                                                                                                                                      |selectedFeatures                                                                                                                                                                                      | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |(28,[0,1,2,3,4,5,6,7,8,9,12,17,23,25],[0.40425531914893614,0.6000000000000001,0.8,0.16666666666666666,0.5,0.3451163329759801,0.39299381807477185,0.03527317236007566,0.43753365643511044,0.061061061061061066,1.0,1.0,1.0,1.0])     |(21,[0,1,2,3,4,5,6,7,8,11,14,19],[0.40425531914893614,0.6000000000000001,0.8,0.16666666666666666,0.5,0.39299381807477185,0.03527317236007566,0.43753365643511044,0.061061061061061066,1.0,1.0,1.0])   | |(28,[0,1,2,3,4,5,6,7,8,9,10,11,18,20,25],[0.48936170212765956,1.0,1.0,0.16666666666666666,0.3333333333333333,0.21409324022831977,0.4536355607889314,0.043451652386780906,1.0,0.10510510510510511,1.0,1.0,1.0,1.0,1.0])              
|(21,[0,1,2,3,4,5,6,7,8,9,10,15,17],[0.48936170212765956,1.0,1.0,0.16666666666666666,0.3333333333333333,0.4536355607889314,0.043451652386780906,1.0,0.10510510510510511,1.0,1.0,1.0,1.0])              | |(28,[0,1,2,3,5,6,7,8,11,17,22,25],[0.5319148936170213,0.6000000000000001,0.6000000000000001,0.16666666666666666,0.09894822240894735,0.7636149543715043,0.07661065984199399,0.6281637049003771,1.0,1.0,1.0,1.0])                     |(21,[0,1,2,3,5,6,7,10,14],[0.5319148936170213,0.6000000000000001,0.6000000000000001,0.16666666666666666,0.7636149543715043,0.07661065984199399,0.6281637049003771,1.0,1.0])                           | |(28,[0,1,2,3,4,5,6,7,8,9,10,12,19,20,25],[0.2978723404255319,0.8,0.4,0.6666666666666666,0.16666666666666666,0.02297684930316113,0.41360023550191344,0.036775342160899074,0.6281637049003771,0.7607607607607608,1.0,1.0,1.0,1.0,1.0])|(21,[0,1,2,3,4,5,6,7,8,9,11,16,17],[0.2978723404255319,0.8,0.4,0.6666666666666666,0.16666666666666666,0.41360023550191344,0.036775342160899074,0.6281637049003771,0.7607607607607608,1.0,1.0,1.0,1.0])| |(28,[0,1,2,3,5,6,7,8,14,17,23,25],[0.2978723404255319,0.6000000000000001,0.8,0.16666666666666666,0.13655723930113292,0.6402708272004709,0.017024591075998664,0.6731287022078623,1.0,1.0,1.0,1.0])                                   |(21,[0,1,2,3,5,6,7,14,19],[0.2978723404255319,0.6000000000000001,0.8,0.16666666666666666,0.6402708272004709,0.017024591075998664,0.6731287022078623,1.0,1.0])                                         | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ only showing top 5 rows<\/code><\/pre>\n<p>We 
reduced the number of features from 28 to 21.<\/p>\n<p>This completes the data preparation stage, and we can move on to the next stage: modeling.<\/p>\n<h3>Modeling<\/h3>\n<p>For building models, Spark ML offers the following families of algorithms:<\/p>\n<ul>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-classification-regression.html\"><u>Classification and regression<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-clustering.html\"><u>Clustering<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-collaborative-filtering.html\"><u>Collaborative filtering<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a 
href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-frequent-pattern-mining.html\"><u>Frequent pattern mining<\/u><\/a><\/p>\n<\/li>\n<\/ul>\n<p>Let us walk through the modeling stage on our example.<\/p>\n<h4>Training and test sets<\/h4>\n<p>Before building a model, we need to split the dataset into a training set and a test set. 
For this, Spark provides the standard\u00a0<code>randomSplit<\/code> method, which takes an array of split weights.<\/p>\n<pre><code>val tt = dataF.randomSplit(Array(0.7, 0.3))\nval training = tt(0)\nval test = tt(1)<\/code><\/pre>\n<p>training is the training set with about 70% of the records, and test is the test set with, correspondingly, about 30% of them.<\/p>\n<h4>Logistic regression<\/h4>\n<p>We are solving a binary classification problem. 
We will use logistic regression, a well-proven algorithm.<\/p>\n<p>For this we use the\u00a0<code>LogisticRegression<\/code> estimator, whose main parameters are:<\/p>\n<ul>\n<li>\n<p><code>elasticNetParam<\/code>\u00a0\u2013\u00a0\u03b1, the elastic-net mixing parameter (0 gives a pure L2 penalty, 1 a pure L1 penalty)<\/p>\n<\/li>\n<li>\n<p><code>regParam<\/code>\u00a0\u2013\u00a0\u03bb, the regularization strength<\/p>\n<\/li>\n<\/ul>\n<p>To begin with, we pick these parameters arbitrarily.<\/p>\n<pre><code>import org.apache.spark.ml.classification.LogisticRegression\n\nval lr = new LogisticRegression()\n  .setMaxIter(1000)\n  .setRegParam(0.2)\n  .setElasticNetParam(0.8)\n  .setFamily(\"binomial\")\n  .setFeaturesCol(\"selectedFeatures\")\n  .setLabelCol(\"target\")\n\nval lrModel = lr.fit(training)<\/code><\/pre>\n<p><em>lrModel<\/em> is the trained model.<\/p>\n<h4>TRAINING SUMMARY<\/h4>\n<p>We 
can obtain basic information about the trained model. This is done through the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/classification\/BinaryLogisticRegressionTrainingSummary.html\"><u>BinaryLogisticRegressionTrainingSummary<\/u><\/a> object:<\/p>\n<pre><code>val trainingSummary = lrModel.binarySummary\n\nprintln(s\"accuracy: ${trainingSummary.accuracy}\")\nprintln(s\"areaUnderROC: ${trainingSummary.areaUnderROC}\")<\/code><\/pre>\n<pre><code>accuracy: 0.6986124278203912\nareaUnderROC: 0.7455570759572957<\/code><\/pre>\n<p>We got an AUROC of about 0.75 on the training set, which is not bad in principle.<\/p>\n<h2>Evaluation<\/h2>\n<h3>Checking the model on the test set<\/h3>\n<p>Let us apply the trained model to the test set and look at the result.<\/p>\n<pre><code>val predicted = lrModel.transform(test)<\/code><\/pre>\n<p>The\u00a0<code>predicted<\/code>\u00a0dataset contains 
new columns:\u00a0<code>rawPrediction<\/code>,\u00a0<code>probability<\/code>\u00a0and\u00a0<code>prediction<\/code>:<\/p>\n<pre><code>predicted.select(\"target\", \"rawPrediction\", \"probability\", \"prediction\").show(10, truncate = false)\n\n+------+----------------------------------------------+----------------------------------------+----------+\n|target|rawPrediction                                 |probability                             |prediction|\n+------+----------------------------------------------+----------------------------------------+----------+\n|0     |[0.040262722641592585,-0.040262722641592585]  |[0.5100643211022606,0.48993567889773937]|0.0       |\n|0     |[-0.009994173386193073,0.009994173386193073]  |[0.4975014774501823,0.5024985225498177] |1.0       |\n|0     |[0.18939904012242004,-0.18939904012242004]    |[0.547208721739737,0.452791278260263]   |0.0       |\n|0     |[0.057015021317521175,-0.057015021317521175]  |[0.5142498953455751,0.4857501046544249] |0.0       |\n|0     |[-0.030423805917813296,0.030423805917813296]  |[0.4923946351436886,0.5076053648563115] |1.0       |\n|0     |[-0.023886323507694818,0.023886323507694818]  |[0.49402870303387675,0.5059712969661232]|1.0       |\n|0     |[-0.05167062375069831,0.05167062375069831]    |[0.4870852173158024,0.5129147826841975] |1.0       |\n|0     |[0.0026721987834114613,-0.0026721987834114613]|[0.5006680492983275,0.49933195070167247]|0.0       |\n|0     |[-0.05085343844943349,0.05085343844943349]    |[0.48728937948478424,0.5127106205152158]|1.0       |\n|0     |[-0.026746472062121662,0.026746472062121662]  |[0.4933137805752158,0.5066862194247842] |1.0       |\n+------+----------------------------------------------+----------------------------------------+----------+\nonly showing top 10 rows<\/code><\/pre>\n<p>Ideally, the values in the 
<code>target<\/code>\u00a0and\u00a0<code>prediction<\/code>\u00a0columns should match. But, as we can see, they differ even within the first ten records.<\/p>\n<p>To evaluate the model on the test set, we can use the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/BinaryClassificationEvaluator.html\"><u>BinaryClassificationEvaluator<\/u><\/a> object, whose default metric is areaUnderROC:<\/p>\n<pre><code>import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\n\nval evaluator = new BinaryClassificationEvaluator().setLabelCol(\"target\")\n\nprintln(s\"areaUnderROC: ${evaluator.evaluate(predicted)}\\n\")<\/code><\/pre>\n<pre><code>areaUnderROC: 0.7445924078797251<\/code><\/pre>\n<p><em>The AUROC on the test set is also about 0.75.<\/em><\/p>\n<h4>CONFUSION MATRIX<\/h4>\n<p>A useful way to evaluate a model 
is the\u00a0confusion matrix.<\/p>\n<ul>\n<li>\n<p><strong>True Positive (TP)<\/strong>\u00a0\u2013 label is positive and prediction is also positive;<\/p>\n<\/li>\n<li>\n<p><strong>True Negative (TN)<\/strong>\u00a0\u2013 label is negative and prediction is also negative;<\/p>\n<\/li>\n<li>\n<p><strong>False Positive (FP)<\/strong>\u00a0\u2013 label is negative but prediction is positive;<\/p>\n<\/li>\n<li>\n<p><strong>False Negative (FN)<\/strong>\u00a0\u2013 label is positive but prediction is negative.<\/p>\n<\/li>\n<\/ul>\n<p>Spark ML has no method that computes the confusion matrix directly, but it is easy to compute by hand:<\/p>\n<pre><code>val tp = predicted.filter(($\"target\" === 1) and ($\"prediction\" === 1)).count\nval tn = predicted.filter(($\"target\" === 0) and ($\"prediction\" === 0)).count\nval fp = predicted.filter(($\"target\" === 0) and ($\"prediction\" === 1)).count\nval fn = predicted.filter(($\"target\" === 1) and ($\"prediction\" === 0)).count\n\nprintln(s\"Confusion Matrix:\\n$tp\\t$fp\\n$fn\\t$tn\\n\")<\/code><\/pre>\n<pre><code>Confusion Matrix:\n1272\t309\n1198\t2253<\/code><\/pre>\n<p>Ideally, the values on the main diagonal of the matrix (TP and TN) should be 
large, and those on the anti-diagonal (FP and FN) small.<\/p>\n<h4>ACCURACY, PRECISION, RECALL<\/h4>\n<p>Other widely used quality metrics are:<\/p>\n<ul>\n<li>\n<p><strong>Accuracy<\/strong>\u00a0(the fraction of correct predictions) = (TP + TN) \/ (TP + TN + FP + FN)<\/p>\n<\/li>\n<li>\n<p><strong>Precision<\/strong>\u00a0= TP \/ (TP + FP)<\/p>\n<\/li>\n<li>\n<p><strong>Recall<\/strong>\u00a0= TP \/ (TP + FN)<\/p>\n<\/li>\n<\/ul>\n<p>They are easy to compute from the confusion matrix:<\/p>\n<pre><code>val accuracy = (tp + tn) \/ (tp + tn + fp + fn).toDouble\nval precision = tp \/ (tp + fp).toDouble\nval recall = tp \/ (tp + fn).toDouble\n\nprintln(s\"Accuracy = $accuracy\")\nprintln(s\"Precision = $precision\")\nprintln(s\"Recall = $recall\\n\")<\/code><\/pre>\n<pre><code>Accuracy = 0.700516693163752\nPrecision = 0.8045540796963947\nRecall = 0.5149797570850202<\/code><\/pre>\n<h3>Model tuning<\/h3>\n<figure class=\"full-width\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/01d\/8aa\/f9b\/01d8aaf9b32801448e759cd08ed72dc7.png\" alt=\"\" title=\"\" 
width=\"844\" height=\"783\" data-src=\"https:\/\/habrastorage.org\/getpro\/habr\/upload_files\/01d\/8aa\/f9b\/01d8aaf9b32801448e759cd08ed72dc7.png\"\/><figcaption><\/figcaption><\/figure>\n<h3>Hyperparameter tuning<\/h3>\n<p>When training our model, we chose the regularization parameters arbitrarily. Let us now see how to find the optimal parameters for the model.<\/p>\n<p>For hyperparameter tuning (model selection), Spark ML offers two tools:\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/tuning\/CrossValidator.html\"><u>CrossValidator<\/u><\/a>\u00a0and\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/tuning\/TrainValidationSplit.html\"><u>TrainValidationSplit<\/u><\/a>.<\/p>\n<p>In both 
cases we need to provide:<\/p>\n<ul>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/Estimator.html\"><u>Estimator<\/u><\/a>\u00a0\u2013 the algorithm to tune;<\/p>\n<\/li>\n<li>\n<p>A set of parameters: the candidate values to choose from (the \u201cparameter grid\u201d);<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/Evaluator.html\"><u>Evaluator<\/u><\/a>\u00a0\u2013 the object that scores the model.<\/p>\n<\/li>\n<\/ul>\n<p>In general, hyperparameter tuning works as follows:<\/p>\n<ul>\n<li>\n<p>The dataset is split into one or more pairs of training and test sets;<\/p>\n<\/li>\n<li>\n<p>For each\u00a0(training, 
test)\u00a0pair, the parameter sets from the grid are iterated over;<\/p>\n<\/li>\n<li>\n<p>For each parameter set, the\u00a0Estimator\u00a0is used to fit a model;<\/p>\n<\/li>\n<li>\n<p>The\u00a0Evaluator\u00a0scores each model;<\/p>\n<\/li>\n<li>\n<p>The model with the best score is selected.<\/p>\n<\/li>\n<\/ul>\n<p>The following can be used as the\u00a0Evaluator:<\/p>\n<ul>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/RegressionEvaluator.html\"><u>RegressionEvaluator<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/BinaryClassificationEvaluator.html\"><u>BinaryClassificationEvaluator<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/MulticlassClassificationEvaluator.html\"><u>MulticlassClassificationEvaluator<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a 
href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/MultilabelClassificationEvaluator.html\"><u>MultilabelClassificationEvaluator<\/u><\/a><\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/evaluation\/RankingEvaluator.html\"><u>RankingEvaluator<\/u><\/a><\/p>\n<\/li>\n<\/ul>\n<p>The parameter grid is built with the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/ml\/tuning\/ParamGridBuilder.html\"><u>ParamGridBuilder<\/u><\/a> object.<\/p>\n<p><code>CrossValidator<\/code>\u00a0splits the dataset into a set of\u00a0folds; each fold in turn serves as the test set while the remaining folds are used for training. 
The model is evaluated on every combination of\u00a0folds.<\/p>\n<p><code>TrainValidationSplit<\/code>\u00a0splits the dataset into a training set and a test set once and evaluates the model on that single split.<\/p>\n<p>We will use\u00a0<code>TrainValidationSplit<\/code> to tune the hyperparameters:<\/p>\n<pre><code>import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}\n\nval paramGrid = new ParamGridBuilder()\n  .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))\n  .addGrid(lr.fitIntercept)\n  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))\n  .build()\n\nval trainValidationSplit = new TrainValidationSplit()\n  .setEstimator(lr)\n  .setEvaluator(evaluator)\n  .setEstimatorParamMaps(paramGrid)\n  .setTrainRatio(0.7)\n  .setParallelism(2)\n\nval model = trainValidationSplit.fit(dataF)<\/code><\/pre>\n<p>The best model is available in\u00a0<code>bestModel<\/code>:<\/p>\n<pre><code>model.bestModel.extractParamMap()\n\nres89: org.apache.spark.ml.param.ParamMap =\n{\n  logreg_2eef3ae8c923-aggregationDepth: 2,\n  logreg_2eef3ae8c923-elasticNetParam: 0.0,\n  logreg_2eef3ae8c923-family: binomial,\n  logreg_2eef3ae8c923-featuresCol: selectedFeatures,\n
  logreg_2eef3ae8c923-fitIntercept: true,\n  logreg_2eef3ae8c923-labelCol: target,\n  logreg_2eef3ae8c923-maxBlockSizeInMB: 0.0,\n  logreg_2eef3ae8c923-maxIter: 1000,\n  logreg_2eef3ae8c923-predictionCol: prediction,\n  logreg_2eef3ae8c923-probabilityCol: probability,\n  logreg_2eef3ae8c923-rawPredictionCol: rawPrediction,\n  logreg_2eef3ae8c923-regParam: 0.01,\n  logreg_2eef3ae8c923-standardization: true,\n  logreg_2eef3ae8c923-threshold: 0.5,\n  logreg_2eef3ae8c923-tol: 1.0E-6\n}<\/code><\/pre>\n<p>Let us keep the best model for later use:<\/p>\n<pre><code>val bestML = model.bestModel<\/code><\/pre>\n<h3>Deployment<\/h3>\n<h4>ML PIPELINES<\/h4>\n<p>What matters when deploying models? 
Error-free reproducibility.<\/p>\n<p>Let us recall all the stages of data preparation and model fitting:<\/p>\n<ol>\n<li>\n<p>Selected the numeric features (<code>numericColumnsFinal<\/code>);<\/p>\n<\/li>\n<li>\n<p>Indexed the string features (<code>indexer<\/code>);<\/p>\n<\/li>\n<li>\n<p>Encoded the categorical features (<code>encoder<\/code>);<\/p>\n<\/li>\n<li>\n<p>Assembled the features into a vector (<code>assembler<\/code>);<\/p>\n<\/li>\n<li>\n<p>Normalized the features (<code>scaler<\/code>);<\/p>\n<\/li>\n<li>\n<p>Performed feature selection (<code>selector<\/code>);<\/p>\n<\/li>\n<li>\n<p>Fitted the model (<code>bestML<\/code>).<\/p>\n<\/li>\n<\/ol>\n<p>Before applying 
the fitted model, we must apply the whole set of transformations to the dataset. When repeating the computation, it is easy to make a mistake in one of these stages or even to skip one of them.<\/p>\n<p>It would be good to build a model that includes all the necessary transformations.<\/p>\n<p><strong>ML Pipelines<\/strong>\u00a0let us combine all the transformations and algorithms into a single pipeline, or workflow:<\/p>\n<pre><code>import org.apache.spark.ml.Pipeline\n\nval pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, scaler, 
selector, bestML))<\/code><\/pre>\n<p>\u0422\u0435\u043f\u0435\u0440\u044c, \u0438\u0441\u043f\u043e\u043b\u044c\u0437\u0443\u044f Pipeline, \u043c\u044b \u043c\u043e\u0436\u0435\u043c \u043f\u043e\u0441\u0442\u0440\u043e\u0438\u0442\u044c \u043c\u043e\u0434\u0435\u043b\u044c, \u0432\u043a\u043b\u044e\u0447\u0430\u044e\u0449\u0443\u044e \u0432\u0441\u0435 \u043d\u0435\u043e\u0431\u0445\u043e\u0434\u0438\u043c\u044b\u0435 \u043f\u0440\u0435\u043e\u0431\u0440\u0430\u0437\u043e\u0432\u0430\u043d\u0438\u044f.<\/p>\n<pre><code>val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))   val pipelineModel = pipeline.fit(trainingData)<\/code><\/pre>\n<h4>ML PERSISTENCE<\/h4>\n<p>\u0427\u0442\u043e\u0431\u044b \u043f\u0435\u0440\u0435\u0438\u0441\u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u044c \u043f\u043e\u0434\u0433\u043e\u0442\u043e\u0432\u043b\u0435\u043d\u043d\u0443\u044e \u043c\u043e\u0434\u0435\u043b\u044c \u043d\u0443\u0436\u043d\u0430 \u0432\u043e\u0437\u043c\u043e\u0436\u043d\u043e\u0441\u0442\u044c \u0441\u043e\u0445\u0440\u0430\u043d\u044f\u0442\u044c \u0438 \u0437\u0430\u0433\u0440\u0443\u0436\u0430\u0442\u044c \u0438\u0445. 
This is what\u00a0<strong>ML persistence<\/strong> provides.<\/p>\n<p>Let's save the pipeline model (<code>PipelineModel<\/code>):<\/p>\n<pre><code>pipelineModel.write.overwrite().save(s\"$basePath\/pipelineModel\")<\/code><\/pre>\n<h4>SPARK ML PRODUCTION<\/h4>\n<p>A saved model can be loaded and used separately from the research project in which we prepared it.<\/p>\n<p>Let's load the dataset (we will use the very same dataset, although in practice a trained model is applied to new data), load the pipeline model, and 
apply it to the dataset:<\/p>\n<pre><code>val data = spark\n  .read\n  .option(\"header\", \"true\")\n  .option(\"inferSchema\", \"true\")\n  .csv(s\"$basePath\/data\/BankChurners.csv\")\n\nimport org.apache.spark.ml.PipelineModel\n\nval model = PipelineModel.load(s\"$basePath\/pipelineModel\")\n\nval prediction = model.transform(data)<\/code><\/pre>\n<p><code>prediction<\/code>\u00a0is a dataset that contains the original data, the data produced by the transformations, and the result of applying the model: the prediction.<\/p>\n<pre><code>prediction.show(5) 
+---------+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------+--------------+-----------------------+---------------------+-----------------------------+----------------------------+--------------------+-----------------------------+---------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+ |CLIENTNUM|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1|Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2|Marital_Status_Indexed|Income_Category_Indexed|Gender_Indexed|Education_Level_Indexed|Card_Category_Indexed|Income_Category_Indexed_Coded|Marital_Status_Indexed_Coded|Gender_Indexed_Coded|Education_Level_Indexed_Coded|Card_Category_Indexed_Coded|            features|      scaledFeatures|    selectedFeatures|       rawPrediction|         probability|prediction| 
+---------+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------+--------------+-----------------------+---------------------+-----------------------------+----------------------------+--------------------+-----------------------------+---------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+ |768805383|Existing Customer|          45|     M|              3|    High School|       Married|<\/code><\/pre>\n<h4>VERIFYING THE RESULT<\/h4>\n<p>To verify the result, let's compute the\u00a0<em>confusion matrix<\/em>:<\/p>\n<pre><code>val tp = prediction.filter(($\"Attrition_Flag\" === \"Attrited Customer\") and ($\"prediction\" === 1)).count\nval tn = prediction.filter(($\"Attrition_Flag\" === \"Existing Customer\") and ($\"prediction\" === 0)).count\nval fp = prediction.filter(($\"Attrition_Flag\" === \"Existing Customer\") and ($\"prediction\" === 1)).count\nval fn = prediction.filter(($\"Attrition_Flag\" === \"Attrited Customer\") and ($\"prediction\" === 0)).count\n\nprintln(s\"Confusion Matrix:\\n$tp\\t$fp\\n$fn\\t\\t$tn\\n\")<\/code><\/pre>\n<pre><code>Confusion Matrix:\n1199\t1893\n428\t\t6607<\/code><\/pre>\n<p>Let's also compute\u00a0<code>Accuracy<\/code>, <code>Precision<\/code>, and <code>Recall<\/code>:<\/p>\n<pre><code>val accuracy = (tp + tn) \/ (tp + tn + fp + fn).toDouble\nval precision = tp \/ (tp + fp).toDouble\nval recall = tp \/ (tp + fn).toDouble\n\nprintln(s\"Accuracy = $accuracy\")\nprintln(s\"Precision = $precision\")\nprintln(s\"Recall = $recall\\n\")<\/code><\/pre>\n<pre><code>Accuracy = 0.7708107040584576\nPrecision = 0.38777490297542044\nRecall = 0.7369391518131531<\/code><\/pre>\n<h4>PRECOMPUTE<\/h4>\n<p>Of course, 
nobody runs interactive code like the above in\u00a0<code>Production<\/code>. Instead, the code is written as an application, built into an executable artifact, and run on a cluster.<\/p>\n<p>One way to use ML in Production is\u00a0<em>precomputation<\/em> (<code>Precompute<\/code>). In batch mode, on a schedule, the trained model is applied to a dataset. 
The identifiers of the customers for whom the model predicts churn are saved for use in downstream business processes.<\/p>\n<p>The source code for the\u00a0<em>precomputation<\/em>\u00a0approach looks like this:<\/p>\n<pre><code>package ru.otus.sparkml\n\nimport org.apache.spark.sql.{SaveMode, SparkSession}\nimport org.apache.spark.ml.PipelineModel\n\nobject ProdML {\n  def main(args: Array[String]): Unit = {\n    \/\/ Expect exactly three arguments: model path, input CSV path, output path\n    if (args.length != 3) {\n      println(\"Usage: SparkML &lt;path-to-model> &lt;path-to-input> &lt;path-to-output>\")\n      sys.exit(-1)\n    }\n\n    val spark = SparkSession.builder\n      .appName(\"SparkML\")\n      .config(\"spark.sql.debug.maxToStringFields\", 100)\n      .getOrCreate()\n\n    import spark.implicits._\n\n    try {\n      \/\/ Load the saved pipeline model and the input data\n      val model = PipelineModel.load(args(0))\n\n      val data = spark.read\n        .option(\"header\", \"true\")\n        .option(\"inferSchema\", \"true\")\n        .csv(args(1))\n\n      val prediction = model.transform(data)\n\n      \/\/ Keep only the IDs of customers predicted to churn and write them as CSV\n      prediction\n        .filter($\"prediction\" === 1)\n        .select(\"CLIENTNUM\")\n        .repartition(1)\n        .write\n        .mode(SaveMode.Overwrite)\n        .csv(args(2))\n    } catch {\n      case e: Exception =>\n        println(s\"ERROR: ${e.getMessage}\")\n        sys.exit(-1)\n    } finally {\n      spark.stop()\n    }\n  }\n}<\/code><\/pre>\n<p>The whole project is available here:\u00a0<a href=\"https:\/\/github.com\/vzaigrin\/otus\/tree\/main\/SparkML\"><u>https:\/\/github.com\/vzaigrin\/otus\/tree\/main\/SparkML<\/u><\/a><\/p>\n<hr\/>\n<p>Everyone interested is invited to the <a href=\"https:\/\/otus.pw\/Xe24\/\"><strong>open lesson<\/strong><\/a> \"Architecture of the Apache Spark framework\". 
In this session we will look at the internals of Apache Spark:<\/p>\n<ul>\n<li>\n<p>what it is and why it is needed;<\/p>\n<\/li>\n<li>\n<p>how distributed applications work on this framework;<\/p>\n<\/li>\n<li>\n<p>what these applications consist of;<\/p>\n<\/li>\n<li>\n<p>how they scale.<\/p>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"v-portal\" style=\"display:none;\"><\/div>\n<\/div>\n<p> <!----> <!----><br \/> link to the original article <a href=\"https:\/\/habr.com\/ru\/company\/otus\/blog\/653033\/\"> https:\/\/habr.com\/ru\/company\/otus\/blog\/653033\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<div><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<figure class=\"full-width\"><figcaption><\/figcaption><\/figure>\n<p>Hello, Habr. 
We are sharing an original article by OTUS instructor Vadim Zaigrin.<\/p>\n<h3>Apache Spark<\/h3>\n<p><a href=\"http:\/\/spark.apache.org\/\"><u>Apache Spark<\/u><\/a>\u00a0is a distributed data processing framework that has become the de facto standard for big data processing.<\/p>\n<p>Spark consists of several components, which include machine learning libraries.<\/p>\n<figure class=\"full-width\"><figcaption>Spark stack<\/figcaption><\/figure>\n<p><strong>Spark ML<\/strong>\u00a0provides a basic machine learning toolkit:<\/p>\n<ul>\n<li>\n<p>Algorithms 
such as classification, regression, clustering, and collaborative filtering.<\/p>\n<\/li>\n<li>\n<p>Feature engineering methods.<\/p>\n<\/li>\n<li>\n<p>Pipelines.<\/p>\n<\/li>\n<li>\n<p>Saving and loading of models and pipelines.<\/p>\n<\/li>\n<li>\n<p>Utilities: linear algebra, statistics, data handling, etc.<\/p>\n<\/li>\n<\/ul>\n<p>Compared with other machine learning libraries such as\u00a0<a href=\"http:\/\/scikit-learn.org\/\"><u>scikit-learn<\/u><\/a>, the set of algorithms in Spark ML looks more modest, but it 
covers all the main methods. In addition, Spark ML lets you add your own methods and implement missing algorithms.<\/p>\n<p><strong>Spark ML<\/strong>\u00a0consists of two libraries:<\/p>\n<ul>\n<li>\n<p>spark.ml is a machine learning library built on the DataFrame API;<\/p>\n<\/li>\n<li>\n<p>spark.mllib is built on the RDD API.<\/p>\n<\/li>\n<\/ul>\n<p>Since version 2.0 the primary library is spark.ml, but spark.mllib contains data types that are used in spark.ml.<\/p>\n<p>Both variants of Spark ML are well described in the\u00a0<a 
href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-guide.html\"><u>documentation<\/u><\/a>, so I will not retell it here. Instead, let's look at how to work with Spark ML using a concrete example.<\/p>\n<h3>Downloading Spark<\/h3>\n<p>Spark can be run in local mode, without a cluster. This lets you get acquainted with the API and see how it behaves in practice.<\/p>\n<p>Spark runs on the JVM. 
Therefore, to run jobs and develop applications, a JDK must be installed on the computer, the path to\u00a0<em>java<\/em>\u00a0must be in the PATH variable, and the JAVA_HOME variable must be set.<\/p>\n<p>To run Spark in local mode, do the following:<\/p>\n<ol>\n<li>\n<p>Download the Spark distribution to your computer:\u00a0<a href=\"http:\/\/spark.apache.org\/downloads.html\"><u>http:\/\/spark.apache.org\/downloads.html<\/u><\/a><\/p>\n<ul>\n<li>\n<p>From the list of versions, choose the one used at your workplace. 
If Spark is not used at your workplace and you just want to learn it, it is better to download the latest version.<\/p>\n<\/li>\n<li>\n<p>Besides the version of Spark itself, you can choose which Hadoop libraries come bundled with it. Since we are going to run Spark locally, the \u201cPre-built with user-provided Apache Hadoop\u201d option does not suit us, because in that case we would also have to download and install the Hadoop libraries. 
Choose one of the \u201cPre-built for Apache Hadoop \u2026\u201d options.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>Unpack the archive, for example into the\u00a0<code>\/opt\/spark<\/code> directory.<\/p>\n<\/li>\n<li>\n<p>If you wish, change the default settings. 
They are located in the\u00a0<code>conf<\/code> directory:<\/p>\n<ul>\n<li>\n<p><em>log4j.properties<\/em>\u00a0\u2013 logging settings (for example, replace INFO with WARN);<\/p>\n<\/li>\n<li>\n<p><em>spark-defaults.conf<\/em>\u00a0\u2013 spark-submit settings (for example, increase the driver memory).<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>Set the SPARK_HOME variable so that it points to the directory where you unpacked the archive.<\/p>\n<\/li>\n<\/ol>\n<h3>Running Spark<\/h3>\n<p>Before writing and compiling a program for Spark, it is a good idea to work with it in interactive mode (REPL).<\/p>\n<p>There are several options for this:<\/p>\n<ul>\n<li>\n<p>spark-shell (pyspark): a console Scala\/Python REPL with Spark preconfigured. It is included in the Spark distribution. It is inconvenient for long working sessions.<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/zeppelin.apache.org\/\"><u>Apache Zeppelin<\/u><\/a>: a notebook service that runs in the browser. It supports a large number of interpreters, including Spark, Scala and Python. 
Like the standard console REPL, it conveniently provides a preconfigured Spark.<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/livy.apache.org\/\"><u>Apache Livy<\/u><\/a>: a REST service for Spark. It lets you submit jobs and work interactively.<\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/toree.apache.org\/\"><u>Apache Toree<\/u><\/a>: a Jupyter Notebook kernel for working with Spark.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/almond.sh\/\"><u>Almond<\/u><\/a>: a Scala kernel for Jupyter. It supports Spark.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/plugins.jetbrains.com\/plugin\/12494-big-data-tools\"><u>JetBrains Big Data Tools<\/u><\/a>: a plugin for the IntelliJ\u00a0IDEA, DataGrip and PyCharm IDEs from JetBrains. 
It lets you work with Zeppelin notebooks directly from the IDE and provides access to Spark and Kafka monitoring, to HDFS, and so on.<\/p>\n<\/li>\n<\/ul>\n<p>Personally, I prefer to use Apache Zeppelin together with JetBrains Big Data Tools.<\/p>\n<h3>The machine learning task<\/h3>\n<p>As an example, let's take the task of predicting bank customer churn.<br \/>The description of the task and the dataset are available <a href=\"https:\/\/www.kaggle.com\/sakshigoyal7\/credit-card-customers\">on Kaggle<\/a>.<\/p>\n<p>This dataset consists of 10,000 customers and contains features such as age, salary, marital status, credit card limit, credit card category and so on, 
as well as the\u00a0<code>Attrition_Flag<\/code>\u00a0variable, which indicates churn (whether the customer has stopped using the bank's services).<\/p>\n<p>We are solving a\u00a0<strong>binary classification<\/strong> problem. 
We need to build a model that predicts which group a customer belongs to.<\/p>\n<h3>ML stages<\/h3>\n<p>What stages should an ML project consist of?<\/p>\n<p>There are several methodologies. We will use\u00a0<a href=\"https:\/\/ru.wikipedia.org\/wiki\/CRISP-DM\"><u>CRISP-DM<\/u><\/a>.<\/p>\n<h4>CRISP-DM<\/h4>\n<p><strong>CRISP-DM<\/strong>\u00a0(<em>Cross-Industry Standard Process for Data Mining<\/em>) is the most widely used methodology for data mining.<\/p>\n<p>A data mining project that follows the CRISP-DM methodology consists of the following phases:<\/p>\n<ol>\n<li>\n<p><em>Business Understanding<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Data Understanding<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Data Preparation<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Modeling<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Evaluation<\/em>;<\/p>\n<\/li>\n<li>\n<p><em>Deployment<\/em>.<\/p>\n<\/li>\n<\/ol>\n<p>We will solve our task by following these steps.<\/p>\n<h3>Business Understanding<\/h3>\n<p>In our case, the business goals are simple. The bank is interested in retaining its customers. 
By predicting which customers belong to the group that is likely to leave, the bank can act proactively and offer them favorable terms so that they remain its customers.<\/p>\n<h3>Data Understanding<\/h3>\n<p>Let's load the dataset and take a look at it.<\/p>\n<p>The data is stored in a CSV file. 
Let's load it in the standard Spark way into the variable\u00a0<code>raw<\/code>\u00a0of type DataFrame:<\/p>\n<pre><code>val raw = spark\n  .read\n  .option(\"header\", \"true\")      \/\/ the first line holds the column names\n  .option(\"inferSchema\", \"true\") \/\/ infer column types from the data\n  .csv(s\"$basePath\/data\/BankChurners.csv\")<\/code><\/pre>\n<p>The variable\u00a0<code>basePath<\/code>\u00a0contains the path to the working directory of this project.<\/p>\n<p>The description of this dataset says: \u201cPLEASE IGNORE THE LAST 2 COLUMNS (NAIVE BAYES CLAS\u2026)\u201d. 
In addition, the first column contains a unique customer identifier, which is of no use for building the model.<\/p>\n<p>Let's prepare the list of columns to exclude from the loaded dataset: the first column and the last two. 
We take the list of columns from the DataFrame, select the last two elements and append the first one.<\/p>\n<pre><code>val columns: Array[String] = raw.columns\nval columnsLen: Int = columns.length\n\/\/ the last two columns plus the first one (the customer identifier)\nval colsToDrop: Array[String] = columns.slice(columnsLen - 2, columnsLen) :+ columns.head<\/code><\/pre>\n<p>The variable\u00a0<code>colsToDrop<\/code>\u00a0is an array with the names of the columns to exclude from the loaded dataset.<\/p>\n<p>To remove columns from a DataFrame, use the\u00a0<code>drop<\/code> method, which takes one or more column names as variable-length arguments (varargs). 
To pass an array as the arguments of such a method, Scala uses the\u00a0<code>array: _*<\/code> construct.<\/p>\n<pre><code>val df = raw.drop(colsToDrop: _*)<\/code><\/pre>\n<p>So, the variable\u00a0<code>df<\/code>\u00a0of type DataFrame contains the original dataset without the first column and the last two. 
It is useful to look at the first few records of this dataset:<\/p>\n<pre><code>df.show(5, truncate = false)<\/code><\/pre>\n<pre><code>+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+\n|Attrition_Flag   |Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|\n+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+\n|Existing Customer|45          |M     |3              |High School    |Married       |$60K - $80K    |Blue         |39            |5                       |1                     |3                    |12691.0     |777                |11914.0        |1.335               |1144           |42            |1.625              |0.061                |\n|Existing Customer|49          |F     |5              |Graduate       |Single        |Less than $40K |Blue         |44            |6                       |1                     |2                    |8256.0      |864                |7392.0         |1.541               |1291           |33            |3.714              |0.105                |\n|Existing Customer|51          |M     |3              |Graduate       |Married       |$80K - $120K   |Blue         |36            |4                       |1                     |0                    |3418.0      |0                  |3418.0         |2.594               |1887           |20            |2.333              |0.0                  |\n|Existing Customer|40          |F     |4              |High School    |Unknown       |Less than $40K |Blue         |34            |3                       |4                     |1                    |3313.0      |2517               |796.0          |1.405               |1171           |20            |2.333              |0.76                 |\n|Existing Customer|40          |M     |3              |Uneducated     |Married       |$60K - $80K    |Blue         |21            |5                       |1                     |0                    |4716.0      |0                  |4716.0         |2.175               |816            |28            |2.5                |0.0                  |\n+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+\nonly showing top 5 rows<\/code><\/pre>\n<h4>Determining the column types<\/h4>\n<p>To understand the data, it is useful to find out what types of columns the dataset contains.<\/p>\n<p>The\u00a0<code>printSchema<\/code> method is most often used to print the schema of a DataFrame:<\/p>\n<pre><code>df.printSchema<\/code><\/pre>\n<pre><code>root\n |-- Attrition_Flag: string (nullable = true)\n |-- Customer_Age: integer (nullable = true)\n |-- Gender: string (nullable = true)\n 
|--<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-330015","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/330015","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=330015"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/330015\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=330015"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=330015"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=330015"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}