An example app for binary classification. Run with
bin/run-example org.apache.spark.examples.mllib.BinaryClassification
A synthetic dataset is located at data/mllib/sample_binary_classification_data.txt.
If you use it as a template to create your own app, please use spark-submit
to submit your app.
An example demonstrating a bisecting k-means clustering in spark.mllib.
Run with
bin/run-example mllib.BisectingKMeansExample
An example app for summarizing multivariate data from a file. Run with
bin/run-example org.apache.spark.examples.mllib.Correlations
By default, this loads a synthetic dataset from data/mllib/sample_linear_regression_data.txt.
If you use it as a template to create your own app, please use spark-submit
to submit your app.
Compute the similar columns of a matrix, using cosine similarity.
The input matrix must be stored in row-oriented dense format, one line per row with its entries separated by space. For example,
0.5 1.0
2.0 3.0
4.0 5.0
represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
Example invocation:
bin/run-example mllib.CosineSimilarity \
  --threshold 0.1 data/mllib/sample_svm_data.txt
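The column-similarity computation is easy to check outside Spark. Below is a minimal Python sketch (the example itself is Scala) that parses the row-oriented dense format and computes brute-force cosine similarity between the two columns of the sample matrix; this is the quantity the example estimates at scale.

```python
import math

# Parse a row-oriented dense matrix: one line per row, space-separated entries.
text = """0.5 1.0
2.0 3.0
4.0 5.0"""
rows = [[float(v) for v in line.split()] for line in text.splitlines()]

# Transpose to get column vectors, then compute all-pairs cosine similarity.
cols = list(zip(*rows))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sims = {(i, j): cosine(cols[i], cols[j])
        for i in range(len(cols)) for j in range(i + 1, len(cols))}
print(sims[(0, 1)])
```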
An example runner for decision trees and random forests. Run with
./bin/run-example org.apache.spark.examples.mllib.DecisionTreeRunner [options]
If you use it as a template to create your own app, please use spark-submit
to submit your app.
Note: This script treats all features as real-valued (not categorical). To include categorical features, modify categoricalFeaturesInfo.
An example k-means app. Run with
./bin/run-example org.apache.spark.examples.mllib.DenseKMeans [options] <input>
If you use it as a template to create your own app, please use spark-submit
to submit your app.
Example for mining frequent itemsets using FP-growth. Example usage:
./bin/run-example mllib.FPGrowthExample \
  --minSupport 0.8 --numPartition 2 ./data/mllib/sample_fpgrowth.txt
An example runner for Gradient Boosting using decision trees as weak learners. Run with
./bin/run-example mllib.GradientBoostedTreesRunner [options]
If you use it as a template to create your own app, please use spark-submit
to submit your app.
Note: This script treats all features as real-valued (not categorical). To include categorical features, modify categoricalFeaturesInfo.
An example Latent Dirichlet Allocation (LDA) app. Run with
./bin/run-example mllib.LDAExample [options] <input>
If you use it as a template to create your own app, please use spark-submit
to submit your app.
An example app for ALS on MovieLens data (http://grouplens.org/datasets/movielens/). Run with
bin/run-example org.apache.spark.examples.mllib.MovieLensALS
A synthetic dataset in MovieLens format can be found at data/mllib/sample_movielens_data.txt.
If you use it as a template to create your own app, please use spark-submit
to submit your app.
An example app for summarizing multivariate data from a file. Run with
bin/run-example org.apache.spark.examples.mllib.MultivariateSummarizer
By default, this loads a synthetic dataset from data/mllib/sample_linear_regression_data.txt.
If you use it as a template to create your own app, please use spark-submit
to submit your app.
An example Power Iteration Clustering http://www.icml2010.org/papers/387.pdf app. Takes an input of K concentric circles and the number of points in the innermost circle. The output should be K clusters - each cluster containing precisely the points associated with each of the input circles.
Run with
./bin/run-example mllib.PowerIterationClusteringExample [options]
Where options include:
k: Number of circles/clusters
n: Number of sampled points on the innermost circle. There are proportionally more points within the outer/larger circles.
maxIterations: Number of Power Iterations
Here is a sample run and output:
./bin/run-example mllib.PowerIterationClusteringExample -k 2 --n 10 --maxIterations 15
Cluster assignments: 1 -> [0,1,2,3,4,5,6,7,8,9], 0 -> [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
If you use it as a template to create your own app, please use spark-submit
to submit your app.
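The concentric-circle input described above can be generated without Spark. A hypothetical Python sketch, assuming the i-th circle has radius i and carries proportionally more (n * i) evenly spaced points, which reproduces the 10 + 20 split seen in the sample run:

```python
import math

def concentric_circles(k, n):
    """Generate k concentric circles; circle i has radius i and n * i points."""
    points = []
    for circle in range(1, k + 1):
        radius = float(circle)
        count = n * circle  # proportionally more points on outer circles
        for j in range(count):
            theta = 2.0 * math.pi * j / count
            points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points

pts = concentric_circles(2, 10)
print(len(pts))  # 10 points on the inner circle + 20 on the outer
```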
An example app for randomly generated RDDs. Run with
bin/run-example org.apache.spark.examples.mllib.RandomRDDGeneration
If you use it as a template to create your own app, please use spark-submit
to submit your app.
An example app for randomly generated and sampled RDDs. Run with
bin/run-example org.apache.spark.examples.mllib.SampledRDDs
If you use it as a template to create your own app, please use spark-submit
to submit your app.
An example naive Bayes app. Run with
./bin/run-example org.apache.spark.examples.mllib.SparseNaiveBayes [options] <input>
If you use it as a template to create your own app, please use spark-submit
to submit your app.
Estimate clusters on one stream of data and make predictions on another stream, where the data streams arrive as text files into two different directories.
The rows of the training text files must be vector data in the form
[x1,x2,x3,...,xn]
where n is the number of dimensions.
The rows of the test text files must be labeled data in the form
(y,[x1,x2,x3,...,xn])
where y is some identifier. n must be the same for train and test.
Usage: StreamingKMeansExample <trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>
To run on your local machine using the two directories trainingDir and testDir, with updates every 5 seconds, 2 dimensions per data point, and 3 clusters, call:
$ bin/run-example mllib.StreamingKMeansExample trainingDir testDir 5 3 2
As you add text files to trainingDir the clusters will continuously update. Anytime you add text files to testDir, you'll see predicted labels using the current model.
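The two row formats above are simple enough to parse by hand. A minimal Python sketch of the parsing (hypothetical helpers; the Scala example delegates this to Spark's Vectors.parse and LabeledPoint.parse):

```python
def parse_vector(line):
    # "[x1,x2,...,xn]" -> list of floats
    return [float(v) for v in line.strip()[1:-1].split(",")]

def parse_labeled(line):
    # "(y,[x1,x2,...,xn])" -> (label, vector)
    label, rest = line.strip()[1:-1].split(",", 1)
    return float(label), parse_vector(rest)

print(parse_vector("[1.0,2.0,3.0]"))
print(parse_labeled("(1,[0.5,1.5])"))
```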
Train a linear regression model on one stream of data and make predictions on another stream, where the data streams arrive as text files into two different directories.
The rows of the text files must be labeled data points in the form
(y,[x1,x2,x3,...,xn])
where n is the number of features. n must be the same for train and test.
Usage: StreamingLinearRegressionExample <trainingDir> <testDir>
To run on your local machine using the two directories trainingDir and testDir, with updates every 5 seconds, and 2 features per data point, call:
$ bin/run-example mllib.StreamingLinearRegressionExample trainingDir testDir
As you add text files to trainingDir the model will continuously update. Anytime you add text files to testDir, you'll see predictions from the current model.
Train a logistic regression model on one stream of data and make predictions on another stream, where the data streams arrive as text files into two different directories.
The rows of the text files must be labeled data points in the form
(y,[x1,x2,x3,...,xn])
where n is the number of features, y is a binary label, and n must be the same for train and test.
Usage: StreamingLogisticRegression <trainingDir> <testDir> <batchDuration> <numFeatures>
To run on your local machine using the two directories trainingDir and testDir, with updates every 5 seconds, and 2 features per data point, call:
$ bin/run-example mllib.StreamingLogisticRegression trainingDir testDir 5 2
As you add text files to trainingDir the model will continuously update. Anytime you add text files to testDir, you'll see predictions from the current model.
Perform streaming significance testing using Welch's 2-sample t-test on a stream of data, where the data stream arrives as text files in a directory. Stops when the difference between the two groups becomes statistically significant (p-value < 0.05) or after a user-specified timeout (in number of batches) is exceeded.
The rows of the text files must be in the form Boolean, Double. For example:
false, -3.92
true, 99.32
Usage: StreamingTestExample <dataDir> <batchDuration> <numBatchesTimeout>
To run on your local machine using the directory dataDir, with 5 seconds between each batch and a timeout after 100 insignificant batches, call:
$ bin/run-example mllib.StreamingTestExample dataDir 5 100
As you add text files to dataDir the significance test will continually update every batchDuration seconds until the test becomes significant (p-value < 0.05) or the number of batches processed exceeds numBatchesTimeout.
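The test statistic itself is straightforward. Below is a plain-Python sketch of Welch's 2-sample t-statistic over rows in the Boolean, Double format above (the Scala example uses Spark's StreamingTest with the "welch" test method, which also derives the p-value; this sketch omits that step since it needs the t-distribution CDF):

```python
import math

# Split "Boolean, Double" rows into the two groups being compared.
rows = ["false, -3.92", "true, 99.32", "false, -1.1", "true, 98.0"]
groups = {True: [], False: []}
for row in rows:
    flag, value = row.split(",")
    groups[flag.strip() == "true"].append(float(value))

def welch_t(a, b):
    """Welch's t-statistic: (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variance
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(groups[True], groups[False])
print(t)
```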
Compute the principal components of a tall-and-skinny matrix, whose rows are observations.
The input matrix must be stored in row-oriented dense format, one line per row with its entries separated by space. For example,
0.5 1.0
2.0 3.0
4.0 5.0
represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
Compute the singular value decomposition (SVD) of a tall-and-skinny matrix.
The input matrix must be stored in row-oriented dense format, one line per row with its entries separated by space. For example,
0.5 1.0
2.0 3.0
4.0 5.0
represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
An example app for linear regression. Run with
bin/run-example org.apache.spark.examples.mllib.LinearRegression
A synthetic dataset can be found at data/mllib/sample_linear_regression_data.txt.
If you use it as a template to create your own app, please use spark-submit
to submit your app.
(Since version 2.0.0) Use ml.regression.LinearRegression or LBFGS
(Since version 2.0.0) Deprecated since LinearRegressionWithSGD is deprecated. Use ml.feature.PCA
(Since version 2.0.0) Use ml.regression.LinearRegression and the resulting model summary for metrics
Abstract class for parameter case classes. This overrides the toString method to print all case class fields by name and value.
Concrete parameter class.