The goal of this example is to build a small Java app that uses Spark to count the lines of a text file, or the lines containing a given word.
We will work with Spark 2.1.0, and I assume that the following are installed:
- Maven 3
- Eclipse
- Spark (note: you don’t really need Spark installed but it is convenient to have the shell to test stuff)
The complete (and updated) code is available on GitHub: https://github.com/rgugliel/first-spark-project
Creating the project
In the Eclipse workspace folder, we create a dummy project called first-spark-project with the following command:
cd ~/workspace
mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=first-spark-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
We will then build the project and create the Eclipse-specific files:
cd first-spark-project/
mvn clean package eclipse:eclipse
Importing the project into Eclipse
We now have to import the project into Eclipse. To do that, open Eclipse and use:
File > Import > Existing Projects into Workspace
Then, choose the root directory of your project and click "Finish".
The src/main/java/com/mycompany/app/App.java file should run and display a nice "Hello World".
Converting the project
Right-click on your project and then: Configure > Convert to a Maven Project
Adding the dependencies
With a text editor, open pom.xml and add the following inside the <dependencies>…</dependencies> section:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.11.8</version>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-xml_2.11.0-M4</artifactId>
  <version>1.0-RC1</version>
</dependency>
Check that everything works correctly:
cd ~/workspace/first-spark-project/
mvn compile
The program
In the same folder as the pom.xml file, create a text file called linescount.txt with a few lines of text inside.
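If you prefer to generate this file programmatically, here is a minimal sketch (the class name MakeSample and the sample lines are placeholders of my choosing) using the standard java.nio API:

package com.mycompany.app;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class MakeSample {
    public static void main(String[] args) throws IOException {
        // Writes three sample lines into linescount.txt in the current directory
        Files.write(Paths.get("linescount.txt"),
                Arrays.asList("first line", "second line", "third line"));
    }
}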
Then, replace the contents of the src/main/java/com/mycompany/app/App.java file with the following:
package com.mycompany.app;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");

        // Run locally, with as many worker threads as there are cores
        SparkConf conf = new SparkConf().setAppName("firstSparkProject").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        String path = "linescount.txt";
        System.out.println("Trying to open: " + path);

        // Load the file as an RDD of lines and count them
        JavaRDD<String> lines = sc.textFile(path);
        System.out.println("Lines count: " + lines.count());

        sc.stop();
    }
}
If you run the application, it should display the number of lines contained in linescount.txt.
Some remarks:
- "local[*]" tells Spark to run the computations locally, with as many worker threads as there are cores
- You can control the verbosity of the output (see the sketch after this list) with
sc.setLogLevel("WARN");
(possible values are: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN)
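As an illustration of both remarks, here is a minimal sketch (the class name QuietApp and the choice of two threads are mine, not part of the original project) that fixes the number of worker threads and reduces the log output:

package com.mycompany.app;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class QuietApp {
    public static void main(String[] args) {
        // "local[2]" runs the computations locally with exactly two threads
        SparkConf conf = new SparkConf().setAppName("firstSparkProject").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Only messages of level WARN or above will be printed
        sc.setLogLevel("WARN");

        // ... computations go here ...

        sc.stop();
    }
}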
A variant
We can also use simple lambda functions to do some computations. For this example, we use the "README.md" file that ships with Spark.
String path = "README.md";
System.out.println("Trying to open: " + path);
JavaRDD<String> jrdd = sc.textFile(path);

// Number of lines
System.out.println("# lines: " + jrdd.count());

// Number of lines with more than 5 words
System.out.println("# > 5 words: " + jrdd.filter(line -> line.split(" ").length > 5).count());

// Number of lines which contain the word "Spark"
System.out.println("# contains Spark: " + jrdd.filter(line -> line.contains("Spark")).count());
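As stated at the beginning, we may also want to count the lines containing some given word. Here is a minimal sketch (the class name WordFilter is hypothetical) where the word is passed as the first command-line argument instead of being hard-coded:

package com.mycompany.app;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordFilter {
    public static void main(String[] args) {
        // The word to search for comes from the command line ("Spark" by default)
        final String word = (args.length > 0) ? args[0] : "Spark";

        SparkConf conf = new SparkConf().setAppName("wordFilter").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("README.md");
        System.out.println("# lines containing \"" + word + "\": "
                + lines.filter(line -> line.contains(word)).count());

        sc.stop();
    }
}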