A first project with Spark, Java, Maven and Eclipse

The goal of this example is to make a small Java app which uses Spark to count the number of lines of a text file, or lines which contain some given word.
We will work with Spark 2.1.0 and I suppose that the following are installed:

Maven 3
Eclipse
Spark (note: you don’t really need Spark installed but it is convenient to have the shell to test stuff)

The complete (and updated) code is available on GitHub: https://github.com/rgugliel/first-spark-project

Creating the project

In the Eclipse workspace folder, we create a dummy project called first-spark-project with the following command:

 cd ~/workspace
mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=first-spark-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

We will then build the project and create Eclipse specific files:

 cd first-spark-project/
mvn clean package eclipse:eclipse

Importing the project into Eclipse

We know have to importe the project into Eclipse. In order to do that, open Eclipse and use:
File > Import > Existing Projects into Workspace
Then, choose the root directory of your project. Click « Finish »

The src/main/java/com/mycompany/app/App.java file should run and display a nice « Hello World »

Converting the project

Right click on your project and then: Configure > Convert to a Maven Project

Adding the dependences

With a text editor, modify the file pom.xml and add inside the « <dependencies>…</dependencies> », add the following:

 <dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.11</artifactId>
	<version>2.1.0</version>
</dependency>

<dependency>
<groupId>org.scala-lang</groupId>
	<artifactId>scala-library</artifactId>
	<version>2.11.8</version>
</dependency>

<dependency>
	<groupId>org.scala-lang</groupId>
	<artifactId>scala-xml_2.11.0-M4</artifactId>
	<version>1.0-RC1</version>
</dependency>

Check that everything works correctly:

 cd ~/workspace/first-spark-project/
mvn compile

The program

In the same folder as the pom.xml file, create a text file called linescount.txt with a few lines of text inside.

Then, replace the contents of the src/main/java/com/mycompany/app/App.java file with the following

package com.mycompany.app;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
* Hello world!
*
*/
public class App 
{
	public static void main( String[] args )
	{
		System.out.println( "Hello World!" );

		SparkConf conf = new SparkConf().setAppName("firstSparkProject").setMaster("local[*]");
		JavaSparkContext sc = new JavaSparkContext(conf);
		String path = "linescount.txt";

		System.out.println("Trying to open: " + path);

		JavaRDD<String> lines = sc.textFile(path.toString());
		System.out.println("Lines count: " + lines.count());
		sc.stop();
	}
}

If you execute the file it should display the number of lines contained in linescount.txt

Some remarks:

« local[*] » indicates to run the computations locally with as many threads as possible
You can control the verbosity of the output with
```
sc.setLogLevel("WARN");
```
(possible values are: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN)

A variant

We can also use simple lambda functions to do some computations. We use for this example the « README.md » file of Spark.

String path = "README.md";
System.out.println("Trying to open: " + path);
		
JavaRDD<String> jrdd = sc.textFile(path.toString());

// Number of lines
System.out.println("# lines: " + jrdd.count() );

// Number of lines with more than 5 words
System.out.println("# > 5 words: " + jrdd.filter(line -> line.split(" ").length > 5 ).count() );

// Number of lines which contains the word "Spark"
System.out.println("# contains Spark: " + jrdd.filter(line -> line.contains("Spark") ).count() );