Now that Thomas has just released the fifth milestone of Spring for Apache Hadoop, I'd like to use this opportunity to talk about recent developments in its new feature, Spring YARN.
One strength of our Spring IO Platform is the interoperability of its technologies. A great example of this is how Spring Boot and Spring YARN are able to work together to create a better model for Hadoop YARN application development. In this blog post I'd like to show an example of the new Spring YARN application model, which is heavily based on Spring Boot.
The development life cycle, from the moment a developer starts work to the point when someone actually executes the application on a Hadoop cluster, is a bit more complicated than just writing a few lines of code. Let's see what needs to be considered:
We believe that Spring YARN and Spring Boot provide a simple programming model for developing applications that can easily be tested and deployed as either a YARN application or a traditional application.
At a high level, Spring YARN provides three components, YarnClient, YarnAppmaster and YarnContainer, that mirror the key processes in the YARN architecture. Taken together, these three components provide the foundation of the Spring YARN application model.
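To make the division of labour between the three roles concrete, here is a toy, in-process sketch. None of the types below are Spring YARN classes: ToyContainer, ToyAppmaster and submit() are hypothetical names chosen only for illustration, and the real components run as separate processes on a cluster rather than as method calls.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the three YARN roles. All names are hypothetical;
// this is not the Spring YARN API.
final class ToyYarnFlow {

    /** Stands in for the work a single container performs. */
    interface ToyContainer {
        String run(int containerId);
    }

    /** Stands in for the application master: launches containers and collects results. */
    static final class ToyAppmaster {
        private final int containerCount;
        private final ToyContainer container;

        ToyAppmaster(int containerCount, ToyContainer container) {
            this.containerCount = containerCount;
            this.container = container;
        }

        List<String> launchContainers() {
            List<String> results = new ArrayList<>();
            for (int id = 1; id <= containerCount; id++) {
                results.add(container.run(id));
            }
            return results;
        }
    }

    /** Stands in for the client: submitting the application starts the appmaster. */
    static List<String> submit(ToyAppmaster appmaster) {
        return appmaster.launchContainers();
    }

    public static void main(String[] args) {
        ToyAppmaster appmaster = new ToyAppmaster(2, id -> "container " + id + " done");
        submit(appmaster).forEach(System.out::println);
    }
}
```

The point of the sketch is only the chain of responsibility: the client submits, the appmaster decides how many containers to launch, and each container runs the actual task.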
It has always been a cumbersome process to get your own code packaged and executed on a Hadoop cluster. One needs to put a compiled package on Hadoop's classpath or let Hadoop's tools copy the package into Hadoop during job submission. Once you get past WordCount, your code will depend on third-party libraries that are not present on Hadoop's default classpath. How should you package your dependent libraries? Furthermore, what if your dependencies collide with libraries that are already part of Hadoop's default classpath?
Spring Boot helps to provide an elegant solution to these build and packaging issues. You either create an executable jar (sometimes called an uber jar or fat jar), which bundles your application code and all its dependencies into a single .jar file, or create a zip file that can be exploded before the code is executed. The main difference between the two packaging formats is that the latter lets you re-use jars that are already available on Hadoop's default classpath.
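What makes a jar "executable" is the Main-Class attribute in its manifest, which is what java -jar reads. The following is a minimal sketch, using only the JDK's java.util.jar API, of how you might inspect that attribute. The class name JarManifestPeek and the demo.Main value are hypothetical; pass the path of a real jar as the first argument to inspect it instead of the generated demo jar.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class JarManifestPeek {

    /** Returns the Main-Class manifest attribute of the jar, or null if there is none. */
    static String mainClassOf(Path jar) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes a temporary jar that contains only a manifest. */
    static Path writeDemoJar(String mainClass) {
        Manifest manifest = new Manifest();
        // Manifest-Version must be set, otherwise the main attributes are not written.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        try {
            Path jar = Files.createTempFile("demo", ".jar");
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
                jarOut.finish(); // the manifest is the only entry
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path jar = args.length > 0 ? Paths.get(args[0]) : writeDemoJar("demo.Main");
        System.out.println(jar + " -> Main-Class: " + mainClassOf(jar));
    }
}
```

For a jar repackaged by Spring Boot, expect Main-Class to name a Boot launcher class rather than your own application class, since the launcher is what knows how to load the bundled dependencies.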
In this guide we are going to show how these three components, YarnClient, YarnAppmaster and YarnContainer, are packaged into executable jars using Spring Boot. Internally, Spring Boot relies heavily on application auto-configuration, and Spring YARN adds its own auto-configuration magic. Users can then concentrate on their own code and application configuration instead of spending a lot of time trying to understand how all the components should integrate with each other.
We will now show you how simple it is to create and deploy a custom application to a Hadoop cluster. Notice that there is no need to use XML.
Here you create the ContainerApplication and HelloPojo classes.
@Configuration
@EnableAutoConfiguration
public class ContainerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ContainerApplication.class, args);
    }

    @Bean
    public HelloPojo helloPojo() {
        return new HelloPojo();
    }

}
In the ContainerApplication class above, notice how we added @Configuration at the class level and @Bean on the helloPojo() method. We previously mentioned the YarnContainer component, which is an interface for whatever you want to execute in your containers. You could define a custom YarnContainer that implements this interface and wrap all the logic inside that implementation.
However, Spring YARN defaults to a DefaultYarnContainer if none is defined, and this default implementation expects to find a specific bean that holds the real logic of what the container is supposed to do. This effectively creates a simple model where, at minimum, only a simple POJO is needed.
@YarnContainer
public class HelloPojo {

    private static final Log log = LogFactory.getLog(HelloPojo.class);

    @Autowired
    private Configuration configuration;

    @OnYarnContainerStart
    public void publicVoidNoArgsMethod() {
        log.info("Hello from HelloPojo");
        log.info("About to list from hdfs root content");
        FsShell shell = new FsShell(configuration);
        for (FileStatus s : shell.ls(false, "/")) {
            log.info(s);
        }
    }

}
The HelloPojo class is a simple POJO in the sense that it doesn't extend any Spring YARN base classes. What we did in this class:
- We added a class-level @YarnContainer annotation.
- We added a method-level @OnYarnContainerStart annotation.
- We @Autowired Hadoop's Configuration class.
@YarnContainer is itself a stereotype annotation that carries Spring's @Component. This automatically marks the class as a candidate for @YarnContainer functionality.
Within this class we can use the @OnYarnContainerStart annotation to mark a public method with no return value and no arguments as the code that needs to be executed on Hadoop.
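The idea of marking a public, void, no-argument method as an entry point can be sketched with plain reflection. The OnStart annotation, the Greeter class and the dispatcher below are hypothetical stand-ins written for this illustration. They are not Spring YARN's implementation, which may apply different rules.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class StartMethodDispatch {

    /** Hypothetical marker annotation; NOT Spring YARN's @OnYarnContainerStart. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface OnStart {}

    /** A plain POJO with one annotated entry point and one ordinary method. */
    static class Greeter {
        final List<String> log = new ArrayList<>();

        @OnStart
        public void greet() {
            log.add("Hello from Greeter");
        }

        public void notAnEntryPoint() {
            log.add("never called by the dispatcher");
        }
    }

    /** Invokes every public, void, no-argument method annotated with @OnStart; returns how many ran. */
    static int invokeStartMethods(Object target) {
        int invoked = 0;
        // getMethods() only returns public methods, including inherited ones.
        for (Method m : target.getClass().getMethods()) {
            boolean eligible = m.isAnnotationPresent(OnStart.class)
                    && m.getReturnType() == void.class
                    && m.getParameterCount() == 0;
            if (!eligible) {
                continue;
            }
            try {
                m.invoke(target);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Failed to invoke " + m, e);
            }
            invoked++;
        }
        return invoked;
    }

    public static void main(String[] args) {
        Greeter greeter = new Greeter();
        int count = invokeStartMethods(greeter);
        System.out.println(count + " method(s) invoked: " + greeter.log);
    }
}
```

Because the dispatcher discovers the method by annotation rather than by interface, the POJO stays free of framework base classes, which is the property the text above highlights.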
To demonstrate that this class has some real functionality, we simply use Spring Hadoop's FsShell to list the entries in the root of the HDFS file system. For this we need access to Hadoop's Configuration, which is prepared for you so that you can just autowire it.
Here you create a ClientApplication class.
@EnableAutoConfiguration
public class ClientApplication {

    public static void main(String[] args) {
        SpringApplication.run(ClientApplication.class, args)
            .getBean(YarnClient.class)
            .submitApplication();
    }

}
@EnableAutoConfiguration tells Spring Boot to start adding beans based on classpath settings, other beans, and various property settings. The main() method uses Spring Boot's SpringApplication.run() method to launch the application. From there we simply request a bean of type YarnClient and execute its submitApplication() method. What happens next depends on the application configuration, which we go through later in this guide.
Here you create an AppmasterApplication class.
@EnableAutoConfiguration
public class AppmasterApplication {

    public static void main(String[] args) {
        SpringApplication.run(AppmasterApplication.class, args);
    }

}
The application class for YarnAppmaster looks even simpler than what we just did for ClientApplication. Again, the main() method uses Spring Boot's SpringApplication.run() method to launch the application.
One might ask: if this type of dummy class does little more than fire up your application, could we just use a generic class for it? The simple answer is yes; we even have a generic SpringYarnBootApplication class just for this purpose. You'd define it as the main class of your executable jar, and you'd do so in the Gradle build.
In real life, however, you will most likely need to add more custom functionality to your application component, and you'd do that by adding more beans. To do that you need to define a Spring @Configuration or @ComponentScan. AppmasterApplication would then act as your main starting point for defining more custom functionality.
Create a new YAML configuration file.
spring:
    yarn:
        appName: yarn-boot-simple
        applicationDir: /app/yarn-boot-simple/
        fsUri: hdfs://localhost:8020
        rmAddress: localhost:8032
        schedulerAddress: localhost:8030
        client:
            appmasterFile: yarn-boot-simple-appmaster-0.1.0.jar
            files:
              - "file:build/libs/yarn-boot-simple-container-0.1.0.jar"
              - "file:build/libs/yarn-boot-simple-appmaster-0.1.0.jar"
        appmaster:
            containerCount: 1
            containerFile: yarn-boot-simple-container-0.1.0.jar
The final part of your application is its runtime configuration, which glues all the components together into what can be called a Spring YARN application. This configuration acts as a source for Spring Boot's @ConfigurationProperties and contains the relevant configuration properties that cannot be auto-discovered or that an end user needs to be able to override.
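It can help to picture how a nested YAML document maps onto flat, dotted property names such as spring.yarn.appName, with list items addressed by index. The sketch below is plain Java, not Spring Boot's own loader. FlattenKeys and its sample() data are hypothetical and reproduce only part of the YAML file above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FlattenKeys {

    /** Flattens nested maps and lists into dotted keys, e.g. a.b.c and a.list[0]. */
    static Map<String, Object> flatten(Map<String, Object> nested) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flattenInto(flat, "", nested);
        return flat;
    }

    private static void flattenInto(Map<String, Object> flat, String prefix, Object value) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
                flattenInto(flat, key, e.getValue());
            }
        } else if (value instanceof List) {
            List<?> list = (List<?>) value;
            for (int i = 0; i < list.size(); i++) {
                flattenInto(flat, prefix + "[" + i + "]", list.get(i));
            }
        } else {
            flat.put(prefix, value);
        }
    }

    /** A hand-built subset of the sample configuration, standing in for a parsed YAML file. */
    static Map<String, Object> sample() {
        Map<String, Object> client = new LinkedHashMap<>();
        client.put("files", Arrays.asList(
                "file:build/libs/yarn-boot-simple-container-0.1.0.jar",
                "file:build/libs/yarn-boot-simple-appmaster-0.1.0.jar"));
        Map<String, Object> appmaster = new LinkedHashMap<>();
        appmaster.put("containerCount", 1);
        Map<String, Object> yarn = new LinkedHashMap<>();
        yarn.put("appName", "yarn-boot-simple");
        yarn.put("client", client);
        yarn.put("appmaster", appmaster);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("spring", Collections.singletonMap("yarn", yarn));
        return root;
    }

    public static void main(String[] args) {
        flatten(sample()).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The flat names are the ones you would use when setting the same values from another property source.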
You can then write your own defaults for your own environment. Because these @ConfigurationProperties are resolved at runtime by Spring Boot, you also have an easy way to override these properties, either with command-line options or by providing additional configuration property files.
The sample code used in this blog can be found in our spring-hadoop-samples repo on GitHub.
Once you check out our samples, issue a Gradle build command from the boot/yarn-boot-simple directory.
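The override behaviour can be pictured as layering: defaults first, then anything given as a --key=value argument on top. The sketch below is a simplified illustration in plain Java, not Spring Boot's actual resolution logic, which supports more sources and a richer precedence order. PropertyOverrides is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PropertyOverrides {

    /** Overlays --key=value arguments on top of the defaults; arguments win. */
    static Map<String, String> resolve(Map<String, String> defaults, String... args) {
        Map<String, String> resolved = new LinkedHashMap<>(defaults);
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Accept only arguments of the form --key=value with a non-empty key.
            if (arg.startsWith("--") && eq > 2) {
                resolved.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("spring.yarn.fsUri", "hdfs://localhost:8020");
        defaults.put("spring.yarn.appmaster.containerCount", "1");
        resolve(defaults, args).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running the sketch with --spring.yarn.appmaster.containerCount=4 as an argument replaces the default of 1 while leaving spring.yarn.fsUri untouched.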
$ cd boot/yarn-boot-simple
$ ./gradlew clean build
For this sample we wanted to keep the project structure simple. We don't go through the Gradle build file in this blog, but the short story is that we create three different jar files from one project. In reality, one would probably use a multi-project model where each sub-project creates its own jar file.
Now that you've successfully compiled and packaged your application, it's time to do the fun part and execute it on Hadoop YARN.
The listing below shows the files after a successful Gradle build.
$ ls -lt build/libs/
-rw-r--r-- 1 hadoop hadoop 35975001 Feb 2 17:39 yarn-boot-simple-container-0.1.0.jar
-rw-r--r-- 1 hadoop hadoop 35973937 Feb 2 17:39 yarn-boot-simple-client-0.1.0.jar
-rw-r--r-- 1 hadoop hadoop 35973840 Feb 2 17:39 yarn-boot-simple-appmaster-0.1.0.jar
Simply run your executable client jar.
$ java -jar build/libs/yarn-boot-simple-client-0.1.0.jar
Using the Resource Manager UI you can see the status of the application.
To find Hadoop's application logs, do a little find within the configured userlogs directory.
$ find hadoop/logs/userlogs/|grep std
hadoop/logs/userlogs/application_1391348442831_0001/container_1391348442831_0001_01_000002/Container.stdout
hadoop/logs/userlogs/application_1391348442831_0001/container_1391348442831_0001_01_000002/Container.stderr
hadoop/logs/userlogs/application_1391348442831_0001/container_1391348442831_0001_01_000001/Appmaster.stdout
hadoop/logs/userlogs/application_1391348442831_0001/container_1391348442831_0001_01_000001/Appmaster.stderr
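If you would rather locate these files from Java, for example in an integration test, the same search can be sketched with java.nio. This is a hypothetical helper, not part of Spring YARN. It matches only file names ending in .stdout or .stderr, which is narrower than grep std, and it assumes the same relative hadoop/logs/userlogs directory as the command above unless another path is passed as the first argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FindContainerLogs {

    /** Returns all regular files under root whose name ends with .stdout or .stderr, sorted by path. */
    static List<Path> findStdLogs(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString();
                        return name.endsWith(".stdout") || name.endsWith(".stderr");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            // Thrown, for example, when the root directory does not exist.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : "hadoop/logs/userlogs");
        findStdLogs(root).forEach(System.out::println);
    }
}
```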
Grep the output logged by the HelloPojo class.
$ grep HelloPojo hadoop/logs/userlogs/application_1391348442831_0001/container_1391348442831_0001_01_000002/Container.stdout
[2014-02-02 17:40:38,314] boot - 11944 INFO [main] --- HelloPojo: Hello from HelloPojo
[2014-02-02 17:40:38,315] boot - 11944 INFO [main] --- HelloPojo: About to list from hdfs root content
[2014-02-02 17:40:41,134] boot - 11944 INFO [main] --- HelloPojo: FileStatus{path=hdfs://localhost:8020/; isDirectory=true; modification_time=1390823919636; access_time=0; owner=root; group=supergroup; permission=rwxr-xr-x; isSymlink=false}
[2014-02-02 17:40:41,135] boot - 11944 INFO [main] --- HelloPojo: FileStatus{path=hdfs://localhost:8020/app; isDirectory=true; modification_time=1391203430490; access_time=0; owner=jvalkealahti; group=supergroup; permission=rwxr-xr-x; isSymlink=false}
Congratulations! You've just developed a Spring YARN application!