Highlights of Spring for Apache Hadoop 1.0.0 M2
I am happy to announce that the second milestone (1.0.0.M2) of the Spring for Apache Hadoop project is available. In this blog post, I would like to quickly highlight the major new features in M2.
HBase DAO support
One of the most versatile and powerful features in Spring Framework is the Data Access Object (or DAO) support. With Spring for Apache Hadoop 1.0.0 M2, the same functionality has been added for HBase. Users of the popular template and callback pattern should feel right at home: the framework handles the table lookup, resource cleanup and exception conversion, letting the developer focus on the actual data access logic. See the API and reference docs for more information. We have also included a new sample in the distribution, hbase-crud, to help you get started right away.
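For readers new to the template and callback pattern, here is a minimal, dependency-free sketch of the idea. It is not the Spring for Apache Hadoop API: `TableTemplate`, `TableCallback`, `FakeTable` and `DataAccessFailure` are hypothetical names invented for illustration, and the in-memory table stands in for a real HBase table. Consult the API and reference docs for the actual classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TemplateCallbackSketch {

    /** Hypothetical stand-in for a table handle; a real one would wrap an HBase table. */
    static class FakeTable implements AutoCloseable {
        final String name;
        final Map<String, String> rows = new HashMap<>();
        boolean closed = false;

        FakeTable(String name) {
            this.name = name;
        }

        String get(String rowKey) throws Exception {
            if (!rows.containsKey(rowKey)) {
                // Checked, low-level failure that the template will convert.
                throw new Exception("no such row: " + rowKey);
            }
            return rows.get(rowKey);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** The callback: the only code the developer has to write. */
    interface TableCallback<T> {
        T doInTable(FakeTable table) throws Exception;
    }

    /** Hypothetical unchecked exception that hides the low-level failure type. */
    static class DataAccessFailure extends RuntimeException {
        DataAccessFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** The template: owns table lookup, resource cleanup and exception conversion. */
    static class TableTemplate {
        final List<FakeTable> opened = new ArrayList<>(); // kept only so the demo can inspect cleanup

        <T> T execute(String tableName, TableCallback<T> action) {
            FakeTable table = new FakeTable(tableName); // "table lookup"
            table.rows.put("row1", "value1");           // demo data
            opened.add(table);
            try {
                return action.doInTable(table);
            } catch (Exception e) {
                // Exception conversion: checked exception becomes an unchecked one.
                throw new DataAccessFailure("access to table '" + tableName + "' failed", e);
            } finally {
                table.close();                          // resource cleanup, always runs
            }
        }
    }

    public static void main(String[] args) {
        TableTemplate template = new TableTemplate();
        String value = template.execute("users", table -> table.get("row1"));
        System.out.println(value);
    }
}
```

The caller supplies only the lambda. Opening the table, closing it and translating failures live in one place, so they cannot be forgotten at individual call sites.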
Cascading Taps
In M2, we have expanded the integration with the Cascading library by adding Taps for Spring Framework and Spring Integration resources. The richness of Spring Integration adapters (whether inbound or outbound) such as File, TCP, Twitter, FTP and RSS (just to name a few) is now available to Cascading (and its extensions such as Cascalog or Scalding). And we are just getting started - expect more news on this front.
Hadoop Security
With M2, moving from a vanilla Hadoop install (such as a dev machine) to a fully Kerberos-secured Hadoop cluster is transparent. The File-System, Map/Reduce and Pig components are all security-aware, executing under proper credentials and supporting user impersonation. See the dedicated chapter for more information.
Enhanced vanilla Map/Reduce support
Since the beginning, Spring for Apache Hadoop has offered extensive support for Map/Reduce jobs, whether vanilla (traditional Java Map/Reduce), streaming or tooling. In M2, we have added support for Hadoop generic options across the board, making job provisioning, either by naming resources individually or through pattern matching, a one-liner.
Furthermore, we have enhanced the bootstrapping of jar-based jobs: rather than requiring the classes to be on the classpath, the job can be fully loaded, in isolation, from the jar. The classes (and their dependencies) do not leak into the application, which avoids all sorts of versioning conflicts and dependency creep. The tool declaration has been improved to automatically read the Jar metadata and its Main-Class, offering a powerful, fully managed replacement for Hadoop shell jar invocations.
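To illustrate what "reading the Jar metadata and its Main-Class" involves, here is a small sketch using only the JDK's `java.util.jar` classes. It is not the framework's implementation, which this post does not describe. The class name `JarMainClassSketch`, the helper `writeDemoJar` and the main class `com.example.WordCount` are made up for the demo.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class JarMainClassSketch {

    /** Returns the Main-Class declared in the jar's manifest, or null if there is none. */
    static String readMainClass(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) {
                return null; // jar without a manifest
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    /** Demo helper: writes a jar containing only a manifest that names the given main class. */
    static Path writeDemoJar(String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the main attributes are not written out.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);

        Path jar = Files.createTempFile("demo-job", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
            jarOut.flush(); // no entries besides the manifest are needed for this demo
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        Path jar = writeDemoJar("com.example.WordCount"); // hypothetical job class
        System.out.println(readMainClass(jar));
        Files.delete(jar);
    }
}
```

Once the entry point is known, one plausible way to get the isolation described above is to load the class through a dedicated `java.net.URLClassLoader` pointed at the jar, so that the job's classes stay out of the application's own class loader.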
Two New Samples
Last but not least, two new samples have been added to the distribution: hbase-crud, mentioned above, which showcases the declarative and programmatic HBase support, and pig-scripting, which demos JVM and Pig scripting, with the former doing data preparation in HDFS for the latter, which does the data analysis. There are more samples in the pipeline, and if you would like to see anything in particular, tell us.
I hope you enjoy this new milestone.
Go ahead, grab 1.0.0 M2, take it for a spin and let us know what you think!
Other News: Project Serengeti
As far as new releases go, Spring for Apache Hadoop 1.0.0 M2 is not the only news on the Hadoop front. Today, VMware takes the curtains off project Serengeti, for virtualized and highly available Hadoop. See Richard McDougall's blog post on the motivations behind it, the current status and the roadmap.