<![CDATA[PACE Big Data Workshop]]>

About this workshop:

This workshop is sponsored by the NSF's XSEDE (Extreme Science and Engineering Discovery Environment, https://www.xsede.org/) program. Staff members from the Texas Advanced Computing Center (TACC, https://www.tacc.utexas.edu/) will teach the workshop. The workshop is organized as four separate sessions covering various topics in Big Data analysis. Although participants are strongly encouraged to attend all sessions, the workshop is designed so that participants may attend only selected sessions based on their background, schedule, and needs.

About Instructors:

Ruizhu Huang is a research associate in the data intensive computing group at TACC. He has years of experience in big data analytics, machine learning, and data visualization. He has been involved in various projects developing technologies that bridge the gap between traditional machine learning approaches and next-generation, data intensive computing methods involving High-Performance Computing (HPC) resources.

Amit Gupta is a Research Engineering/Scientist Associate III in the Data Mining and Statistics group at TACC. His research interests are in distributed systems and tools that enable scaling of Big Data applications on HPC infrastructure, parallel programming, and information retrieval systems for text. He has extensive experience with applications ranging from scaling transportation simulations to text mining of biological literature. He earned an MS in Computer Science from the University of Colorado at Boulder, with thesis research in the area of operating systems.

Dr. Weijia Xu is a research scientist and manager of the Data Mining and Statistics group at TACC. He received his Ph.D. in Computer Science from The University of Texas at Austin. Dr. Xu has over 50 peer-reviewed conference and journal publications in similarity-based data retrieval, data analysis, and information visualization with data from various scientific domains. He has served on program committees for several workshops and conferences in the areas of big data and high-performance computing.

Part One: Introduction to Hadoop and Spark [register here]

Time: Sept 28 08:30am-12:30pm

Location: Marcus Nano Rm 1116

Capacity: 30 people

This session introduces Hadoop and Spark clusters to beginners. Topics include:

basic concepts used in MapReduce programming model
major components of a Hadoop cluster
how to get started with Hadoop on your own computer and with computing resources at TACC
the Spark programming model and how Spark can work with a Hadoop cluster
different ways to use Hadoop and Spark for analysis

Participants do not need any particular programming background, but a working knowledge of the Linux operating system is preferred. The class includes 3 hours of lecture and 1 hour of hands-on practice.

A no-show fee of $25.00 applies if you miss the session without cancelling at least 5 days before the class.

Part Two: Developing a scalable application with Spark [register here]

Time: Sept 28 1:30pm-5:30pm

Location: Marcus Nano Rm 1116

Capacity: 30 people

This session focuses on how to develop a scalable application with the Spark programming model. Topics include:

a review of the Spark programming model
a basic introduction to the Scala programming language
how to run a Spark application
key features for making an application scalable
how to get started developing with Spark after the class

Participants are expected to have prior knowledge of the concepts of Hadoop and Spark clusters; knowledge of any programming language is preferred but not required. The class includes 3 hours of lecture and 1 hour of hands-on practice.

A no-show fee of $25.00 applies if you miss the session without cancelling at least 5 days before the class.

Part Three: Common Practices on Hadoop and Spark Ecosystem [register here]

Time: Sept 29 08:30am-12:30pm

Location: Marcus Nano Rm 1116

Capacity: 30 people

This session focuses on general practices for practical analysis problems. Topics include:

running batch jobs with different cluster deployment modes
running interactive jobs
exploring existing libraries and applications, including Hadoop Streaming, MLlib, Spark SQL, and GraphX
using Hadoop/Spark with R and Python

Participants should be comfortable with coding and have basic knowledge of the Hadoop system and the concepts of parallelism. The class includes 3 hours of lecture and 1 hour of hands-on practice.

A no-show fee of $25.00 applies if you miss the session without cancelling at least 5 days before the class.

Part Four: Advanced Topic on Big Data Analysis [register here]

Time: Sept 29 01:30pm-03:30pm

Location: Marcus Nano Rm 1116

Capacity: 30 people

This session covers algorithms in more detail and provides hands-on consultation for GT researchers' applications. We will collect use cases before the session and walk through selected ones in detail to demonstrate how to solve real-world problems more efficiently.

This workshop is provided by Texas Advanced Computing Center (TACC) researchers. Its aim is to introduce the Big Data toolset to GT researchers, help them identify how their research problems map to the Big Data world, and find solutions to the problems at hand. There are four sessions, and researchers can choose one or more to attend based on their programming level and experience.

Fang (Cherry) Liu (Ph.D.)

fang.liu at gatech.edu
