{"489091":{"#nid":"489091","#data":{"type":"event","title":"Accelerating Advanced Analytics - Arun Kumar","body":[{"value":"\u003Cp class=\u0022p1\u0022\u003ETitle:\u0026nbsp;Accelerating Advanced Analytics\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003E\u003Cbr \/\u003E\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EAbstract:\u003C\/p\u003E\u003Cp class=\u0022p2\u0022\u003EAdvanced analytics -- the analysis of large and complex data with machine\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Elearning (ML) -- is becoming ubiquitous, with a growing demand for\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eadvanced analytics tools in the enterprise domains. However, there exist\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eseveral challenging bottlenecks in the end-to-end process of building and\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Edeploying advanced analytics applications. My research focuses on\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eabstractions, algorithms, and systems to mitigate such bottlenecks and\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eaccelerate advanced analytics from a data management standpoint.\u003C\/p\u003E\u003Cp class=\u0022p2\u0022\u003E\u0026nbsp;\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EIn this talk, I will focus on my work on mitigating one such pervasive\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ebottleneck in the process of feature engineering for ML -- joins of\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Emultiple tables. Many real-world datasets are multi-table, connected by\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ekey-foreign key relationships, but almost all ML toolkits expect\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Esingle-table inputs. This forces data scientists to join all tables and\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ematerialize a single table that collects all features. 
Alas, such joins\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eoften cause the output to blow up in size, which slows down ML, increases\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ecosts, and leads to data maintenance headaches. In my work, I show how it\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eis possible to mitigate these issues by \u0022avoiding joins physically,\u0022\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ei.e., pushing ML down through joins. This reduces runtime without\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eaffecting accuracy. Going further, I apply statistical learning theory to\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eshow how it is often possible to also \u0022avoid joins logically,\u0022 i.e.,\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eignore entire tables outright without losing much accuracy, but achieving\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Esignificant runtime gains.\u003C\/p\u003E\u003Cp class=\u0022p2\u0022\u003E\u0026nbsp;\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EBio:\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EArun Kumar is a Ph.D. candidate at the University of Wisconsin-Madison.\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EHis primary research interests are in data management and its\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eintersection with machine learning. He is co-advised by Jeffrey Naughton\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eand Jignesh M. Patel, and has also worked closely with Christopher Re and\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EXiaojin Zhu. Systems and ideas from his research have been shipped in\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Eproducts by EMC, Oracle, Cloudera, and IBM. A paper co-authored by him\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ewas accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EAnthony C. Klug NCR Fellowship in database systems in 2015. 
He received\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003Ehis M.S. from UW-Madison in 2011 and his B.Tech. from IIT Madras in 2009.\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003E\u003Cbr \/\u003E\u003C\/p\u003E\u003Cp class=\u0022p1\u0022\u003EWebpage:\u003C\/p\u003E\u003Cp class=\u0022p3\u0022\u003E\u003Ca href=\u0022http:\/\/pages.cs.wisc.edu\/~arun\/\u0022\u003Ehttp:\/\/pages.cs.wisc.edu\/~arun\/\u003C\/a\u003E\u003C\/p\u003E","summary":null,"format":"limited_html"}],"field_subtitle":"","field_summary":"","field_summary_sentence":[{"value":"Accelerating Advanced Analytics - Arun Kumar"}],"uid":"28150","created_gmt":"2016-01-21 15:38:23","changed_gmt":"2017-04-13 21:16:58","author":"Birney Robert","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2016-02-02T10:00:00-05:00","event_time_end":"2016-02-02T11:00:00-05:00","event_time_end_last":"2016-02-02T11:00:00-05:00","gmt_time_start":"2016-02-02 15:00:00","gmt_time_end":"2016-02-02 16:00:00","gmt_time_end_last":"2016-02-02 16:00:00","rrule":null,"timezone":"America\/New_York"},"extras":[],"hg_media":{"489201":{"id":"489201","type":"image","title":"Arun Kumar","body":null,"created":"1453435200","gmt_created":"2016-01-22 04:00:00","changed":"1475895245","gmt_changed":"2016-10-08 02:54:05","alt":"Arun Kumar","file":{"fid":"204398","name":"facecrop.jpg","image_path":"\/sites\/default\/files\/images\/facecrop.jpg","image_full_path":"http:\/\/www.tlwarc.hg.gatech.edu\/\/sites\/default\/files\/images\/facecrop.jpg","mime":"image\/jpeg","size":1316653,"path_740":"http:\/\/www.tlwarc.hg.gatech.edu\/sites\/default\/files\/styles\/740xx_scale\/public\/images\/facecrop.jpg?itok=Q7zH3Fie"}}},"media_ids":["489201"],"groups":[{"id":"47223","name":"College of Computing"},{"id":"50875","name":"School of Computer 
Science"}],"categories":[],"keywords":[{"id":"654","name":"College of Computing"},{"id":"109","name":"Georgia Tech"},{"id":"166941","name":"School of Computer Science"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1795","name":"Seminar\/Lecture\/Colloquium"}],"invited_audience":[{"id":"78751","name":"Undergraduate students"},{"id":"78761","name":"Faculty\/Staff"},{"id":"78771","name":"Public"},{"id":"174045","name":"Graduate students"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[{"value":"\u003Cp\u003ESusie McClain\u003C\/p\u003E\u003Cp\u003E\u003Ca href=\u0022mailto:smcclain@cc.gatech.edu\u0022\u003Esmcclain@cc.gatech.edu\u003C\/a\u003E\u003C\/p\u003E","format":"limited_html"}],"email":[],"slides":[],"orientation":[],"userdata":""}}}