{"613636":{"#nid":"613636","#data":{"type":"news","title":"School of Computer Science Professor Works with Microsoft Research to Make Data Transformation Easier","body":[{"value":"\u003Cp\u003EThe growing field of self-service data transformation took a big step forward with the \u003Ca href=\u0022https:\/\/appsource.microsoft.com\/en-us\/product\/office\/WA104380727?src=office\u0026amp;corrid=8b212240-d845-415c-a651-f8a07215acf9\u0026amp;omexanonuid=bf0ab165-67fe-478f-9d82-b70d15b147e2\u0022\u003ETransform-Data-by-Example (TDE) service\u003C\/a\u003E. TDE works as a search engine for data transformation libraries, alleviating the difficulty of data wrangling.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EThe project was at an early stage of conception at Microsoft Research when School of Computer Science Assistant Professor \u003Cstrong\u003E\u003Ca href=\u0022https:\/\/www.cc.gatech.edu\/~xchu33\/\u0022\u003EXu Chu\u003C\/a\u003E\u003C\/strong\u003E joined and helped contribute to its success.\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026ldquo;Data transformation is a big part of data cleaning, which is very time consuming and expensive,\u0026rdquo; Chu said.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EWhen data comes from different sources or is manually entered, it\u0026rsquo;s often inconsistent and challenging to work with until it\u0026rsquo;s prepared. Data preparation involves cleaning, standardizing, and transforming raw data sets so they can be analyzed effectively. Data scientists can spend up to 80 percent of their time just transforming data.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDevelopers have created custom code libraries for tasks, such as name parsing and address standardization, that data scientists might need for transformation. Yet these libraries are only useful if the data scientist can find them. 
Finding them hasn\u0026rsquo;t always been easy \u0026ndash; until now.\u003C\/p\u003E\r\n\r\n\u003Cp\u003ETDE indexes thousands of functions from GitHub and Stack Overflow, so users only need to provide their desired output for a few input examples to find the transformation program they need. Currently, TDE has a 72 percent accuracy rate for synthesizing correct transformation programs.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EThe front-end of TDE is a Microsoft Excel plug-in that users can download from Office. Once the user provides a few input\/output examples, TDE connects with the back-end on Microsoft Azure\u0026rsquo;s cloud to search thousands of functions and synthesize programs using relevant functions that will work for the user. This leverages techniques from the program synthesis field.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026ldquo;This is a great example of how technologies from non-database domains can help with hard data management problems such as data cleaning,\u0026rdquo; Chu said.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EHe believes this type of research has a lot of potential. For example, Chu is working on a project that applies matrix and tensor factorization techniques from statistics and machine learning to data cleaning.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EThe work on TDE was presented at the \u003Ca href=\u0022http:\/\/vldb2018.lncc.br\/\u0022\u003EVery Large Data Bases\u003C\/a\u003E conference in Rio de Janeiro in late August. 
Chu coauthored the paper \u003Ca href=\u0022https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2018\/06\/p1142-He.pdf\u0022\u003E\u003Cem\u003ETransform Data by Example (TDE): An Extensible Search Engine for Data Transformations\u003C\/em\u003E\u003C\/a\u003E with Microsoft Research\u0026rsquo;s \u003Cstrong\u003ESurajit Chaudhuri\u003C\/strong\u003E, \u003Cstrong\u003EKris Ganjam\u003C\/strong\u003E, \u003Cstrong\u003EYeye He\u003C\/strong\u003E, and \u003Cstrong\u003EVivek Narasayya\u003C\/strong\u003E, and Twitter\u0026rsquo;s \u003Cstrong\u003EYudian Zheng\u003C\/strong\u003E. Earlier, a demo of the work was presented at SIGMOD 2018 in Houston.\u003C\/p\u003E\r\n","summary":null,"format":"limited_html"}],"field_subtitle":"","field_summary":"","field_summary_sentence":[{"value":"TDE works as a search engine for data transformation libraries, alleviating the difficulty of data wrangling."}],"uid":"34541","created_gmt":"2018-11-01 17:57:31","changed_gmt":"2018-11-01 18:40:16","author":"Tess Malone","boilerplate_text":"","field_publication":"","field_article_url":"","dateline":{"date":"2018-11-01T00:00:00-04:00","iso_date":"2018-11-01T00:00:00-04:00","tz":"America\/New_York"},"extras":[],"hg_media":{"613660":{"id":"613660","type":"image","title":"Magnifying Glass","body":null,"created":"1541097311","gmt_created":"2018-11-01 18:35:11","changed":"1541097311","gmt_changed":"2018-11-01 18:35:11","alt":"Magnifying Glass","file":{"fid":"233599","name":"2561885967_f5f0be5834_b-1.jpg","image_path":"\/sites\/default\/files\/images\/2561885967_f5f0be5834_b-1.jpg","image_full_path":"http:\/\/www.tlwarc.hg.gatech.edu\/\/sites\/default\/files\/images\/2561885967_f5f0be5834_b-1.jpg","mime":"image\/jpeg","size":712117,"path_740":"http:\/\/www.tlwarc.hg.gatech.edu\/sites\/default\/files\/styles\/740xx_scale\/public\/images\/2561885967_f5f0be5834_b-1.jpg?itok=c0eucxtz"}}},"media_ids":["613660"],"groups":[{"id":"47223","name":"College of 
Computing"},{"id":"50875","name":"School of Computer Science"}],"categories":[],"keywords":[],"core_research_areas":[{"id":"39431","name":"Data Engineering and Science"}],"news_room_topics":[],"event_categories":[],"invited_audience":[],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[{"value":"\u003Cp\u003ETess Malone, Communications Officer\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Ca href=\u0022mailto:tess.malone@cc.gatech.edu\u0022\u003Etess.malone@cc.gatech.edu\u003C\/a\u003E\u003C\/p\u003E\r\n","format":"limited_html"}],"email":["tess.malone@cc.gatech.edu"],"slides":[],"orientation":[],"userdata":""}}}