The Hadoop++ Project
Jens Dittrich
Fachrichtung Informatik - Saarbruecken
10 Dec 2010, 11:00 am - 12:00 pm
Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206
SWS Colloquium
MapReduce is a computing paradigm that has gained a lot of attention in recent years from both industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. We propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all. To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs (user-defined functions) only and affect Hadoop from the inside. This has three important consequences. First, Hadoop++ significantly outperforms Hadoop. Second, any future changes to Hadoop can be used directly with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing. In this talk I will present results from a VLDB 2010 paper as well as more recent work.
Link: http://infosys.cs.uni-saarland.de/hadoop++.php