Dr. Madhura Purnaprajna currently serves as Associate Professor at the Department of Computer Science, School of Engineering, Bengaluru.

Before joining Amrita, Madhura Purnaprajna was a post-doctoral fellow with an International Research Fellowship from the German Research Foundation (Deutsche Forschungsgemeinschaft), at the Processor Architecture Lab, EPFL, Switzerland, and the High Performance Computing Lab, IISc, Bangalore.

Her research interests are in Reconfigurable Computing and Processor Architectures. She received her PhD in Electrical Engineering from the Heinz Nixdorf Institute, University of Paderborn, Germany. She has a Master's degree from the University of Alberta, Canada. Before that, she spent about 4 years in the Indian semiconductor industry.

Research Projects

Towards Next-generation Adaptable Computing
Unifying the design of Heterogeneous Systems



December 2009 Ph. D. University of Paderborn, Germany
January 2005 M. S. University of Alberta, Canada
September 1998 B. E. Bangalore University


Publication Type: Journal Article
Year of Publication Publication Type Title
2015 Journal Article M. Owaida, Falcao, G., Andrade, J., Antonopoulos, C., Bellas, N., Purnaprajna, M., Novo, D., Karakonstantis, G., Burg, A., and Ienne, P., “Enhancing design space exploration by extending CPU/GPU specifications onto FPGAs”, ACM Transactions on Embedded Computing Systems (TECS), vol. 14, p. 33, 2015.

The design cycle for complex special purpose compute systems is extremely costly and time-consuming. It involves a multi-parametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte-Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as Low-Density Parity-Check (LDPC) codes, adopted by modern communication standards, which involves thousands of Monte-Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g. 8000 bit) to large length (e.g. 64800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, providing different acceleration factors over conventional multicore CPUs.

2012 Journal Article M. Purnaprajna and Ienne, P., “Making wide-issue VLIW processors viable on FPGAs”, ACM Transactions on Architecture and Code Optimization (TACO), vol. 8, p. 33, 2012.

Soft and highly-customized processors are emerging as a common way to efficiently control the large amount of computing resources available on FPGAs. Yet, some processor architectures of choice for DSP and media applications, such as wide-issue VLIW processors, remain impractical: the multi-ported register file makes a very inefficient use of the resources in the FPGA fabric. This paper proposes modifications to existing FPGAs to make soft-VLIW processors viable. We introduce an embedded multi-ported RAM that can be customized to match the issue-width of VLIW processors. To ascertain the benefits of this approach, we map an extensible VLIW processor onto a standard FPGA from Xilinx. For the register file implemented in the modified FPGA, the area occupied is at least 102× smaller and the dynamic power is reduced by 41% as compared to the implementation using configurable logic blocks in existing standard FPGAs. A subset of this embedded multi-ported RAM can also be used for mapping the register file in soft-RISC processors. For the soft-RISC processor, the register file in the modified FPGA is at least 22× smaller than its equivalent that uses configurable logic blocks and 1.5× the size in comparison to the implementation using block RAMs. Reduction of routing area and the maximum net length is about 39% and 51% respectively for RISC processors. As a result, this approach works towards enhancing soft-processor density in FPGAs by orders of magnitude.
2010 Journal Article M. Purnaprajna, Porrmann, M., Rueckert, U., Hussmann, M., Thies, M., and Kastens, U., “Runtime Reconfiguration of Multiprocessors Based on Compile-Time Analysis”, ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 3, p. 17, 2010.

In multiprocessors, performance improvement is typically achieved by exploring parallelism with fixed granularities, such as instruction-level, task-level, or data-level parallelism. We introduce a new reconfiguration mechanism that facilitates variations in these granularities in order to optimize resource utilization in addition to performance improvements. Our reconfigurable multiprocessor QuadroCore combines the advantages of reconfigurability and parallel processing. In this article, a unified hardware-software approach for the design of our QuadroCore is presented. This design flow is enabled via compiler-driven reconfiguration which matches application-specific characteristics to a fixed set of architectural variations. A special reconfiguration mechanism has been developed that alters the architecture within a single clock cycle.

The QuadroCore has been implemented on Xilinx XC2V6000 for functional validation and on UMC’s 90nm standard cell technology for performance estimation. A diverse set of applications has been mapped onto the reconfigurable multiprocessor to meet orthogonal performance characteristics in terms of time and power. Speedup measurements show a 2–11 times performance increase in comparison to a single processor. Additionally, the reconfiguration scheme has been applied to save power in data-parallel applications. Gate-level simulations have been performed to measure the power-performance trade-offs for two computationally complex applications. The power reports confirm that introducing this scheme of reconfiguration results in power savings in the range of 15–24%.

2009 Journal Article M. Purnaprajna, Porrmann, M., and Rueckert, U., “Run-time reconfigurability in embedded multiprocessors”, ACM SIGARCH Computer Architecture News, vol. 37, pp. 30–37, 2009.

To meet application-specific performance demands, architectures are predominantly redesigned and customised. Every architectural change results in huge overheads in design, verification, and fabrication, which together result in prolonged time-to-market. As an alternative, configurable architectures provide easy adaptability to different application domains in place of costly redesigns. To deal with application changes and custom requirements, a method of configuring and reusing the basic building blocks within processors is developed. Additionally, this enables co-operative multiprocessing. In this paper, a runtime reconfiguration mechanism for embedded multiprocessor architectures is proposed as a method to introduce customisations in the post-fabrication phase. A method of application description in conjunction with a flexible reconfigurable multiprocessor template is presented. Finally, the costs and benefits of this approach are analysed for computationally intensive algorithms used in digital signal processing. The impact of application specific characteristics on execution time, power consumption, and total energy dissipation are analysed.

2009 Journal Article M. Purnaprajna, Pohl, C., Porrmann, M., and Rueckert, U., “Using Run-time Reconfiguration for Energy Savings in Parallel Data Processing”, Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms, ERSA'09, 2009.

Parallelism and adaptability are two distinct architectural design considerations in embedded processors. Multicore processors accelerate application execution on account of their inherent parallelism, and run-time reconfiguration capabilities add adaptability during in-field deployment. To benefit from both these features, a reconfigurable multiprocessor architecture, QuadroCore, has been developed. A novel reconfiguration mechanism has been incorporated that provides fast run-time adaptability in a 4-processor cluster. In this paper, this scheme of reconfiguration has been used to save energy when using QuadroCore for data-parallel applications. As a proof of concept, a data-intensive neural network application called Self-organising Maps has been implemented on QuadroCore. Via reconfiguration, energy reduction of up to 30% has been observed for an implementation in UMC’s 90nm standard cell technology.

2007 Journal Article M. Purnaprajna, Reformat, M., and Pedrycz, W., “Genetic algorithms for hardware–software partitioning and optimal resource allocation”, Journal of Systems Architecture, vol. 53, pp. 339–354, 2007.

A scheme for time and power efficient embedded system design, using hardware and software components, is presented. Our objective is to reduce the execution time and the power consumed by the system, leading to the simultaneous multi-objective minimization of time and power. The goal of suitably partitioning the system into hardware and software components is achieved using Genetic Algorithms (GA). Multiple tests were conducted to confirm the consistency of the results obtained and the versatile nature of the objective functions. An enhanced resource constrained scheduling algorithm is used to determine the system performance. To emulate the characteristics of practical systems, the influence of inter-processor communication is examined. The suitability of introducing a reconfigurable hardware resource over pre-configured hardware is explored for the same objectives. The distinct difference in the task to resource mapping with the variation in design objective is studied. Further, the procedure to allocate optimal number of resources based on the design objective is proposed. The implementation is constrained for power and time individually, with GA being used to arrive at the resource count to suit the objective. The results obtained are compared by varying the time and power constraints. The test environment is developed using randomly generated task graphs. Exhaustive sets of tests are performed on the set design objectives to validate the proposed solution.

Publication Type: Conference Paper
Year of Publication Publication Type Title
2013 Conference Paper H. Parandeh-Afshar, Zgheib, G., Novo, D., Purnaprajna, M., and Ienne, P., “Shadow And-Inverter Cones”, in 23rd International Conference on Field Programmable Logic and Applications (FPL), 2013, Porto, 2013.

Despite their many advantages, FPGAs still come with significant overheads in area, delay, and power consumption due to an extreme programmability in both the routing and logic. From the performance perspective, large logic blocks, capable of covering big portions of circuits, lead to fewer hops in the routing network, and thus, to a shorter critical path. Recent work has shown that And-Inverter Cones (AICs) can considerably reduce the number of logic block levels compared to Look-Up Tables (LUTs), in a radically altered FPGA architecture. In this paper, we use AICs as shadow logic for LUTs, which incurs minimal architectural changes with respect to current FPGAs, while exploiting the benefits of both AICs and LUTs. We also propose changes in the AIC architecture, for a more compact technology mapping. The new architecture reduces the average circuit delay by up to 35% with respect to standard FPGAs at the expense of a 3x increase in the number of the logic clusters. Other benchmarks show more moderate area overheads, e.g., 16% delay improvement for 20% area overhead.

2013 Conference Paper M. Purnaprajna and Ienne, P., “A Case for Heterogeneous Technology-Mapping: Soft versus Hard Multiplexers”, in IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2013, Seattle, WA, 2013.

Lookup table-based FPGAs offer flexibility but compromise on performance, as compared to custom CMOS implementations. This paper explores the idea of minimising this performance gap by using fixed, fine-grained, nonprogrammable logic structures in place of lookup tables (LUTs). Functions previously mapped onto LUTs can now be diverted to these structures, resulting in reduced LUT usage and higher operating speed. This paper presents a generic heterogeneous technology-mapping scheme for segregating LUTs and hard logic blocks. For the proof-of-concept, we choose to isolate multiplexers present in most general-purpose circuits. These multiplexers are mapped onto hard blocks of multiplexers that are present in existing commercial FPGA fabrics, but often unused. Since the hard multiplexers are already present, there is no additional performance or area penalty. Using this approach, an average reduction in LUT usage of 16% and an average speedup of 8% have been observed for the VTR benchmarks as compared to the LUTs-only implementation.

2013 Conference Paper H. Parandeh-Afshar, Zgheib, G., Novo, D., Purnaprajna, M., and Ienne, P., “Shadow AICs: Reaping the benefits of And-Inverter Cones with minimal architectural impact”, in Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, New York, NY, USA, 2013.

Despite their many advantages, FPGAs are still inefficient. This inefficiency is mainly due to programmable routing networks; however, FPGA logic blocks also have their share of contribution. From the performance perspective, fewer hops in the routing network translate to a shorter critical path; and that requires large logic blocks capable of covering big portions of circuits. Recent work has shown that And-Inverter Cones (AICs) can considerably reduce the number of logic block levels compared to Look-Up Tables (LUTs). The best performance is achieved when both AICs and LUTs are used, but the AIC implementation requires radical changes in the FPGA architecture. In this paper, we use AICs as shadow logic for LUTs in LUT-clusters, which requires minimal architectural changes while exploiting the benefits of both AICs and LUTs. The basic idea is to reuse the input crossbar of LUT-clusters for the shadow AICs while combining both LUTs and AICs in the same cluster. We also propose changes in the AIC architecture to enhance mapping on AICs. Our experimental results indicate that the new cluster architecture can reduce the average circuit delay by 12% with respect to standard FPGA clusters. However, this performance gain comes at a price of 43% area overhead in terms of number of logic clusters. Our results show that for a modest 6% increase in area, FPGA manufacturers can move towards next-generation FPGA logic elements. This transition would provide faster design options without major architectural changes.

2009 Conference Paper M. Porrmann, Purnaprajna, M., and Puttmann, C., “Self-optimization of MPSoCs targeting resource efficiency and fault tolerance”, in NASA/ESA Conference on Adaptive Hardware and Systems, 2009. AHS 2009, San Francisco, CA, 2009.

A dynamically reconfigurable on-chip multiprocessor architecture is presented, which can be adapted to changing application demands and to faults detected at run-time. The scalable architecture comprises lightweight embedded RISC processors that are interconnected by a hierarchical network-on-chip (NoC). Reconfigurability is integrated into the processors as well as into the NoC with minimal area and performance overhead. Adaptability of the architecture relies on a self-optimizing reconfiguration of the MPSoC at run-time. The resource-efficiency of the proposed architecture is analyzed based on FPGA and ASIC prototypes.

2008 Conference Paper M. Purnaprajna, Puttmann, C., and Porrmann, M., “Power aware reconfigurable multiprocessor for elliptic curve cryptography”, in Design, Automation and Test in Europe, 2008. DATE'08, Munich, 2008.

Reconfigurable architectures are being increasingly used for their flexibility and extensive parallelism to achieve accelerations for computationally intensive applications. Although these architectures provide easy adaptability, it comes with an overhead in terms of area, power and timing, as compared to non-reconfigurable ASICs. Here, we propose a low overhead reconfigurable multiprocessor, which provides both parallelism and flexibility. The architecture has been evaluated for its energy efficiency for a computationally intensive algorithm used in elliptic curve cryptography (ECC). Typically, algorithms in ECC exhibit task-level parallelism and demand a large amount of computational resources for custom implementations to achieve a significant speedup. A finite field multiplication in GF(2^233) was chosen as a sample application to evaluate the performance on the QuadroCore reconfigurable multiprocessor architecture. A three-fold performance improvement as compared to a single processor implementation was observed. Further, via reconfiguration to suit the application, power savings of about 24% were noted in UMC's 90 nm standard cell technology.

2008 Conference Paper M. Purnaprajna and Porrmann, M., “Run-time Reconfigurable Cluster of Processors”, in Proceedings of 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), Workshop on Design, Architecture and Simulation of Chip Multi-Processors, IEEE Computer Society, Lake Como, Italy, 2008.

High performance requirements often necessitate redesigns for every new application, resulting in long time-to-market. Every architectural change involves costs in terms of hardware design, verification and fabrication. As an alternative, architectural flexibility provides easy adaptability to different application domains in order to avoid the high cost of redesigns. Hence, a method of reusing the basic building blocks within processors to enable co-operative multiprocessing is proposed. Runtime reconfiguration is used as a method for application-specific customisation. Here, a method of application description in conjunction with a flexible multiprocessor template is proposed. Finally, the costs and benefits of this approach are analysed for a computationally intensive algorithm in terms of execution time and power consumption. The impact of variations in application-specific characteristics on the proposed architecture are also analysed.

2007 Conference Paper M. Hußmann, Thies, M., Kastens, U., Purnaprajna, M., Porrmann, M., and Rückert, U., “Compiler-driven reconfiguration of multiprocessors”, in Proceedings of the Workshop on Application Specific Processors (WASP) 2007, Salzburg, Austria, 2007.

Multiprocessors enable parallel execution of a single large application to achieve a performance improvement. An application is split at instruction, data or task level (based on the granularity), such that the overhead of partitioning is minimal. Parallelization for multiprocessors is mostly restricted to a fixed granularity. Reconfiguration enables architectural variations to allow multiple granularities of operation within a multiprocessor. This adaptability optimizes resource utilization over a fixed organization. Here, a unified hardware-software approach to design a reconfigurable multiprocessor system called QuadroCore is presented. In our holistic methodology, compiler-driven reconfiguration selects from a fixed set of modes. Each mode relies on matching program analysis to exploit the architecture efficiently. For instance, a multiprocessor may adapt to different parallelization paradigms. The compiler can determine the best execution mode for each piece of code by analyzing the parallelism in a program. A fast, single-cycle, run-time reconfiguration between these predetermined modes is enabled by executing special instructions which switch coarse-grained components like instruction decoders, ALUs and register banks. Performance is evaluated in terms of execution cycles and achieved clock frequency. First results indicate suitability especially in audio and video processing applications.

2004 Conference Paper E. Fung, Leung, K., Parimi, N., Purnaprajna, M., and Gaudet, V. C., “ASIC implementation of a high speed WGNG for communication channel emulation [white Gaussian noise generator]”, in IEEE Workshop on Signal Processing Systems, 2004. SIPS 2004, 2004.

A design for a white Gaussian noise generator (WGNG) is modified and implemented as a 0.18-μm CMOS digital ASIC for high-speed communication channel emulation. The original design was implemented using an FPGA. The goal of the work presented is to enhance the performance of the WGNG in order to achieve emulation of high-speed communication standards unattainable by the FPGA implementation. This is accomplished by pipelining the original design and implementing it using an ASIC. A layout is generated, based on a standard digital design flow provided by the Canadian Microelectronics Corporation (CMC). This implementation achieves an output rate of 182 Msamples/sec, which exceeds the speed of the original FPGA implementation by more than seven times.

Publication Type: Thesis
Year of Publication Publication Type Title
2010 Thesis M. Purnaprajna, “Run-time reconfigurable multiprocessors”, University of Paderborn, 2010.

The advantage in multiprocessors is the performance speedup obtained with processor-level parallelism. Similarly, the flexibility for application-specific adaptability is the advantage in reconfigurable architectures. To benefit from both these architectures, we present a reconfigurable multiprocessor template that combines parallelism in multiprocessors and flexibility in reconfigurable architectures. A fast, single cycle, resource efficient, run-time reconfiguration scheme accelerates customizations in the reconfigurable multiprocessor template. Based on this methodology, a four core multiprocessor called QuadroCore has been implemented on UMC's 90nm standard cells and on Xilinx's FPGA. QuadroCore is customisable and adapts to variations in granularity of parallelism, the amount of communication between tasks, and the frequency of synchronization. To validate the advantages of this approach, a diverse set of applications has been mapped onto the QuadroCore multiprocessor. Experimental results show speedups in the range of 3 to 11 in comparison to a single processor. In addition, energy savings up to 30% were noted on account of reconfiguration. Furthermore, to steer application mapping based on power considerations, an instruction-level power model has been developed. Using this model, power-driven instruction selection introduces energy savings of up to 70% in the QuadroCore multiprocessor.

2006 Thesis M. Purnaprajna, “Evolutionary Optimization Techniques and Reconfigurable Hardware”, University of Alberta, 2006.

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science, Dept. of Electrical and Computer Engineering, University of Alberta. Thesis (M.Sc.)--University of Alberta, 2005. Includes bibliographical references.
