Parallel Object Query System for Expensive Computations (POQSEC)
This work was funded by the Swedish Research Council.
Project description
Exceptionally large amounts of distributed data and computational resources will be available through the Grid. Many modern applications in fields such as engineering, bioinformatics, neuroscience, music, and space physics require scalable data access. They also need to represent not only traditional tabular databases but also other kinds of data, such as numerical data structures. Very demanding and memory-intensive computations need to be done over these large amounts of data. Another important issue is that the user should be able to transparently utilize the resources required for an analysis without having to manually partition data access and computations.
The goal of this project is to achieve high
performance for scientific queries utilizing the operational Grid
infrastructure NorduGrid.
A data manager and customizable query processor is being developed that
allows transparent and efficient execution of database queries
utilizing NorduGrid. The executions can access data from storage
elements and wrapped external systems on the Grid. The system
will support customizable data representations, allow
user-defined long-running distributed computations in queries, and
access conventional relational databases. It will run
application-specific code on both local data and data distributed
through the Grid.
NorduGrid is a distributed, peer-oriented Grid middleware system that
does not rely on a central broker. Computer clusters accessed through
NorduGrid impose certain restrictions on resource allocation,
communication, and process management, and the POQSEC architecture
is designed to cope with them.
The POQSEC data manager and query processor scales up by utilizing the Grid to transparently and dynamically incorporate new nodes and clusters for the combined processing of data and computations as the database and application demands grow. Conventional databases and file-based Grid storage elements are used as back-ends for data repositories. Extensible and object-oriented query processing and rewrite techniques are used to efficiently combine distributed data and computations in this environment.
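The idea of transparently spreading a computation over several nodes and combining the partial results can be illustrated with a minimal Python sketch. It is not POQSEC code: the function names (`partition`, `expensive_count`, `run_partitioned`) are hypothetical, a local thread pool stands in for Grid nodes, and the per-partition computation is a trivial placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def expensive_count(chunk):
    """Hypothetical stand-in for a long-running per-partition computation."""
    return sum(1 for x in chunk if x % 3 == 0)


def run_partitioned(items, n_workers=4):
    """Evaluate the computation on each partition in parallel and merge.

    Assumption: a local thread pool plays the role of the Grid nodes.
    """
    chunks = partition(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(expensive_count, chunks))
    # Partial counts are combined with a simple sum.
    return sum(partials)
```

The caller only invokes `run_partitioned(data)`; how the data is split and where each part is processed stays hidden, which is the kind of transparency the project aims for at Grid scale.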
The POQSEC prototype being developed uses as test cases data and queries from Particle Physics, where large amounts of data describing particle events are produced by proton-proton collisions. The queries involve regular data comparisons and aggregation operators along with user-defined filter operations expressed in terms of C++-based computational libraries, e.g. the ROOT library. A single such analysis of a single dataset of 1 million events often takes more than 1 hour to execute on a single machine. Thus, the processing needs to scale up to cover all distributed data produced by the LHC.
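The shape of such a query, a user-defined filter combined with ordinary comparisons and an aggregate, might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the `Particle` and `Event` types, the `passes_cut` filter, and its thresholds are hypothetical, and the real test cases use AmosQL with C++ libraries such as ROOT rather than Python.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Particle:
    kind: str   # hypothetical label, e.g. "muon", "electron", "jet"
    pt: float   # transverse momentum in arbitrary units


@dataclass
class Event:
    event_id: int
    particles: list


def passes_cut(event, min_pt=20.0, min_leptons=2):
    """Hypothetical user-defined filter: enough leptons above a pt threshold."""
    leptons = [
        p for p in event.particles
        if p.kind in ("muon", "electron") and p.pt > min_pt
    ]
    return len(leptons) >= min_leptons


def query(events):
    """Select events passing the filter, then aggregate over the selection.

    Returns the number of selected events and the mean of the highest
    particle pt per selected event (None if nothing was selected).
    """
    selected = [e for e in events if passes_cut(e)]
    if not selected:
        return 0, None
    return len(selected), mean(max(p.pt for p in e.particles) for e in selected)
```

In a real analysis the filter is the expensive part, which is why evaluating it close to the data and in parallel matters more than the aggregation step.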
POQSEC utilizes the AMOS II database management system, which provides object-relational DBMS functionality, peer-to-peer communication, the declarative query language AmosQL, and interfaces to C++ and Java. The AMOS II kernel is being extended in order to implement the POQSEC architecture.
Resources
We use various computational resources to test and evaluate our system prototypes. Most of them are provided by the Swedish National Infrastructure for Computing (SNIC), namely the Swegrid, HPC2N, and UPPMAX resources. We also use other resources available through NorduGrid, mostly located in Sweden, Denmark, Finland, and Norway.
Publications
- R.Fomkin and T.Risch: Cost-based Optimization of Complex Scientific Queries, Accepted for publication at 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), Banff, Canada, July 9-11, 2007.
- R.Fomkin and T.Risch: Framework for Querying Distributed Objects Managed by a Grid Infrastructure, 1st International Workshop on Data Management in Grids (DMG'05), Trondheim, Norway, September 2-3, 2005.
- R.Fomkin and T.Risch: Managing Long Running Queries in Grid Environment, 1st Intl. Workshop on GRID Computing and its Applications to Data Analysis (GADA'04), Larnaca, Cyprus, Oct. 2004, in R. Meersman et al. (Eds.): OTM Workshops 2004, LNCS 3292, pp. 99–110, 2004.
People
Tore Risch is responsible for this project. It is the basis for the PhD work of Ruslan Fomkin.
Acknowledgments
We would like to thank Christian Hansson and Professor Tord Ekelöf from the Department of Radiation Science, Uppsala University, for providing the Particle Physics analysis application and data as system test cases. The secure communication is implemented by Mehran Ahsant from the Center for Parallel Computers, Royal Institute of Technology, Stockholm. The support from the NorduGrid project and from Åke Sandgren (HPC2N) and Tore Sundqvist (UPPMAX) is significant for the system implementation and testing.
Last update: 24/03/2005. Responsible: Tore Risch
Copyright © 2005 Uppsala University, Department of Information Technology