Parallel Database design, query processing

In this post, we are going to analyze the concepts of parallel database design, query processing, and a case study.
  • Parallel database design
    • Hardware Architecture
    • Data partitioning
  • Query processing

Hardware Architecture

  • Shared Memory
  • Shared Disk
  • Shared Nothing

Data Partitioning

  • Partitioning a relation involves distributing its tuples over several disks
  • Three Kinds –
    1. Range Partitioning
    2. Round-robin Partitioning
    3. Hash Partitioning

Range Partitioning is ideal for point and range queries on the partitioning attribute

Hash partitioning

  • Ideal for point queries based on the partitioning attribute
  • Ideal for sequential scans of the entire relation
  • Not ideal for point queries on non-partitioning attributes
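
A minimal sketch of the three strategies, assuming a relation stored as a Python list of tuples and a fixed number of disks; all names here are illustrative, not part of any real system:

```python
# Illustrative sketch of the three partitioning strategies; a relation
# is modeled as a list of tuples and each "disk" as a list.

def round_robin_partition(relation, n_disks):
    """Tuple i goes to disk i mod n: even spread, good for full scans."""
    disks = [[] for _ in range(n_disks)]
    for i, tup in enumerate(relation):
        disks[i % n_disks].append(tup)
    return disks

def hash_partition(relation, n_disks, key):
    """The hash of the partitioning attribute picks the disk:
    good for point queries on that attribute."""
    disks = [[] for _ in range(n_disks)]
    for tup in relation:
        disks[hash(tup[key]) % n_disks].append(tup)
    return disks

def range_partition(relation, boundaries, key):
    """A boundary vector splits the attribute domain into contiguous
    ranges: good for both point and range queries on that attribute."""
    disks = [[] for _ in range(len(boundaries) + 1)]
    for tup in relation:
        d = sum(1 for b in boundaries if tup[key] >= b)
        disks[d].append(tup)
    return disks

emp = [(1, 15000), (2, 42000), (3, 88000), (4, 23000)]
print(range_partition(emp, boundaries=[20000, 50000], key=1))
# [[(1, 15000)], [(2, 42000), (4, 23000)], [(3, 88000)]]
```
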
Parallel query processing = automatic translation of a query into an efficient execution plan + its parallel execution
  1. The translation must be correct: the execution of the plan produces the expected result
  2. The execution plan must be optimal: it minimizes a cost function that captures resource consumption
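
A toy illustration of the optimality criterion: the candidate plans, resource estimates, and weights below are invented, but the optimizer's job reduces to minimizing the weighted cost over them:

```python
# Toy cost model: the optimizer picks the candidate plan minimizing a
# weighted cost function over resource consumption. Plans, estimates,
# and weights are invented for illustration.
candidate_plans = {
    "join(R, S) then filter":     {"io": 1200, "cpu": 300, "network": 80},
    "filter(R) then join with S": {"io": 400,  "cpu": 150, "network": 80},
}

def cost(plan, w_io=1.0, w_cpu=0.2, w_net=0.5):
    return w_io * plan["io"] + w_cpu * plan["cpu"] + w_net * plan["network"]

best = min(candidate_plans, key=lambda name: cost(candidate_plans[name]))
print(best)  # "filter(R) then join with S" -- pushing the filter down wins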

Block diagram for parallel query processing

JOQR: Join Ordering and Query Rewrite
  • Translation - Translation of the relational algebra expression to a query tree
  • Optimization - Reordering of join operations in the query tree and choosing among different join algorithms to minimize the cost of the execution
  • Parallelization - Transforming the query tree to a physical operator tree and loading the plan to the processors
  • Execution - Running the concurrent transactions
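
As a small illustration of the optimization phase, here is a classic greedy heuristic: order joins so the smallest relations (and, one hopes, the smallest intermediate results) come first. Relation names and sizes are invented:

```python
# Greedy left-deep join ordering: join the smallest relations first to
# keep intermediate results small. Sizes are invented for illustration.
relation_sizes = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

def join_order(relations):
    return sorted(relations, key=lambda r: relation_sizes[r])

print(join_order(["orders", "customers", "nations"]))
# ['nations', 'customers', 'orders']
```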

Parallel execution control strategies

  1. Control flow
    • A single control node controls all processes related to a query
    • It starts all processes and synchronizes them
    • Scheduling is performed by the control node
  2. Data flow
    • No centralized control
    • Processes on different nodes trigger each other with data messages
    • Data-driven execution: if enough input is available, the process starts automatically
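
A minimal sketch of data-driven execution, with two operator processes connected by a queue: the downstream filter does its work as soon as tuples arrive, with no central control node involved. All names are illustrative:

```python
# Data-flow control in miniature: the producer and consumer trigger
# each other with data messages through a queue; no control node
# starts or synchronizes them beyond the initial spawn.
import threading
import queue

pipe = queue.Queue()

def scan(rows):
    """Upstream operator: emits tuples, then an end-of-stream marker."""
    for r in rows:
        pipe.put(r)
    pipe.put(None)

def filter_op(predicate, out):
    """Downstream operator: runs as soon as enough input is available."""
    while (r := pipe.get()) is not None:
        if predicate(r):
            out.append(r)

result = []
consumer = threading.Thread(target=filter_op, args=(lambda r: r > 4, result))
producer = threading.Thread(target=scan, args=([1, 5, 10, 20],))
consumer.start()
producer.start()
producer.join()
consumer.join()
print(result)  # [5, 10, 20]
```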

Data flow

  • Pipelined parallelism: computation of one operator proceeds in parallel with another
  • Partitioned parallelism: operators are replicated for each data source, and the replicas execute in parallel
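
Generators give the shape of pipelined parallelism: each operator consumes tuples as the previous one produces them, never waiting for the full input. In a real engine each stage would run on its own processor; here everything is single-threaded and purely illustrative:

```python
# The shape of a pipelined plan: select consumes scan's output tuple by
# tuple, and project consumes select's. Data and schema are invented.

def scan(rows):
    for r in rows:
        yield r

def select(predicate, rows):
    for r in rows:
        if predicate(r):
            yield r

def project(cols, rows):
    for r in rows:
        yield tuple(r[c] for c in cols)

plan = project((0,), select(lambda r: r[1] > 10, scan([(1, 5), (2, 15), (3, 30)])))
print(list(plan))  # [(2,), (3,)]
```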

Advantages of data flow control

  • reduces the control messages transferred through the network
  • reduces the workload of a particular node (the control node)
  • provides pipelining naturally

Disadvantages of data flow control

  • it means more asynchronous activity
  • more competition for a processor
  • higher protocol complexity
  • providing data allocation information to all nodes is difficult

Interquery Parallelism

  • In interquery parallelism, different queries or transactions execute in parallel with one another
  • Transaction throughput can be increased by this form of parallelism
  • Interquery parallelism is the easiest form of parallelism to support in a database system, particularly in a shared-memory parallel system
  • Supporting interquery parallelism is more complicated in a shared-disk or shared-nothing architecture
  • Processors have to perform some tasks, such as locking and logging, in a coordinated fashion, and that requires that they pass messages to each other
  • The database system must ensure that the processor has the latest version of the data in its buffer pool
  • The problem of ensuring that the version is the latest is known as the cache-coherency problem
  • A cache-coherency protocol for a shared-disk system is:
    1. Before any read or write access to a page, a transaction locks the page in shared or exclusive mode, as appropriate
    2. Immediately after the transaction obtains either a shared or exclusive lock on a page, it also reads the most recent copy of the page from the shared disk
    3. Before a transaction releases an exclusive lock on a page, it flushes the page to the shared disk, and only then releases the lock
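
A runnable sketch of this protocol, with the simplification that all locks are exclusive; a real lock manager distinguishes shared from exclusive mode, and all names here are illustrative:

```python
# Shared-disk cache coherency in miniature: lock the page, re-read the
# latest copy from the shared disk, and flush before unlocking, so the
# next node to lock the page always sees the newest version.
import threading
from collections import defaultdict

shared_disk = {"p1": {"balance": 100}}    # the disk every node can see
page_locks = defaultdict(threading.Lock)  # global lock table (exclusive only)

class Node:
    def __init__(self):
        self.buffer_pool = {}  # this node's cached pages

    def update_page(self, page_id, field, fn):
        with page_locks[page_id]:                 # 1. lock before any access
            page = dict(shared_disk[page_id])     # 2. read the latest copy
            page[field] = fn(page[field])         #    apply the change locally
            self.buffer_pool[page_id] = page
            shared_disk[page_id] = dict(page)     # 3. flush before releasing

a, b = Node(), Node()
a.update_page("p1", "balance", lambda v: v + 50)  # 100 -> 150
b.update_page("p1", "balance", lambda v: v + 50)  # sees 150 -> 200
print(shared_disk["p1"])  # {'balance': 200}: no lost update
```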

Intraquery Parallelism

  • Intraquery parallelism refers to the execution of a single query in parallel on multiple processors and disks
  • It is important for speeding up long-running queries; interquery parallelism does not help here, since each individual query still runs sequentially
  • The execution of a single query can be parallelized in two ways:
    1. Intraoperation parallelism: We can speed up the processing of a query by parallelizing the execution of each operation, such as sort, select, project, and join
    2. Interoperation parallelism: We can speed up the processing of a query by executing in parallel the different operations in a query expression
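
A minimal sketch of intraoperation parallelism applied to sort: range-partition the input, sort each partition on a separate worker process, and concatenate the already-ordered ranges. The data and boundaries are invented:

```python
# Intraoperation parallelism applied to sort: range-partition the input,
# sort the partitions in parallel, then concatenate (the partitions are
# disjoint, ordered ranges, so no merge step is needed).
from multiprocessing import Pool

def parallel_sort(data, boundaries):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for x in data:  # range-partition on the sort key
        parts[sum(1 for b in boundaries if x >= b)].append(x)
    with Pool(len(parts)) as pool:
        sorted_parts = pool.map(sorted, parts)  # one worker per partition
    return [x for part in sorted_parts for x in part]

if __name__ == "__main__":
    print(parallel_sort([42, 7, 19, 88, 3, 55], boundaries=[20, 60]))
    # [3, 7, 19, 42, 55, 88]
```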

Global architecture of database server

  • Queries issued from the user applications are compiled and optimized at the front-end machine to produce the executable code
  • Execution of a query is initiated by the coordinator program. The coordinator broadcasts the generated code to each PC node, where a server process for query execution is running
  • The generated code contains only parts specific to the corresponding query, such as the initialization and finalization code according to the execution plan, and evaluation code for predicates in the query
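
A toy version of this pattern, simulated with a process pool: the "generated code" is reduced to query-specific constants that the coordinator broadcasts to each server process, which evaluates them against its local partition. Everything here (data, names, node count) is invented:

```python
# Coordinator pattern in miniature: broadcast query-specific parameters
# to one server process per node; each scans only its local partition.
from concurrent.futures import ProcessPoolExecutor

LOCAL_PARTITIONS = {0: [3, 14, 25], 1: [8, 31, 2], 2: [40, 5, 19]}

def server_process(args):
    node_id, threshold = args  # the per-query "code" boiled down to constants
    return [x for x in LOCAL_PARTITIONS[node_id] if x > threshold]

def coordinator(threshold, n_nodes=3):
    with ProcessPoolExecutor(n_nodes) as pool:  # broadcast to every node
        results = pool.map(server_process,
                           [(i, threshold) for i in range(n_nodes)])
    return [x for part in results for x in part]

if __name__ == "__main__":
    print(coordinator(threshold=10))  # [14, 25, 31, 40, 19]
```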

Association rule mining

  • The most well known algorithm for association rule mining is the Apriori algorithm
  • One of parallel algorithms for mining association rules is HPA (Hash Partitioned Apriori)
  • HPA partitions the candidate itemsets among the processing nodes using a hash function as in the parallel hash join, which eliminates broadcasting of all the transaction data and can reduce the comparison workload significantly
  • Hence, HPA works much better than the naive parallelization for large scale data mining

The kth iteration of the HPA algorithm

  1. Generate the candidate itemsets: Each processing node generates new candidate itemsets from the large itemsets of the last, (k-1)th, iteration; each new candidate contains k items, while each of the previous large itemsets contains k-1 items. The node applies the hash function to each candidate to determine the destination node ID. If the candidate belongs to the processing node itself, it is inserted into the local hash table; otherwise it is discarded.
  2. Scan the transaction database and count the support: Each processing node reads the transaction database from its local disk and generates k-itemsets from each transaction, applying the same hash function used in phase 1 to each of them. Each k-itemset is sent to a processing node according to the hash value. For the itemsets received from the other nodes and those generated locally whose ID equals the node's own, the hash table is searched; on a hit, the support count is incremented.
  3. Determine the large itemsets: After reading all the transaction data, each processing node can individually determine whether each candidate k-itemset satisfies the user-specified minimum support. Each processing node sends its large k-itemsets to the coordinator, where all the large k-itemsets are gathered.
  4. Check the termination condition: If the set of large k-itemsets is empty, the algorithm terminates. Otherwise, the coordinator broadcasts the large k-itemsets to all the processing nodes and the algorithm enters the next iteration.
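
A condensed, single-process sketch of one HPA pass, with the nodes simulated as local hash tables. The key idea from the steps above is that the same hash function decides, everywhere, which node owns a candidate's counter, so transaction data never needs to be broadcast. All names and data are invented:

```python
# One HPA pass in miniature: candidates are hash-partitioned across the
# simulated nodes, each node scans only its local transactions, and
# generated k-itemsets are routed to whichever node owns their counter.
from itertools import combinations

def owner(itemset, n_nodes):
    return hash(itemset) % n_nodes  # same hash function on every node

def hpa_pass(transactions_per_node, candidates, min_support, n_nodes):
    # Step 1: each node inserts only the candidates hashed to it
    tables = [{c: 0 for c in candidates if owner(c, n_nodes) == i}
              for i in range(n_nodes)]
    k = len(next(iter(candidates)))
    # Step 2: scan local transactions, route each k-itemset to its owner
    for local_txns in transactions_per_node:
        for txn in local_txns:
            for items in combinations(sorted(txn), k):
                c = frozenset(items)
                if c in tables[owner(c, n_nodes)]:
                    tables[owner(c, n_nodes)][c] += 1
    # Steps 3-4: each node reports its large itemsets to the coordinator
    return {c for t in tables for c, count in t.items() if count >= min_support}

txns = [[{"a", "b", "c"}, {"a", "b"}],   # node 0's local transactions
        [{"a", "b", "d"}, {"b", "c"}]]   # node 1's local transactions
cands = {frozenset(p) for p in combinations("abcd", 2)}
print(hpa_pass(txns, cands, min_support=2, n_nodes=2))
# {frozenset({'a', 'b'}), frozenset({'b', 'c'})}
```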

The execution time of the HPA program (pass 2) was measured as the number of PCs was varied, with a maximum of 100 PCs used in this evaluation. Reasonably good speedup is achieved in this application as the number of PCs is increased.
