Parallel Database design, query processing

In this post, we are going to analyze the concepts of parallel database design, query processing, and a case study.
  • Parallel database design
    • Hardware Architecture
    • Data partitioning
  • Query processing

Hardware Architecture

  • Shared Memory
  • Shared Disk
  • Shared Nothing

Data Partitioning

  • Partitioning a relation involves distributing its tuples over several disks
  • Three Kinds –
    1. Range Partitioning
    2. Round-robin Partitioning
    3. Hash Partitioning

Range Partitioning is ideal for point and range queries on the partitioning attribute

Hash partitioning

  • Ideal for point queries based on the partitioning attribute
  • Ideal for sequential scans of the entire relation
  • Not ideal for point queries on non-partitioning attributes
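
A minimal sketch of the three strategies, assuming a relation stored as a Python list of tuples and a fixed number of disks; all names here are illustrative, not part of any real system:

```python
# Illustrative sketch of the three partitioning strategies; a relation
# is modeled as a list of tuples and each "disk" as a list.

def round_robin_partition(relation, n_disks):
    """Tuple i goes to disk i mod n: even spread, good for full scans."""
    disks = [[] for _ in range(n_disks)]
    for i, tup in enumerate(relation):
        disks[i % n_disks].append(tup)
    return disks

def hash_partition(relation, n_disks, key):
    """The hash of the partitioning attribute picks the disk:
    good for point queries on that attribute."""
    disks = [[] for _ in range(n_disks)]
    for tup in relation:
        disks[hash(tup[key]) % n_disks].append(tup)
    return disks

def range_partition(relation, boundaries, key):
    """A boundary vector splits the attribute domain into contiguous
    ranges: good for both point and range queries on that attribute."""
    disks = [[] for _ in range(len(boundaries) + 1)]
    for tup in relation:
        d = sum(1 for b in boundaries if tup[key] >= b)
        disks[d].append(tup)
    return disks

emp = [(1, 15000), (2, 42000), (3, 88000), (4, 23000)]
print(range_partition(emp, boundaries=[20000, 50000], key=1))
# [[(1, 15000)], [(2, 42000), (4, 23000)], [(3, 88000)]]
```
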
Parallel query processing = automatic translation of a query into an efficient execution plan + its parallel execution
  1. The translation must be correct: the execution of the plan produces the expected result
  2. The execution plan must be optimal: it minimizes a cost function that captures resource consumption
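
A toy illustration of the optimality criterion: the candidate plans, resource estimates, and weights below are invented, but the optimizer's job reduces to minimizing the weighted cost over them:

```python
# Toy cost model: the optimizer picks the candidate plan minimizing a
# weighted cost function over resource consumption. Plans, estimates,
# and weights are invented for illustration.
candidate_plans = {
    "join(R, S) then filter":     {"io": 1200, "cpu": 300, "network": 80},
    "filter(R) then join with S": {"io": 400,  "cpu": 150, "network": 80},
}

def cost(plan, w_io=1.0, w_cpu=0.2, w_net=0.5):
    return w_io * plan["io"] + w_cpu * plan["cpu"] + w_net * plan["network"]

best = min(candidate_plans, key=lambda name: cost(candidate_plans[name]))
print(best)  # "filter(R) then join with S" -- pushing the filter down wins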

Block diagram for parallel query processing

JOQR: Join Ordering and Query Rewrite
  • Translation - Translation of the relational algebra expression to a query tree
  • Optimization - Reordering of join operations in the query tree and choosing among different join algorithms to minimize the cost of the execution
  • Parallelization - Transforming the query tree to a physical operator tree and loading the plan to the processors
  • Execution - Running the concurrent transactions
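
As a small illustration of the optimization phase, here is a classic greedy heuristic: order joins so the smallest relations (and, one hopes, the smallest intermediate results) come first. Relation names and sizes are invented:

```python
# Greedy left-deep join ordering: join the smallest relations first to
# keep intermediate results small. Sizes are invented for illustration.
relation_sizes = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

def join_order(relations):
    return sorted(relations, key=lambda r: relation_sizes[r])

print(join_order(["orders", "customers", "nations"]))
# ['nations', 'customers', 'orders']
```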

Parallel execution control strategies

  1. Control flow
    • A single control node controls all processes related to a query
    • It starts all processes and synchronizes them
    • Scheduling is performed by the control node
  2. Data flow
    • No centralized control
    • Processes on different nodes trigger each other with data messages
    • Data-driven execution: if enough input is available, the process starts automatically
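
A minimal sketch of data-driven execution, with two operator processes connected by a queue: the downstream filter does its work as soon as tuples arrive, with no central control node involved. All names are illustrative:

```python
# Data-flow control in miniature: the producer and consumer trigger
# each other with data messages through a queue; no control node
# starts or synchronizes them beyond the initial spawn.
import threading
import queue

pipe = queue.Queue()

def scan(rows):
    """Upstream operator: emits tuples, then an end-of-stream marker."""
    for r in rows:
        pipe.put(r)
    pipe.put(None)

def filter_op(predicate, out):
    """Downstream operator: runs as soon as enough input is available."""
    while (r := pipe.get()) is not None:
        if predicate(r):
            out.append(r)

result = []
consumer = threading.Thread(target=filter_op, args=(lambda r: r > 4, result))
producer = threading.Thread(target=scan, args=([1, 5, 10, 20],))
consumer.start()
producer.start()
producer.join()
consumer.join()
print(result)  # [5, 10, 20]
```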

Data flow

  • Pipelined parallelism: computation of one operator proceeds in parallel with another
  • Partitioned parallelism: operators are replicated for each data source, and the replicas execute in parallel
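
Generators give the shape of pipelined parallelism: each operator consumes tuples as the previous one produces them, never waiting for the full input. In a real engine each stage would run on its own processor; here everything is single-threaded and purely illustrative:

```python
# The shape of a pipelined plan: select consumes scan's output tuple by
# tuple, and project consumes select's. Data and schema are invented.

def scan(rows):
    for r in rows:
        yield r

def select(predicate, rows):
    for r in rows:
        if predicate(r):
            yield r

def project(cols, rows):
    for r in rows:
        yield tuple(r[c] for c in cols)

plan = project((0,), select(lambda r: r[1] > 10, scan([(1, 5), (2, 15), (3, 30)])))
print(list(plan))  # [(2,), (3,)]
```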

Advantages of data flow control

  • reduces the control messages transferred through the network
  • reduces the workload of a particular node (the control node)
  • provides pipelining naturally

Disadvantages of data flow control

  • it means more asynchronous activity
  • more competition for a processor
  • higher protocol complexity
  • providing data allocation information to all nodes is difficult

Interquery Parallelism

  • In interquery parallelism, different queries or transactions execute in parallel with one another
  • Transaction throughput can be increased by this form of parallelism
  • Interquery parallelism is the easiest form of parallelism to support in a database system, particularly in a shared-memory parallel system
  • Supporting interquery parallelism is more complicated in a shared-disk or shared-nothing architecture
  • Processors have to perform some tasks, such as locking and logging, in a coordinated fashion, and that requires that they pass messages to each other
  • The database system must ensure that the processor has the latest version of the data in its buffer pool
  • The problem of ensuring that the version is the latest is known as the cache-coherency problem
  • A cache-coherency protocol for a shared-disk system is:
    1. Before any read or write access to a page, a transaction locks the page in shared or exclusive mode, as appropriate
    2. Immediately after the transaction obtains either a shared or exclusive lock on a page, it also reads the most recent copy of the page from the shared disk
    3. Before a transaction releases an exclusive lock on a page, it flushes the page to the shared disk, and only then releases the lock
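
A runnable sketch of this protocol, with the simplification that all locks are exclusive; a real lock manager distinguishes shared from exclusive mode, and all names here are illustrative:

```python
# Shared-disk cache coherency in miniature: lock the page, re-read the
# latest copy from the shared disk, and flush before unlocking, so the
# next node to lock the page always sees the newest version.
import threading
from collections import defaultdict

shared_disk = {"p1": {"balance": 100}}    # the disk every node can see
page_locks = defaultdict(threading.Lock)  # global lock table (exclusive only)

class Node:
    def __init__(self):
        self.buffer_pool = {}  # this node's cached pages

    def update_page(self, page_id, field, fn):
        with page_locks[page_id]:                 # 1. lock before any access
            page = dict(shared_disk[page_id])     # 2. read the latest copy
            page[field] = fn(page[field])         #    apply the change locally
            self.buffer_pool[page_id] = page
            shared_disk[page_id] = dict(page)     # 3. flush before releasing

a, b = Node(), Node()
a.update_page("p1", "balance", lambda v: v + 50)  # 100 -> 150
b.update_page("p1", "balance", lambda v: v + 50)  # sees 150 -> 200
print(shared_disk["p1"])  # {'balance': 200}: no lost update
```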

Intraquery Parallelism

  • Intraquery parallelism refers to the execution of a single query in parallel on multiple processors and disks
  • It is important for speeding up long-running queries; interquery parallelism does not help here, since each individual query still runs sequentially
  • The execution of a single query can be parallelized in two ways:
    1. Intraoperation parallelism: We can speed up the processing of a query by parallelizing the execution of each operation, such as sort, select, project, and join
    2. Interoperation parallelism: We can speed up the processing of a query by executing in parallel the different operations in a query expression
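
A minimal sketch of intraoperation parallelism applied to sort: range-partition the input, sort each partition on a separate worker process, and concatenate the already-ordered ranges. The data and boundaries are invented:

```python
# Intraoperation parallelism applied to sort: range-partition the input,
# sort the partitions in parallel, then concatenate (the partitions are
# disjoint, ordered ranges, so no merge step is needed).
from multiprocessing import Pool

def parallel_sort(data, boundaries):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for x in data:  # range-partition on the sort key
        parts[sum(1 for b in boundaries if x >= b)].append(x)
    with Pool(len(parts)) as pool:
        sorted_parts = pool.map(sorted, parts)  # one worker per partition
    return [x for part in sorted_parts for x in part]

if __name__ == "__main__":
    print(parallel_sort([42, 7, 19, 88, 3, 55], boundaries=[20, 60]))
    # [3, 7, 19, 42, 55, 88]
```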

Global architecture of database server

  • Queries issued from the user applications are compiled and optimized at the front-end machine to produce the executable code
  • Execution of a query is initiated by the coordinator program. The coordinator broadcasts the generated code to each PC node, where a server process for query execution is running
  • The generated code contains only parts specific to the corresponding query, such as the initialization and finalization code according to the execution plan, and evaluation code for predicates in the query
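
A toy version of this pattern, simulated with a process pool: the "generated code" is reduced to query-specific constants that the coordinator broadcasts to each server process, which evaluates them against its local partition. Everything here (data, names, node count) is invented:

```python
# Coordinator pattern in miniature: broadcast query-specific parameters
# to one server process per node; each scans only its local partition.
from concurrent.futures import ProcessPoolExecutor

LOCAL_PARTITIONS = {0: [3, 14, 25], 1: [8, 31, 2], 2: [40, 5, 19]}

def server_process(args):
    node_id, threshold = args  # the per-query "code" boiled down to constants
    return [x for x in LOCAL_PARTITIONS[node_id] if x > threshold]

def coordinator(threshold, n_nodes=3):
    with ProcessPoolExecutor(n_nodes) as pool:  # broadcast to every node
        results = pool.map(server_process,
                           [(i, threshold) for i in range(n_nodes)])
    return [x for part in results for x in part]

if __name__ == "__main__":
    print(coordinator(threshold=10))  # [14, 25, 31, 40, 19]
```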

Association rule mining

  • The most well known algorithm for association rule mining is the Apriori algorithm
  • One of parallel algorithms for mining association rules is HPA (Hash Partitioned Apriori)
  • HPA partitions the candidate itemsets among the processing nodes using a hash function as in the parallel hash join, which eliminates broadcasting of all the transaction data and can reduce the comparison workload significantly
  • Hence, HPA works much better than the naive parallelization for large scale data mining

The kth iteration of the HPA algorithm

  1. Generate the candidate itemsets: Each processing node generates new candidate itemsets from the large itemsets of the last, (k-1)th, iteration; each new candidate contains k items, while each of the previous large itemsets contains k-1 items. The node applies the hash function to each candidate to determine the destination node ID. If the candidate belongs to the processing node itself, it is inserted into the local hash table; otherwise it is discarded.
  2. Scan the transaction database and count the support: Each processing node reads the transaction database from its local disk and generates k-itemsets from each transaction, applying the same hash function used in phase 1 to each of them. Each k-itemset is sent to a processing node according to the hash value. For the itemsets received from the other nodes and those generated locally whose ID equals the node's own, the hash table is searched; on a hit, the support count is incremented.
  3. Determine the large itemsets: After reading all the transaction data, each processing node can individually determine whether each candidate k-itemset satisfies the user-specified minimum support. Each processing node sends its large k-itemsets to the coordinator, where all the large k-itemsets are gathered.
  4. Check the termination condition: If the set of large k-itemsets is empty, the algorithm terminates. Otherwise, the coordinator broadcasts the large k-itemsets to all the processing nodes and the algorithm enters the next iteration.
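
A condensed, single-process sketch of one HPA pass, with the nodes simulated as local hash tables. The key idea from the steps above is that the same hash function decides, everywhere, which node owns a candidate's counter, so transaction data never needs to be broadcast. All names and data are invented:

```python
# One HPA pass in miniature: candidates are hash-partitioned across the
# simulated nodes, each node scans only its local transactions, and
# generated k-itemsets are routed to whichever node owns their counter.
from itertools import combinations

def owner(itemset, n_nodes):
    return hash(itemset) % n_nodes  # same hash function on every node

def hpa_pass(transactions_per_node, candidates, min_support, n_nodes):
    # Step 1: each node inserts only the candidates hashed to it
    tables = [{c: 0 for c in candidates if owner(c, n_nodes) == i}
              for i in range(n_nodes)]
    k = len(next(iter(candidates)))
    # Step 2: scan local transactions, route each k-itemset to its owner
    for local_txns in transactions_per_node:
        for txn in local_txns:
            for items in combinations(sorted(txn), k):
                c = frozenset(items)
                if c in tables[owner(c, n_nodes)]:
                    tables[owner(c, n_nodes)][c] += 1
    # Steps 3-4: each node reports its large itemsets to the coordinator
    return {c for t in tables for c, count in t.items() if count >= min_support}

txns = [[{"a", "b", "c"}, {"a", "b"}],   # node 0's local transactions
        [{"a", "b", "d"}, {"b", "c"}]]   # node 1's local transactions
cands = {frozenset(p) for p in combinations("abcd", 2)}
print(hpa_pass(txns, cands, min_support=2, n_nodes=2))
# {frozenset({'a', 'b'}), frozenset({'b', 'c'})}
```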

The execution time of the HPA program (pass 2) was measured as the number of PCs was varied, with a maximum of 100 PCs used in this evaluation. Reasonably good speedup is achieved in this application as the number of PCs is increased.
