Scientific Workflow Applications (SWFAs) deliver collaborative tools that help researchers execute large and complex scientific processes. In particular, Scientific Workflow Scheduling (SWFS) accelerates computation by mapping the dependent workflow jobs onto the available computational resources according to the researchers' requirements. However, cost optimization remains a key SWFS challenge when handling massive and complicated tasks, since it requires determining an approximate (near-optimal) solution within polynomial computational time. Motivated by this, the current work proposes a novel SWFS cost optimization model to address this challenge. The proposed model contains three main stages: (i) scientific workflow application, (ii) targeted computational environment, and (iii) cost optimization criteria. The model has been used to optimize the completion time (makespan) and overall computational cost of SWFS in cloud computing for all scenarios considered in this research context. This ultimately reduces the cost for service consumers. At the same time, reducing the cost positively affects the profitability of service providers, who can utilize all computational resources to achieve a competitive advantage over other cloud service providers. To evaluate the effectiveness of the proposed model, an empirical comparison was conducted using three core types of heuristic approaches: Single-based (i.e., Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Invasive Weed Optimization (IWO)), Hybrid-based (i.e., the Hybrid Invasive Weed Optimization algorithm (HIWO)), and Hyper-based (i.e., the Dynamic Hyper-Heuristic Algorithm (DHHA)). Additionally, a simulation-based implementation was used for the SIPHT SWFA, considering three different dataset sizes. The proposed model provides an efficient platform for optimally scheduling workflow tasks by handling the data-intensiveness and computational-intensiveness of SWFAs.
The results reveal that the proposed cost optimization model attained an optimal job completion time (makespan) and total computational cost for the small and large sizes of the considered dataset. In contrast, the Hybrid-based and Hyper-based approaches consistently achieved better results for the medium-sized dataset.
Effective management of Scientific Workflow Scheduling (SWFS) processes in a cloud environment remains a challenging task when dealing with large and complex Scientific Workflow Applications (SWFAs). The SWFA is the first stage of the Scientific Workflow Scheduling (SWFS) process and requires users (i.e., scientists) to specify the nature of their data. In other words, the SWFA receives user preferences regarding the task execution order that is carried out in the computational environment stage. Users’ preferences include the precedence constraints of tasks, job completion time, and Total Computational Cost (TCC). A number of inputs are also required from users to successfully perform SWFS. These main inputs are inter-dependent tasks (e.g., programs) associated with their input data (e.g., images), along with the scripts, catalogs, and Application Program Interface (API) written using various programming languages to represent the dependencies of the submitted workflow tasks. The expected outcome of SWFS (from the users’ view) is the statistical and analytical data obtained from executing the workflow tasks. Thus, SWFAs present several advantages to users, such as simplifying the process for scientists to reuse similar workflows. Ultimately, SWFAs provide scientists with a supportive (i.e., easy-to-use) environment in which to track and virtually share obtained optimization results. The second major stage of the SWFS process is the Workflow Management System (WfMS) stage. Generally, IT staff manually execute the workflow tasks, which requires prior knowledge about two core elements: (i) available resources and (ii) estimated starting time of each workflow task [
To overcome the above-mentioned challenges, this paper aims to propose a cost optimization model for SWFS that mainly focuses on managing the execution processes of the given dependent workflow tasks (also referred to as precedence constraints) of SWFAs. The model then schedules the submitted tasks onto the targeted shared computational resources, that is, Virtual Machines (VMs). At the same time, this model addresses optimizing the job completion time and TCC. The model contains three main components: (i) scientific workflow application (type and size of SWFA), (ii) targeted computational environment (number of VMs), and (iii) cost optimization criteria. The model is beneficial in mapping and scheduling processes of workflow tasks by considering the scheduling stages along with completion time and total computational cost parameters. Generally speaking, to propose a cost-optimal solution for a given SWFS problem, the completion time (makespan) and total computational cost of workflow tasks need to be simultaneously minimized as much as possible.
Existing work on cost optimization for SWFS uses a number of approaches, including heuristic- and meta-heuristic-based ones. Yet, these contemporary SWFS approaches lack an empirical evaluation based primarily on the variants of heuristic approaches. Thus, to comprehensively develop and evaluate the performance of any cost optimization approach for SWFS in a cloud computing environment, three types of heuristic approaches (Single-based, Hybrid-based, and Hyper-based) have been considered in this research context. The evaluation of the proposed approach is conducted through the WorkflowSim simulator. The simulation-based environment easily defines different scenarios (e.g., the size of a given SWFA and the number of available VMs) to allow for a full investigation of the performance of the proposed model.
The major contributions of this research work are:
- A proposed novel cost optimization model that contains three main stages, (i) scientific workflow application, (ii) targeted computational environment, and (iii) cost optimization criteria, and that classifies the cost parameters into the categories of pre-scheduling, during scheduling, and post-scheduling.
- An empirical comparison based on the different types of heuristic approaches (Single-based, Hybrid-based, and Hyper-based) that considers different numbers of VMs and various sizes of SWFA.
- An extensive review and analysis of the existing approaches from several perspectives, including the types of existing experimental cloud environments, the number of computational resources, the types of computational resources, the types of SWFAs, and the average size of the considered SWFA datasets.
This paper is structured as follows: Section 2 reviews related work. Section 3 presents the proposed cost optimization model for SWFS and describes the empirical comparison based on the considered heuristic approaches. Section 4 then discusses the results and their evaluation using a simulation-based experimentation environment. Finally, the conclusion and future work are outlined in Section 5.
In order to assess the quality of the proposed model (i.e., cost optimization of SWFS in a cloud environment), relevant studies were reviewed to identify parameter values for each defined attribute and to construct a comparative table. The main identified attributes were the tool environment, environment type, number and type of computational resources, and size and type of SWFA tasks. Note that the Amazon instance specification is regarded as the standard [
Ref. | Title of the proposed approach | Name and type of approach | Type and number of resources | Type of SWFA | No. of tasks
---|---|---|---|---|---
[ | A market-oriented hierarchical scheduling strategy in cloud workflow systems | SwinDeW-C based on Hadoop—Real-world experiment (empirical study, Single-based heuristic) | t2.medium, t2.large (30) | Montage, CyberShake, Epigenomics, LIGO, and SIPHT | Small-medium
[ | Deadline-constrained workflow scheduling algorithms for Infrastructure as a Service clouds | Developed tool in Java—Simulator (Single-based heuristic) | t2.small, t2.medium, t2.large, S3 (9) | Montage, CyberShake, Epigenomics, LIGO, and SIPHT | Small-medium-large
[ | Compatibility of hybrid process scheduler in green IT cloud computing environment | CloudSim—Simulator (open source, Hybrid-based heuristic) | t2.micro, t2.small, t2.medium, t2.large (9) | Montage, CyberShake, Epigenomics, LIGO, and SIPHT | Small
[ | On workflow scheduling for end-to-end performance optimization in distributed network environments | Fair-share scheduling policy in C++—Simulator and real-world experiment (Single-based heuristic) | t2.medium (6) | Weather Research and Forecasting (WRF) | Small-medium-large
[ | Ant colony optimization-based service flow scheduling with various QoS requirements in cloud computing | Developed tool in Java—Simulator (Single-based heuristic) | t2.medium (10) | Montage, CyberShake, Epigenomics, LIGO, and SIPHT | Small-medium-large
[ | Workflow scheduling for SaaS/PaaS cloud providers considering two SLA levels | Java and IBM ILOG CPLEX Optimizer—Real-world experiment (Single-based heuristic) | m3.2xlarge (4) | Montage fork-join DAG | Small
[ | Workflow scheduling to minimize data movement using multi-constraint graph partitioning | Pwrake workflow system, InTrigger Kore—Real-world experiment (empirical study, Single-based heuristic) | m4.xlarge, m4.10xlarge (8) | Montage 2MASS | Large
[ | HCOC: A cost optimization algorithm for workflow scheduling in hybrid clouds | Amazon Elastic Compute Cloud—Real-world experiment (empirical study, Hybrid-based heuristic) | t2.small, t2.large, m4.xlarge (3) | Montage, AIRSN, CSTEM, LIGO-1, LIGO-2, and Chimera-1 | Small-medium
[ | Towards a cost model for scheduling scientific workflows activities in Cloud | SciCumulus—Developed tool in Java (Single-based heuristic) | m4.xlarge (8) | DNA sequences | Small-medium
Most of the considered SWFAs have different types of task dependencies. The main types are pipeline, data distribution and redistribution, process, and data aggregation. Hence, it is important to accurately measure the data-intensiveness and computational-intensiveness related to the performance of the various SWFS approaches. To achieve this, various sizes of SWFA tasks must be compared. For more reviews in the area of SWFS, please refer to the previously conducted work of the authors [
To propose a cost-optimal solution for a given SWFS problem, the completion time (makespan) and total computational cost of workflow tasks need to be simultaneously minimized as much as possible. Therefore, there is a need to apply a Pareto-optimal solution method, which considers many solutions in the feasible region rather than focusing on just one. Based on the nature of the optimization problem, there are two main types of objectives: minimizing or maximizing the optimization problem [
Furthermore, the proposed model has classified the cost parameters into three categories: (i) pre-scheduling, (ii) during the scheduling, and (iii) post-scheduling. At the Pre-Scheduling stage, the main responsibility of the scheduler is to check whether the given SWFS tasks are schedulable, based on a set of attributes. During the Scheduling stage, it is of vital importance to check the ready time parameter. Notice that the ready time is the time by which all of the data required by the tasks have reached the scheduled virtual machine (VM), that is, the computational site. Finally, at the Post-Scheduling stage, once all of the given tasks have been scheduled on the available virtual machines, the time between the
In the subsequent sections, the details of each element of the proposed model are provided.
SWFAs are regarded as data- and computationally-intensive workflow applications that, broadly speaking, process and execute data flows collectively with task execution. They consist of a number of tasks required to successfully accomplish a specific workflow. These tasks can be any executable elements (e.g., load sets, document sets, data, and programs) with distinct structural dependencies, such as process, pipeline, data distribution, data aggregation, and data redistribution. SWFAs include a number of input scripts, such as scientific software along with their structured data. Notice that the input scripts are used to generate, analyze, and visualize the obtained results (
The output of SWFAs presents interactive tools that assist service consumers in better executing the given workflows. Moreover, it also supports visualizing the obtained results in a reasonable time. Also of note is that SWFAs streamline the execution procedures for scientists, which is useful for reusing identical workflows. Furthermore, SWFAs present a highly usable environment for tuning and sharing outcomes in a virtual environment. Conversely, the high level of dependency between the workflow tasks remains a distinct challenge for SWFAs. This mainly occurs because of the tasks' precedence constraints. Thus, SWFAs need extra computational resources to effectively determine an optimal SWFS solution for large data and complex tasks. Due to the complex nature of SWFAs, a structural representation is required to simplify the submitted workflow tasks. Hence, it is essential to model the submitted jobs, along with their precedence constraints, using preferred notations. In the literature, numerous variants of structural representation strategies have been adopted to represent the task dependencies of an SWFA. One of the most popular is the Directed Acyclic Graph (DAG), as it can handle highly complex SWFA tasks. In essence, the DAG highlights the workflow's estimated execution cost, keeping in view the available resources and remaining grounded solely in the historical statistics of the WfMS. It also shows the communication between the estimated resources.
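The DAG representation described above can be sketched in a few lines. The fragment below is an illustrative model (task names and edges are hypothetical, not from the SIPHT dataset): tasks are nodes, precedence constraints are directed edges, and a topological order gives an execution order that respects all dependencies.

```python
# Minimal sketch: a workflow modeled as a DAG of tasks with precedence
# constraints, plus a topological order a scheduler can follow.
from collections import deque

# Hypothetical workflow fragment: edges point from parent to child task.
edges = {
    "t1": ["t2", "t3"],   # data distribution from t1
    "t2": ["t4"],         # pipeline
    "t3": ["t4"],         # data aggregation into t4
    "t4": [],             # exit task
}

def topological_order(edges):
    """Return tasks in an order that satisfies all precedence constraints."""
    indegree = {t: 0 for t in edges}
    for children in edges.values():
        for c in children:
            indegree[c] += 1
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in edges[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(edges):
        raise ValueError("cycle detected: not a valid workflow DAG")
    return order

print(topological_order(edges))  # ['t1', 't2', 't3', 't4']
```

The acyclicity check matters in practice: a cyclic dependency would make the workflow unschedulable.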
Mapping is required between the submitted set of tasks and the available set of homogeneous/heterogeneous computational resources that help in successfully executing the workflow tasks in a cloud computing environment. Generally, the computational resources are regarded as a set of available Virtual Machines (VMs) in the cloud computing context.
In order to recommend a cost-optimal solution for the SWFS problem, the completion time (makespan) and the total computational cost of workflow tasks have to be minimized. The cost optimization of SWFS can be accomplished by statically and simultaneously minimizing the execution time and execution cost. Thus, there is a need to apply a Pareto-optimal method. The underlying concept of the Pareto-optimal approach is to consider many solutions in the feasible region rather than focusing on a single solution. Based on the nature of the optimization problem, there are two kinds of objectives: minimizing or maximizing the optimization problem. In this research, minimizing the completion time (makespan) of the submitted workflow tasks and the total execution cost of workflow tasks has been viewed as the optimization problem. As shown in
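The Pareto idea for the two minimization objectives (makespan and total computational cost) can be illustrated with a small filter. The candidate (makespan, cost) pairs below are hypothetical, not measured values from the paper.

```python
# Illustrative sketch of Pareto-optimal filtering for two objectives that
# are both minimized: makespan and total computational cost.
def pareto_front(candidates):
    """Keep solutions not dominated by any other candidate."""
    front = []
    for m, c in candidates:
        dominated = any(
            (m2 <= m and c2 <= c) and (m2 < m or c2 < c)
            for m2, c2 in candidates
        )
        if not dominated:
            front.append((m, c))
    return front

# Hypothetical candidate schedules as (makespan, cost) pairs.
schedules = [(10.0, 0.020), (12.0, 0.015), (11.0, 0.025), (14.0, 0.014)]
print(pareto_front(schedules))  # [(10.0, 0.02), (12.0, 0.015), (14.0, 0.014)]
```

Here (11.0, 0.025) is dropped because (10.0, 0.020) is better on both objectives; the three surviving points are the trade-off set from which a scheduler picks a compromise.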
The scheduler in this stage initially focuses on determining whether the given SWFS tasks are schedulable. If the given tasks are schedulable, then the scheduler performs a selection grounded on a set of parameters: (i) types of the available processors, (ii) available processors, (iii) busy processors, (iv) name of the computational site, and (v) estimated task execution times. To conduct optimal scheduling, the scheduler makes the crucial decisions based on the historical data and the available computational resources. Thus, there is a need to keep the historical data, which is necessary to estimate the feasible execution time.
It is important to devise a mechanism for assigning each computational task to each of the available virtual machines with an estimated cost weight. Motivated by this, the scheduler uses a computational cost matrix (w) having dimensions
Likewise, a communication cost matrix (
Suppose there is a scenario in which the scheduler maps (schedules) both tasks, say
where
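The matrix setup described above can be sketched as follows. The shapes and values are assumptions for illustration (tasks × VMs for the computational cost matrix, VMs × VMs for the communication cost matrix), not the paper's exact data.

```python
# Sketch of the two cost matrices used by the scheduler (assumed shapes):
# w[i][j] is the estimated execution time of task i on VM j, and
# comm[j][k] is the transfer cost between VM j and VM k.

w = [
    [14.0, 16.0, 9.0],
    [13.0, 19.0, 18.0],
    [11.0, 13.0, 19.0],
]
comm = [
    [0.0, 1.0, 2.0],   # transfer cost is 0 when both tasks share a VM
    [1.0, 0.0, 1.5],
    [2.0, 1.5, 0.0],
]

def avg_computation_cost(w, task):
    """Average cost of `task` over all VMs, as used when ranking tasks."""
    return sum(w[task]) / len(w[task])

print([round(avg_computation_cost(w, t), 2) for t in range(3)])
# [13.0, 16.67, 14.33]
```

A list scheduler typically ranks tasks by such averages before committing each one to a concrete VM, since the VM is unknown at ranking time.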
Based on the historical information, and to correctly begin the scheduling processes, it is essential to consider the Earliest Start Time (EST) and Earliest Finish Time (EFT) of the execution processes. The EST is defined as the earliest time at which a task's execution can be initiated on an available VM. However, it remains difficult to determine the EST, especially in a heterogeneous environment. This is mainly because a task's computation time differs across the available VMs of a specific cloud. On the other hand, a task requires a specific amount of time; thus, it cannot be scheduled before the EST and must be finished by the EFT. The formal representation of the EST for each of the unscheduled workflow tasks is given as follows in
Note that the start of the workflow is indicated by the task
where
EFT refers to the Earliest Finish Time, during which the earliest computation of every unscheduled task
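Since the equations themselves are referenced rather than reproduced here, the sketch below follows the standard list-scheduling definitions consistent with the description above: the EST of an entry task is 0, the EST of any other task is the latest parent finish time plus its data transfer time, and EFT adds the task's computation time on the chosen VM. The fixed transfer cost and task names are illustrative assumptions.

```python
# Hedged sketch of the EST/EFT recursion (illustrative names and values).
def est(task, vm, parents, eft_sched, transfer):
    """Earliest start time of `task` on `vm`, given its scheduled parents."""
    if not parents.get(task):          # entry task: starts immediately
        return 0.0
    # Ready only after every parent has finished and its data has arrived.
    return max(eft_sched[p] + transfer(p, task, vm) for p in parents[task])

def eft(task, vm, parents, eft_sched, transfer, w):
    """Earliest finish time: EST plus computation time of `task` on `vm`."""
    return est(task, vm, parents, eft_sched, transfer) + w[(task, vm)]

# Tiny example: t3 depends on t1 and t2, which finished at times 4.0 and 6.0.
parents = {"t1": [], "t2": [], "t3": ["t1", "t2"]}
eft_sched = {"t1": 4.0, "t2": 6.0}
transfer = lambda p, t, vm: 1.0        # assumed fixed transfer cost
w = {("t3", "vm0"): 3.0}               # assumed computation time on vm0
print(eft("t3", "vm0", parents, eft_sched, transfer, w))  # 10.0
```

Note that the inner `max` over parents is exactly the ready time discussed next: the moment all required data has reached the candidate VM.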
It is important to check the ready time parameter in this scheduling stage. The ready time is the time by which all the data needed by the tasks have reached (after the parent(s) node has/have been executed) the scheduled virtual machine (computational site).
In comparison to the EST, the ready time is the earliest time at which the first workflow task can be accomplished. The first task (assignment) is selected solely based on the parent workflow tasks. The ready time is formally calculated using
where
Generally, a workflow task
Once the submitted tasks are scheduled, the start time of the parent task will be used as a deadline for the dependent tasks. At this point, the scheduling algorithm considers two main possibilities for a workflow task. The first scenario presumes that the workflow start time is the EST, which is known prior to scheduling. The second scenario assumes that the actual start time can only be calculated once the tasks are scheduled on a virtual machine. The actual finish time represents the time that has been used to complete the execution of the submitted workflow task. In other words, after all the tasks are scheduled, the schedule length (completion time) is represented as the actual finish time of the exit task.
The traditional population-based meta-heuristic approaches have shown good performance for the optimization problem, having a large search space. This is because the traditional meta-heuristics do not exhaustively search the scheduling problem space. They use different underlying strategies to find the desired solution based on defined fitness criteria. Therefore, population-based (e.g., random-guided) methods take less computational effort than single-based ones and can often find good solutions. However, each type of approach has some strengths and limitations, which affect the scheduling operation of SWFS.
In contrast, the hybrid meta-heuristic uses the best features of two or more traditional meta-heuristics in each iteration to provide a better optimal solution compared to the traditional heuristics. However, in some cases, and due to the complexity of the hybrid method, a longer convergence time may be needed at each iteration process than with the traditional meta-heuristics. Furthermore, hybrid approaches could require a longer scheduling time.
In this research, an empirical comparison has been conducted based on three types of heuristic approaches, Single-based, Hybrid-based, and Hyper-based, to evaluate the effectiveness of the proposed cost optimization model.
After conducting an extensive literature review, the authors have considered three well-known single-optimization-based (population-oriented) heuristic techniques: (i) Genetic Algorithm (GA), (ii) Particle Swarm Optimization (PSO), and (iii) Invasive Weed Optimization (IWO) [
In total, five core phases of HIWO have been proposed: (i) initialization, (ii) reproduction (duplication), (iii) mutation, (iv) spatial dispersal, and (v) competitive exclusion. Algorithm-3 presents the core phases of HIWO.
This phase accepts the main inputs: the population size and the problem size. As previously discussed, the population size and the problem size refer to the set of schedulable workflow tasks and the set of VMs, respectively. In this research context, each weed represents a scheduling solution, which is encoded using a pair of one-dimensional strings. Each string represents an ordered list of tasks to be scheduled on an available VM. Thus, the initial population of weeds represents an initial set of scheduling solutions drawn from random positions covering a large one-dimensional problem space.
This phase aims at computing the attained fitness value, that is, the attained completion time, based on the fitness function of the generated weeds. Each of the generated weeds is regarded as a single solution. Thus, HIWO works on each sub-solution (i.e., local solution) in order to find the minimal value with respect to the attained completion time and cost.
This stage aims at enhancing the colony diversity by artificially improving (and replacing) a single weed with the best solution, regarded as NewSeeds in this research context. This operation helps in avoiding premature convergence and the risk of missing the global-optimum cost for multi-dimensional problems. Importantly, the mutation operation assists in speeding up the search operation, which is crucial to finding the best solution.
This stage focuses on randomly distributing the generated NewSeeds (regarded as children) over the d-dimensional problem search space based on a normally distributed random number generator. The standard deviation of this distribution is reduced over the iterations of the search process, from an initial value (
where
This phase mainly aims at excluding unwanted plants based on their attained bad (low) fitness. This elimination mechanism permits the search for a good (fitter) plant to generate the additional seeds, and this continues until the defined stopping criterion has been met. Finally, the plant with the best time and cost (compared to the others) is picked and yielded as the best solution.
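The five phases above follow the classical IWO loop. The compact sketch below is illustrative only: the parameter values are placeholders, and the continuous sphere fitness stands in for the paper's actual objective (where each weed encodes an ordered task-to-VM assignment and fitness is the attained completion time/cost).

```python
# Illustrative IWO loop covering initialization, reproduction (fitter weeds
# spawn more seeds), spatial dispersal (Gaussian spread with a shrinking
# standard deviation), and competitive exclusion (truncation to max_pop).
import random

def iwo_minimize(fitness, dim, iters=50, pop=5, max_pop=15,
                 seeds_max=4, sigma_init=1.0, sigma_final=0.01, n=3):
    colony = [[random.uniform(0, 1) for _ in range(dim)] for _ in range(pop)]
    for it in range(iters):
        # Spatial dispersal: S.D. shrinks nonlinearly toward sigma_final.
        sigma = ((iters - it) / iters) ** n * (sigma_init - sigma_final) \
                + sigma_final
        fits = [fitness(weed) for weed in colony]
        best, worst = min(fits), max(fits)
        offspring = []
        for weed, f in zip(colony, fits):
            # Reproduction: seed count proportional to relative fitness.
            ratio = (worst - f) / (worst - best) if worst > best else 1.0
            for _ in range(int(round(ratio * seeds_max))):
                offspring.append([x + random.gauss(0, sigma) for x in weed])
        colony += offspring
        # Competitive exclusion: keep only the fittest plants.
        colony = sorted(colony, key=fitness)[:max_pop]
    return min(colony, key=fitness)

best = iwo_minimize(lambda w: sum(x * x for x in w), dim=2)
print(sum(x * x for x in best))  # small value near 0 (stochastic)
```

The shrinking standard deviation is what moves the search from exploration (large jumps early on) to exploitation (fine-tuning around the best plants) as iterations proceed.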
In the current research work, the authors have employed a Dynamic Hyper-Heuristic Algorithm (DHHA) regarded as a hyper-heuristic technique, motivated from their prior work. Complete details on the DHHA approach can be found in our previous work [
The evaluation of any approach using the simulation environment allows researchers to compare application configurations under controlled conditions. Several types of simulation tools have been utilized by researchers to evaluate the cost optimization of SWFS approaches in cloud computing (e.g., CloudSim [
The extensive evaluation in the current study was conducted through WorkflowSim, set up in the Eclipse editor. The data collected from executing the actual SWFA were used to generate synthetic workflows resembling those used by real-world scientific applications. The simulation-based environment helped in clearly understanding the scheduling process. At the same time, it defined different scenarios (i.e., the number of VMs and the size of the SWFA) to fully investigate the performance of the proposed model.
Using the simulation experimentation environment, the proposed cost optimization model was evaluated using SWFA datasets (i.e., SIPHT). For each of these SWFAs, three dataset sizes were considered to evaluate the data intensiveness and computational intensiveness of the proposed approach. Different types of statistical analysis were conducted for the collected results. The details of the evaluation are provided in the remainder of this section.
As already defined, SWFAs consist of multiple tasks necessary to accomplish the given workflow. Note that the main elements of the tasks are executable instances, such as data, programs, load sets, and report sets. Another relation in a scientific workflow application is the relationship between the tasks/jobs and their data dependencies. There are five main types of these relationships: process, pipeline, data distribution, data aggregation, and data redistribution.
Additionally, in the clustering stage of WorkflowSim, the tasks with the same type of process can be represented as a job. In this way, similar tasks can be executed one time instead of repeating the same execution several times.
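The clustering idea can be sketched simply: tasks of the same type are grouped into one job so that repeated submissions of identical work are avoided. This is a simplified illustration of WorkflowSim's clustering stage; the task IDs and type names below are hypothetical (chosen to resemble SIPHT-style task types), not read from the actual dataset.

```python
# Simplified sketch of task clustering: group tasks by type into jobs.
from collections import defaultdict

tasks = [  # (task_id, task_type) — hypothetical SIPHT-like task list
    ("t1", "Patser"), ("t2", "Patser"), ("t3", "Patser"),
    ("t4", "Blast"), ("t5", "Blast"), ("t6", "SRNA"),
]

jobs = defaultdict(list)
for task_id, task_type in tasks:
    jobs[task_type].append(task_id)

print(dict(jobs))
# {'Patser': ['t1', 't2', 't3'], 'Blast': ['t4', 't5'], 'SRNA': ['t6']}
```

Each resulting job is then submitted once, which reduces per-task scheduling and queuing overhead in the simulation.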
(a) Completion time comparison | | | | | | (b) Total computational cost | | | | |
---|---|---|---|---|---|---|---|---|---|---|---
 | GA | PSO | HIWO | IWO | DHHA | | GA | PSO | HIWO | IWO | DHHA
Scenario-1 | | | | | | Scenario-1 | | | | |
AVERAGE | 3.48 | 3.48 | 3.48 | 3.48 | 3.348466667 | AVERAGE | 0.004084128 | 0.004084128 | 0.004084128 | 0.004084128 | 0.00392976
MIN | 3.48 | 3.48 | 3.48 | 3.48 | 2.71 | MIN | 0.004084128 | 0.004084128 | 0.004084128 | 0.004084128 | 0.003180456
MAX | 3.48 | 3.48 | 3.48 | 3.48 | 3.658 | MAX | 0.004084128 | 0.004084128 | 0.004084128 | 0.004084128 | 0.004293029
S.D. | 9.03E−11 | 9.03E−11 | 9.03E−11 | 9.03E−11 | 0.290713122 | S.D. | 1.86E−13 | 1.86E−13 | 1.86E−13 | 1.86E−13 | 0.000341181
Scenario-2 | | | | | | Scenario-2 | | | | |
AVERAGE | 7.95 | 8.17 | 7.342 | 7.764666667 | 6.505333333 | AVERAGE | 0.01866024 | 0.019176624 | 0.017233142 | 0.018225226 | 0.015269318
MIN | 6.74 | 6.74 | 3.7 | 3.7 | 3.7 | MIN | 0.015820128 | 0.015820128 | 0.00868464 | 0.00868464 | 0.00868464
MAX | 10.04 | 10.04 | 10.04 | 10.04 | 10.04 | MAX | 0.023565888 | 0.023565888 | 0.023565888 | 0.023565888 | 0.023565888
S.D. | 1.617437309 | 1.663222879 | 2.380723448 | 2.251728429 | 2.540197838 | S.D. | 0.003796449 | 0.003903917 | 0.005588034 | 0.005285257 | 0.005962352
Scenario-3 | | | | | | Scenario-3 | | | | |
AVERAGE | 16.17333333 | 18.47973333 | 19.13333333 | 19.62666667 | 13.00073333 | AVERAGE | 0.075924096 | 0.08675126 | 0.08981952 | 0.092135424 | 0.061030643
MIN | 7.54 | 4.14 | 11.24 | 11.24 | 8.316 | MIN | 0.035395776 | 0.019434816 | 0.052765056 | 0.052765056 | 0.03903863
MAX | 26.04 | 26.04 | 26.04 | 26.04 | 16.906 | MAX | 0.122242176 | 0.122242176 | 0.122242176 | 0.122242176 | 0.079363526
S.D. | 6.893591894 | 7.378023055 | 4.632861135 | 5.226572249 | 1.97026289 | S.D. | 0.032361278 | 0.034635391 | 0.021748503 | 0.024535621 | 0.009249202
Scenario-4 | | | | | | Scenario-4 | | | | |
AVERAGE | 10.87 | 10.87833333 | 10.87 | 10.87 | 10.6196 | AVERAGE | 0.012757032 | 0.012766812 | 0.012757032 | 0.012757032 | 0.012463163
MIN | 10.87 | 10.87 | 10.87 | 10.87 | 10.244 | MIN | 0.012757032 | 0.012757032 | 0.012757032 | 0.012757032 | 0.012022358
MAX | 10.87 | 10.94 | 10.87 | 10.87 | 10.87 | MAX | 0.012757032 | 0.012839184 | 0.012757032 | 0.012757032 | 0.012757032
S.D. | 3.61E−10 | 0.021668877 | 3.61E−10 | 3.61E−10 | 0.311918822 | S.D. | 2.48E−13 | 2.54E−05 | 2.48E−13 | 2.48E−13 | 0.000366068
Scenario-5 | | | | | | Scenario-5 | | | | |
AVERAGE | 25.35 | 26.27396667 | 26.15666667 | 29.05333333 | 20.87 | AVERAGE | 0.05950152 | 0.061670255 | 0.061394928 | 0.068193984 | 0.048986064
MIN | 20.3 | 13.134 | 11.2 | 20.3 | 11.2 | MIN | 0.04764816 | 0.030828125 | 0.02628864 | 0.04764816 | 0.02628864
MAX | 30.4 | 83.499 | 30.4 | 30.4 | 30.4 | MAX | 0.07135488 | 0.195988853 | 0.07135488 | 0.07135488 | 0.07135488
S.D. | 5.13633104 | 12.61821654 | 7.057482375 | 3.492033627 | 6.913263115 | S.D. | 0.012055996 | 0.029617478 | 0.016565323 | 0.008196501 | 0.016226811
Scenario-6 | | | | | | Scenario-6 | | | | |
AVERAGE | 49.93866667 | 49.83466667 | 48.53506667 | 47.73466667 | 38.48333333 | AVERAGE | 0.234432077 | 0.233943859 | 0.227843017 | 0.224085619 | 0.18065616
MIN | 11.64 | 11.64 | 17.58 | 11.64 | 26.7 | MIN | 0.054642816 | 0.054642816 | 0.082527552 | 0.054642816 | 0.12534048
MAX | 73.6 | 73.6 | 73.6 | 73.6 | 63.1 | MAX | 0.34550784 | 0.34550784 | 0.34550784 | 0.34550784 | 0.29621664
S.D. | 21.26467489 | 18.40849649 | 20.50696816 | 20.33561302 | 10.71438232 | S.D. | 0.09982489 | 0.086416846 | 0.096267911 | 0.095463502 | 0.050297596
Scenario-7 | | | | | | Scenario-7 | | | | |
AVERAGE | 106.68 | 106.68 | 106.68 | 106.68 | 106.7043333 | AVERAGE | 0.125199648 | 0.125199648 | 0.125199648 | 0.125199648 | 0.125228206
MIN | 106.68 | 106.68 | 106.68 | 106.68 | 106.68 | MIN | 0.125199648 | 0.125199648 | 0.125199648 | 0.125199648 | 0.125199648
MAX | 106.68 | 106.68 | 106.68 | 106.68 | 106.734 | MAX | 0.125199648 | 0.125199648 | 0.125199648 | 0.125199648 | 0.125263022
S.D. | 2.89E−09 | 2.89E−09 | 2.89E−09 | 2.89E−09 | 0.025008045 | S.D. | 1.98E−12 | 1.98E−12 | 1.98E−12 | 1.98E−12 | 2.93E−05
Scenario-8 | | | | | | Scenario-8 | | | | |
AVERAGE | 241.056 | 246.6398667 | 244.0752667 | 225.5513333 | 170.6413333 | AVERAGE | 0.565806643 | 0.578913095 | 0.572893466 | 0.52941409 | 0.400529338
MIN | 106.68 | 60.04 | 78.674 | 106.68 | 106.68 | MIN | 0.250399296 | 0.140925888 | 0.184663613 | 0.250399296 | 0.250399296
MAX | 290.8 | 290.928 | 290.828 | 290.8 | 193.9 | MAX | 0.68256576 | 0.682866202 | 0.682631482 | 0.68256576 | 0.45512208
S.D. | 75.82318823 | 66.49934373 | 77.4065922 | 81.03379247 | 39.22950154 | S.D. | 0.177972187 | 0.15608726 | 0.181688753 | 0.190202518 | 0.092079486
Scenario-9 | | | | | | Scenario-9 | | | | |
AVERAGE | 510.9004 | 549.2 | 503.98 | 523.36 | 423.23 | AVERAGE | 2.398370838 | 2.57816448 | 2.365883712 | 2.456861184 | 1.986810912
MIN | 97.488 | 290.8 | 290.8 | 193.9 | 290.8 | MIN | 0.457647667 | 1.36513152 | 1.36513152 | 0.91024416 | 1.36513152
MAX | 678.4 | 678.4 | 678.4 | 678.4 | 581.5 | MAX | 3.18468096 | 3.18468096 | 3.18468096 | 3.18468096 | 2.7297936
S.D. | 208.8424202 | 114.7477203 | 159.7305131 | 171.8391822 | 128.4599928 | S.D. | 0.980389857 | 0.538671698 | 0.749838921 | 0.806681857 | 0.60304259
SIPHT-search for sRNAs Workflow Application: The bioinformatics project at Harvard University is conducting a wide search for small untranslated Ribonucleic Acids (sRNAs) that regulate several processes, such as secretion or virulence, in bacteria. The SIPHT workflow datasets are composed of small (30 tasks), medium (100 tasks), and large (1000 tasks) workflows.
For running the simulation experiments, a PC with the following specifications was used: operating system: Windows 10 Pro (64 bit); processor: Intel(R) Core (TM) i7-3770 CPU@3.40 GHz; and memory (RAM): 12.0 GB. As WorkflowSim is programmed using Java, Eclipse, an integrated development environment, was used to run the WorkflowSim codes and to implement the proposed approach. The code for WorkflowSim (Version 1.0) is directly imported from the GitHub website [
It is worth noting that the prices of VM are based on the EC2 pricing list. In this research, the type of available computational resources (i.e., VMs) has been selected based on the comprehensive study of the existing approaches discussed in Section 2 and based on the WorkflowSim, where the considered VM instance is equivalent to t2.small instance of the Amazon EC2 website [
In order to substantially analyze the data collected from the experimentation, four kinds of widely used descriptive statistics were computed: minimum, maximum, standard deviation (S.D.), and average. The total time required to finish executing the submitted workflow tasks is calculated as the completion time. Moreover, due to the static submission criteria, the total computational cost is calculated by multiplying the task completion time by the number of considered virtual machines together with the cost of each virtual machine. In this section, the descriptive statistical analysis of the experimental results is presented and discussed for the considered SWFA (i.e., the SIPHT-search for sRNAs SWFA).
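The cost formula stated above can be made concrete with a worked example. The hourly rate below is an assumed t2.small-like price and the units are illustrative; the paper's exact figures come from the EC2 pricing list.

```python
# Worked example of the stated formula:
# total cost = completion time x number of VMs x cost per VM.
def total_computational_cost(makespan, num_vms, vm_cost_per_unit_time):
    """Static-submission cost: every VM is billed for the full makespan."""
    return makespan * num_vms * vm_cost_per_unit_time

# e.g. a 3.48-unit makespan on 2 VMs at an assumed 0.023 per unit of time:
print(total_computational_cost(3.48, 2, 0.023))  # approximately 0.16008
```

Note the implication of the static criterion: adding VMs raises cost linearly unless the makespan shrinks at least proportionally, which is why the makespan and cost objectives must be traded off jointly.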
Based on the size of the workflow task datasets, the first three scenarios use the smallest dataset (30 tasks, with 2, 4, and 8 VMs), the middle three use the medium dataset (100 tasks, with 2, 4, and 8 VMs), and the last three use the largest dataset (1000 tasks, with 2, 4, and 8 VMs).
This research work has proposed a novel cost optimization model that contains the three main stages of scientific workflow application, targeted computational environment, and cost optimization criteria, and the three cost parameters categories of pre-scheduling, during scheduling, and post-scheduling. An empirical comparison of Single-based, Hybrid-based, and Hyper-based heuristic approaches has been provided while considering different numbers of VMs and different sizes of SWFAs.
Statistical analysis has been applied to the data collected by running the heuristic approaches that support the proposed SWFS-based cost optimization model in a WorkflowSim experimentation environment. The SIPHT scientific workflow application with different dataset sizes was executed to determine the completion time (makespan) and total computational cost. For the completion time results, it is concluded that the Single-based heuristic approaches fall short of achieving good results compared to the Hybrid-based and Hyper-based approaches. This is due to the nature of the solutions proposed by these algorithms, which are limited in exploring more optimal solutions. In some cases, the S.D. values of these algorithms were equal to zero because there was no variation to be measured. Besides this, the SIPHT SWFA has a longer completion time (makespan) than the other SWFAs for all dataset sizes. This is because its tasks are large compared to those of the other SWFAs, and the data dependencies (precedence constraints) between the SIPHT SWFA tasks are more complex than in the other SWFAs. For the total computational cost results, similar to the completion time results, the Hybrid-based and Hyper-based approaches showed a lower computational cost than the Single-based approaches. In the future, implementing the proposed model in the real world using a hybrid cloud environment would be an interesting direction in this research domain. Finally, the authors plan to consider other cost parameters, such as storage and communication costs.