CN103324509A

CN103324509A - Method for installing bioinformatics application programs in high-performance cluster system

Info

Publication number: CN103324509A
Application number: CN2013102601126A
Authority: CN
Inventors: 姜金良; 马少杰; 曹振南; 李斌; 赵明坤; 侯雪峰; 何沧平; 田相桂; 杨亮; 易成; 曹征; 苗春葆; 胡耀国; 范娟
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Dawning Information Industry Beijing Co Ltd
Priority date: 2013-06-26
Filing date: 2013-06-26
Publication date: 2013-09-25

Abstract

The invention discloses a method for installing bioinformatics application programs in a high-performance cluster system. The method includes the steps of: loading the environment variables of the bioinformatics application program; selecting a corresponding math library according to the system type and network configuration of the current installation platform; and installing the bioinformatics application program using the environment variables and the math library. The method improves the efficiency of installing bioinformatics application programs in a high-performance cluster system.

Description

Method for installing bioinformatics application programs in a high-performance cluster system

Technical field

The present invention relates generally to bioinformatics research and, more specifically, to a method for installing bioinformatics application programs in a high-performance cluster system.

Background technology

Bioinformatics is the science of collecting, processing, and using biological information with the computer as a tool. Its research objects are generally proteins and macromolecules such as DNA. Because these objects are structurally complex, and because the rapid development of sequencing technology has made the volume of discovered human gene sequence data grow exponentially, research on such a large number of genes usually involves enormous data-processing and parallel-computing workloads.

Bioinformatics research covers many areas. One is sequencing genes with laboratory instruments and performing preliminary processing of the results, that is, offline processing of sequencer measurement data. A DNA sequencer is an advanced instrument for determining DNA (gene) sequences. It is used for paternity testing, individual identification, genetic profiling, paternal-lineage identification, maternal-lineage identification, ethnicity identification, species identification, and the diagnosis of certain diseases. It is an indispensable instrument in the life sciences and an important tool for scientific research. DNA sequencers are expensive. The workflow consists of preparing reagents, sequencing on the instrument, and finally offline processing on the instrument, which yields gene sequences that scientists can read. On this basis, scientists can use the measured sequences for assembly, alignment, homology analysis, and so on.

Sequence alignment mainly involves reconstructing complete DNA sequences from overlapping sequence fragments; determining physical and genetic maps from probe data under various experimental conditions; storing, traversing, and comparing DNA sequences in databases; comparing the similarity of two or more sequences; searching databases for related sequences and subsequences; finding recurring nucleotide patterns; and identifying the informative components of protein and DNA sequences.

Molecular docking simulates the interaction between a small-molecule ligand and a receptor macromolecule according to the lock-and-key principle of ligand and receptor. It predicts the binding mode and affinity between the two by computation, and thereby enables virtual drug screening.

Commonly used programs include abyss, allpathslg, amos, autodock, blast, clustal-omega, clustalw, clustalw-mpi, dock, emboss, exonerate, fasta, fsa, hmmer, mira, mpiblast, mpihmmer, mummer, velvet, wgs, and others.

The installation and deployment of bioinformatics application programs is usually performed manually. This approach has several shortcomings. Compiling and installing the programs is complicated and requires many manually set parameters; manual installation is tedious, time-consuming, and laborious, and mistakes are easy to make if the operator is unfamiliar with the compilation workflow. Different hardware platforms and network environments require different parameter configurations during installation, and unfamiliarity with the operating system, compiler, math library, hardware system, or network environment can lead to low execution efficiency or even incorrect results. After a successful installation, the corresponding environment variables must be configured for the convenience of users; manual configuration is error-prone, and when there are many kinds of application programs, the environment variable settings easily become confused or conflict with one another.

Summary of the invention

In view of the above defects of the prior art, the present invention proposes a method for installing bioinformatics application programs in a high-performance cluster system, which solves the technical problem of how to improve the efficiency of installing bioinformatics application programs in a high-performance cluster system.

The present invention proposes an automatic installation method for bioinformatics application programs on a high-performance computing cluster. The installation program implements automated, unattended installation of a variety of bioinformatics application programs, including abyss, allpathslg, amos, autodock, blast, clustal-omega, clustalw, clustalw-mpi, dock, emboss, exonerate, fasta, fsa, hmmer, mira, mpiblast, mpihmmer, mummer, velvet, wgs, and others. Before installing and configuring a bioinformatics application program, the installation program first automatically checks the other program environments on which the application depends. During automatic installation and configuration, it adjusts and optimizes the configuration parameters according to the network environment of the high-performance computing cluster. After installation, it automatically configures the environment variables and provides an example script for submitting jobs on the cluster system. Throughout the installation, it dynamically reports the installation progress and, if an error occurs, gives a corresponding error message.

According to one aspect of the present invention, a method for installing bioinformatics application programs in a high-performance cluster system is provided, comprising: step S1: loading the environment variables of the bioinformatics application program; step S2: selecting a corresponding math library according to the system type and network configuration of the current installation platform; and step S3: installing the bioinformatics application program using the environment variables and the math library.

In the method, before step S2, the method further comprises: checking whether the source program of the bioinformatics application program exists and whether the installation target directory can be created normally, and if so, performing step S2.

In the method, before step S2, the method further comprises: obtaining the system type and the network configuration of the current installation platform.

In the method, the system type includes the operating system version of the current installation platform.

In the method, obtaining the system type of the current installation platform comprises: obtaining the operating system version of the current installation platform by checking a system file of the current installation platform.

In the method, the network configuration includes whether an InfiniBand network card is configured.

In the method, obtaining the system type and the network configuration of the current installation platform comprises: obtaining the operating system version of the current installation platform by checking a system file of the current installation platform; checking whether an InfiniBand network card is configured in the current installation platform; and checking whether a driver has been installed for the InfiniBand network card and whether the card operates normally.

In the method, the environment variables include general environment variables and specific environment variables. The general environment variables include the subroutines used during installation, the location of the source program of the bioinformatics application program, and the installation target path of the bioinformatics application program. The specific environment variables include the compiler and MPI.

In the method, the method further comprises: saving the output information generated during installation.

In the method, the method further comprises: generating, for the bioinformatics application program, an example script for submitting jobs on the high-performance cluster system, where the content of the example script covers how the bioinformatics application program requests resources and how the application program is run.

The method provided by the present invention improves the efficiency of installing bioinformatics application programs in a high-performance cluster system.

Description of drawings

The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the present invention, they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a flowchart of a general embodiment of the method for installing bioinformatics application programs in a high-performance cluster system according to the present invention;

Fig. 2 is a flowchart of a specific embodiment of the method for installing bioinformatics application programs in a high-performance cluster system according to the present invention;

Fig. 3 is a flowchart of an example of the method for installing bioinformatics application programs in a high-performance cluster system according to the present invention.

Embodiment

The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to describe and explain the present invention and are not intended to limit it.

Fig. 1 is a flowchart of a general embodiment of the method for installing bioinformatics application programs in a high-performance cluster system according to the present invention. In Fig. 1:

Step S100: load the environment variables of the bioinformatics application program. The environment variables may include general environment variables and specific environment variables. The general environment variables include the subroutines used during installation, the location of the source program of the bioinformatics application program, and the installation target path of the bioinformatics application program. The specific environment variables include the compiler and MPI.

Step S102: select a corresponding math library according to the system type and network configuration of the current installation platform. The system type may include the operating system version of the current installation platform. In a preferred embodiment, the operating system version can be obtained by checking a system file of the current installation platform. The operating system may be one of the mainstream high-performance cluster operating systems such as Red Hat, SUSE, or CentOS. The network configuration may include whether an InfiniBand network card is configured; it is also possible to detect further whether a driver is installed for the InfiniBand network card and whether the card operates normally.

Step S104: install the bioinformatics application program using the environment variables and the math library.
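The control flow of steps S100 to S104 can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the installer described here: the function names, the `Platform` structure, the default compiler `gcc`, the `/opt` prefix, and the library names `math-lib-ib` and `math-lib-generic` are all hypothetical placeholders, and the build step is a stub that only reports what it would do.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Facts about the installation platform (hypothetical structure)."""
    os_name: str          # for example "CentOS"
    has_infiniband: bool  # True if an InfiniBand card is configured


def step_s100(app: str) -> dict:
    """Step S100: collect the variables the installation needs."""
    # Compiler and prefix are assumed defaults, not values from the source.
    return {"APP": app, "CC": "gcc", "PREFIX": f"/opt/{app}"}


def step_s102(platform: Platform) -> str:
    """Step S102: pick a math library from system type and network."""
    # Placeholder names; the source does not say which libraries are used.
    return "math-lib-ib" if platform.has_infiniband else "math-lib-generic"


def step_s104(env: dict, math_library: str) -> str:
    """Step S104: stub that describes the build instead of running it."""
    return (f"build {env['APP']} with {env['CC']} and {math_library} "
            f"into {env['PREFIX']}")


def run(app: str, platform: Platform) -> str:
    """Chain the three steps in the order given in Fig. 1."""
    env = step_s100(app)
    return step_s104(env, step_s102(platform))
```

Keeping each step as a separate function mirrors the flowchart and lets a real installer replace one stub at a time.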

The method disclosed in this embodiment simplifies the installation procedure for bioinformatics application programs and reduces the difficulty of installation. Dependency checks, fault-tolerance checks, and standardized configuration raise the installation success rate and installation quality and avoid human operating errors as far as possible. Unattended operation greatly improves the efficiency of installing and deploying bioinformatics application programs and saves time and manpower.

Fig. 2 is a flowchart of a specific embodiment of the method for installing bioinformatics application programs in a high-performance cluster system according to the present invention. In Fig. 2:

Step S200: load the environment variables of the bioinformatics application program. The environment variables may include general environment variables and specific environment variables. The general environment variables include the subroutines used during installation, the location of the source program of the bioinformatics application program, and the installation target path of the bioinformatics application program. The specific environment variables include the compiler and MPI.

Step S202: check whether the source program of the bioinformatics application program exists and whether the installation target directory can be created normally. If so, continue with the following steps; if not, exit the installation.
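The check in step S202 can be sketched as follows. This is a minimal illustration assuming that "can be created normally" means the directory can be created (or already exists) and is writable; the function name `precheck` and the messages are hypothetical.

```python
import os


def precheck(source_path: str, target_dir: str) -> bool:
    """Return True if the source exists and the target directory is usable."""
    if not os.path.exists(source_path):
        print(f"error: source program not found: {source_path}")
        return False
    try:
        # exist_ok=True lets the check pass when the directory already exists.
        os.makedirs(target_dir, exist_ok=True)
    except OSError as exc:
        print(f"error: cannot create installation directory {target_dir}: {exc}")
        return False
    # Creation succeeded; also confirm that files can be written there.
    return os.access(target_dir, os.W_OK)
```

An installer would typically exit with a non-zero status when `precheck` returns `False`, which corresponds to the "exit the installation" branch.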

Step S204: obtain the system type and network configuration of the current installation platform. The system type may include the operating system version of the current installation platform. In a preferred embodiment, the operating system version can be obtained by checking a system file of the current installation platform. The operating system may be one of the mainstream high-performance cluster operating systems such as Red Hat, SUSE, or CentOS. The network configuration may include whether an InfiniBand network card is configured; it is also possible to detect further whether a driver is installed for the InfiniBand network card and whether the card operates normally.

Step S206: select a corresponding math library according to the system type and network configuration of the current installation platform.

Step S208: compile and install the bioinformatics application program using the environment variables and the math library.

Step S210: generate, for the bioinformatics application program, an example script for submitting jobs on the high-performance cluster system, where the content of the example script covers how the bioinformatics application program requests resources and how the application program is run. The example file comprises two parts: how to request computing resources and how to run the application program. A high-performance cluster system is generally configured with a job scheduling system. The part of the script that requests computing resources and issues the execution command is independent of the application program and depends on the settings of the scheduling system. The present invention uses the widely used PBS scheduling system, for which the parameters to set include "#PBS -l nodes=1:ppn=2", "#PBS -q low", and so on. The way the command is executed varies with the application program and with the network conditions detected earlier: if an InfiniBand network is configured, the Open MPI library is selected as the MPI library, and the parameter "--mca btl self,openib" must be added. When using the script, the user only needs to make simple modifications according to the resources actually required.

Fig. 3 is a flowchart of an example of the method for installing bioinformatics application programs in a high-performance cluster system according to the present invention. In this example:

Step 1: load the general environment variables of the program package, which mainly include the subroutines used during installation, the location of the application source program, the installation target path, and so on.

Step 2: load the environment variables needed to install the application program. Most bioinformatics application programs are multithreaded programs that can run only on a single server, so the environment variable to load is mainly the compiler. A few application programs, such as mpiblast and mpihmmer, support multi-node parallelism and also require the environment variables of the MPI library in addition to the compiler. After the required environment is loaded, test whether the compiler and other tools can be used normally.
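Step 2 can be sketched as loading the tool settings and then confirming that each tool can be found. This is a hedged sketch: the variable names `CC` and `MPICC` and the defaults `gcc` and `mpicc` are assumptions for illustration, and locating a tool on `PATH` is a weaker test than actually compiling a test program.

```python
import os
import shutil


def load_tool_environment(needs_mpi: bool, base=None) -> dict:
    """Build the environment for installation from a base mapping."""
    env = dict(os.environ if base is None else base)
    env.setdefault("CC", "gcc")            # assumed default compiler
    if needs_mpi:
        env.setdefault("MPICC", "mpicc")   # assumed MPI compiler wrapper
    return env


def missing_tools(env: dict, needs_mpi: bool) -> list:
    """Return the names of required tools that cannot be found on PATH."""
    required = [env["CC"]]
    if needs_mpi:
        required.append(env["MPICC"])
    search_path = env.get("PATH")
    return [tool for tool in required
            if shutil.which(tool, path=search_path) is None]
```

For a single-server program such as blast the call would use `needs_mpi=False`; for mpiblast or mpihmmer it would use `needs_mpi=True`, and a non-empty result from `missing_tools` would stop the installation.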

Step 3: check whether the source program of the application program exists, whether the installation target directory can be created normally, and so on.

Step 4: check the high-performance cluster environment, including the operating system version, the network, and so on.

The operating system version can be obtained by checking the system file settings. The operating systems currently supported include the mainstream high-performance cluster operating systems such as Red Hat, SUSE, and CentOS.
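Reading the operating system version from a system file can be sketched as below. The source does not name the file it checks; this sketch assumes the `/etc/os-release` format (`KEY=value` lines) used by current Linux distributions, and the function names are hypothetical.

```python
def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in the style of /etc/os-release."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def detect_os(path: str = "/etc/os-release") -> tuple:
    """Return (distribution id, version) read from the given file."""
    with open(path, encoding="utf-8") as handle:
        info = parse_os_release(handle.read())
    return info.get("ID", "unknown"), info.get("VERSION_ID", "unknown")
```

Separating parsing from file access keeps the parser usable on systems where the information lives in a different file.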

The network check is needed mainly when installing application programs that support multiple nodes, such as mpiblast and mpihmmer. It detects whether a high-speed InfiniBand network is configured so that the target application program can use this network, and it covers two main aspects:

(1) Check whether a high-speed InfiniBand network card is configured in the server.

(2) Check whether a driver has been installed for the InfiniBand network card and whether the card status is normal.
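The two checks can be approximated on Linux by reading sysfs. This is a sketch under assumptions, not the check used by the described program: it relies on the Linux convention that a loaded InfiniBand driver registers its devices under `/sys/class/infiniband`, where each port has a `state` file containing text such as `4: ACTIVE`. The function names are hypothetical.

```python
from pathlib import Path


def infiniband_status(sysfs_root: str = "/sys/class/infiniband") -> dict:
    """Map "device/port" to its state string, e.g. {"mlx4_0/1": "ACTIVE"}."""
    status = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return status  # no driver has registered an InfiniBand device
    for state_file in sorted(root.glob("*/ports/*/state")):
        device = state_file.parts[-4]
        port = state_file.parts[-2]
        # The file holds text such as "4: ACTIVE"; keep the part after the colon.
        raw = state_file.read_text().strip()
        status[f"{device}/{port}"] = raw.split(":", 1)[-1].strip()
    return status


def infiniband_usable(sysfs_root: str = "/sys/class/infiniband") -> bool:
    """True if at least one port reports the ACTIVE state."""
    return "ACTIVE" in infiniband_status(sysfs_root).values()
```

An empty result corresponds to "no card or no driver", and a non-empty result without any `ACTIVE` port corresponds to "driver installed but card status not normal".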

Step 5: according to the system information obtained, automatically set the variables needed for installation, and compile and install the program.

Step 6: generate, for the application program, an example script for submitting jobs on the cluster system. The example file comprises two parts: how to request computing resources and how to run the application program. A high-performance cluster system is generally configured with a job scheduling system. The part of the script that requests computing resources and issues the execution command is independent of the application program and depends on the settings of the scheduling system. The present invention uses the widely used PBS scheduling system, for which the parameters to set include "#PBS -l nodes=1:ppn=2", "#PBS -q low", and so on. The way the command is executed varies with the application program. For application programs that support multiple nodes, such as mpiblast and mpihmmer, if an InfiniBand network is configured, the Open MPI library is selected as the MPI library, and the parameter "--mca btl self,openib" must be added. Other, single-node application programs do not need network parameters. When using the script, the user only needs to make simple modifications according to the resources actually required.
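Generating the example script can be sketched as simple text assembly. The PBS directives and the `--mca btl self,openib` option come from the description above; the function name, its parameters, the `cd $PBS_O_WORKDIR` line, and the `mpirun -np` launcher line are assumptions added for illustration.

```python
def make_pbs_script(command: str, nodes: int = 1, ppn: int = 2,
                    queue: str = "low", multi_node: bool = False,
                    infiniband: bool = False) -> str:
    """Return the text of an example PBS job script."""
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # resource request
        f"#PBS -q {queue}",                  # target queue
        "cd $PBS_O_WORKDIR",                 # assumed: run from submit directory
    ]
    if multi_node:
        # Assumed launcher line for MPI programs built against Open MPI.
        launcher = f"mpirun -np {nodes * ppn}"
        if infiniband:
            launcher += " --mca btl self,openib"
        lines.append(f"{launcher} {command}")
    else:
        # Single-node programs need no network parameters.
        lines.append(command)
    return "\n".join(lines) + "\n"
```

The resource lines depend only on the scheduler, while the last line depends on the application and the detected network, which matches the two-part structure of the example file.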

The program package saves the output produced during installation. If the installation exits abnormally, the saved file can be inspected to find the cause of the error.
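Saving the installation output can be sketched as running each build command with its output redirected to a log file. The function name `run_logged` and the log format are hypothetical; the source only states that the output is saved.

```python
import subprocess


def run_logged(command: list, log_path: str) -> int:
    """Run a command, append stdout and stderr to log_path, return exit code."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"$ {' '.join(command)}\n")
        log.flush()  # keep the header ahead of the child's output
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        log.write(f"[exit code {result.returncode}]\n")
    return result.returncode
```

Recording the exit code next to the output makes it easier to see which command caused an abnormal exit.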

Use of the program package: after the package is decompressed, it contains an install.sh script. Enter the package directory and execute the command ./install.sh --<application name>. The application program is then installed automatically.
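How such a `--<application name>` argument might be validated is sketched below. This illustrates the calling convention only and is not the content of install.sh: the function name is hypothetical, and the list contains only the programs named in this description, which itself ends with "and others".

```python
SUPPORTED = (
    "abyss", "allpathslg", "amos", "autodock", "blast", "clustal-omega",
    "clustalw", "clustalw-mpi", "dock", "emboss", "exonerate", "fasta",
    "fsa", "hmmer", "mira", "mpiblast", "mpihmmer", "mummer", "velvet", "wgs",
)


def parse_app(argv: list) -> str:
    """Return the application name from arguments such as ["--blast"]."""
    if len(argv) != 1 or not argv[0].startswith("--"):
        raise ValueError("usage: install --<application name>")
    name = argv[0][2:]
    if name not in SUPPORTED:
        raise ValueError(f"unsupported application: {name}")
    return name
```

Rejecting unknown names before any work starts gives the user an immediate error message instead of a failed build.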

The present invention proposes an automatic installation method for bioinformatics application programs on a high-performance computing cluster. Automation greatly simplifies the installation procedure for bioinformatics application programs and reduces the difficulty of installation. Dependency checks, fault-tolerance checks, and standardized configuration raise the installation success rate and installation quality and avoid human operating errors as far as possible. Unattended operation greatly improves the efficiency of installing and deploying bioinformatics application programs and saves time and manpower. The method and program are widely applicable to the fast, automatic installation and deployment of bioinformatics application programs on high-performance computing clusters of different scales, and are also suitable for rapidly configuring and deploying a high-performance computing environment on temporary computing resources in dynamically changing environments such as cloud computing.

The above are only preferred embodiments of the present invention and are not intended to limit it. For a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for installing bioinformatics application programs in a high-performance cluster system, characterized by comprising:

Step S1: loading the environment variables of the bioinformatics application program;

Step S2: selecting a corresponding math library according to the system type and network configuration of the current installation platform;

Step S3: installing the bioinformatics application program using the environment variables and the math library.

2. The method according to claim 1, characterized in that, before step S2, the method further comprises: checking whether the source program of the bioinformatics application program exists and whether the installation target directory can be created normally, and if so, performing step S2.

3. The method according to claim 1 or 2, characterized in that, before step S2, the method further comprises: obtaining the system type and the network configuration of the current installation platform.

4. The method according to claim 3, characterized in that the system type includes the operating system version of the current installation platform.

5. The method according to claim 4, characterized in that obtaining the system type of the current installation platform comprises: obtaining the operating system version of the current installation platform by checking a system file of the current installation platform.

6. The method according to claim 5, characterized in that the network configuration includes whether an InfiniBand network card is configured.

7. The method according to claim 6, characterized in that obtaining the system type and the network configuration of the current installation platform comprises:

obtaining the operating system version of the current installation platform by checking a system file of the current installation platform;

checking whether an InfiniBand network card is configured in the current installation platform; and

checking whether a driver has been installed for the InfiniBand network card and whether the card operates normally.

8. The method according to claim 7, characterized in that the environment variables include general environment variables and specific environment variables, the general environment variables include the subroutines used during installation, the location of the source program of the bioinformatics application program, and the installation target path of the bioinformatics application program, and the specific environment variables include the compiler and MPI.

9. The method according to claim 1, characterized in that the method further comprises: saving the output information generated during installation.

10. The method according to claim 1, characterized in that the method further comprises: generating, for the bioinformatics application program, an example script for submitting jobs on the high-performance cluster system, wherein the content of the example script covers how the bioinformatics application program requests resources and how the application program is run.