[Summary of the Invention]
In view of this, it is necessary to provide a distributed computing method that can improve portability between different distributed computing platforms.
A distributed computing method comprises the following steps:
building a unified programming interface across multiple distributed computing platforms, based on the common functions of the different distributed computing platforms;
programming through the unified programming interface according to the user's application requirements, to build a distributed application program; and
calling a distribution script to submit the distributed application program to a distributed computing platform, starting the pending task, and executing the distributed application program on the distributed computing platform.
In a preferred embodiment, the unified programming interface comprises a context interface, an interface for parsing input data and generating output data, a file access interface, and an input/output byte stream interface.
In a preferred embodiment, the step of calling the distribution script to submit the distributed application program to the distributed computing platform comprises:
obtaining the list of input files and generating a pending file list;
splitting the pending file list according to the number of mappers in the distributed computing platform; and
submitting the split pending file lists to the distributed computing platform.
In a preferred embodiment, the step of calling the distribution script to submit the distributed application program to the distributed computing platform further comprises:
collecting the configuration information entered through command-line parameters;
generating, according to the configuration information, a separate encapsulation script for the mapper and for the reducer in the distributed computing platform; and
submitting the encapsulation scripts to the distributed computing platform.
In a preferred embodiment, the split pending file lists record the file paths of the pending files;
the step of executing the distributed application program on the distributed computing platform further comprises:
obtaining the split pending file lists through the distributed computing platform, processing the pending files according to their file paths, and outputting the results.
In addition, it is also necessary to provide a distributed computing system that can improve portability between different distributed computing platforms.
A distributed computing system comprises:
a platform package module, configured to build a unified programming interface across multiple distributed computing platforms, based on the common functions of the different distributed computing platforms;
an application package module, configured to program through the unified programming interface according to the user's specific application requirements, to build a distributed application program; and
an execution package module, configured to call a distribution script to submit the distributed application program to a distributed computing platform, start the pending task, and execute the distributed application program on the distributed computing platform.
In a preferred embodiment, the unified programming interface comprises a context interface, an interface for parsing input data and generating output data, a file access interface, and an input/output byte stream interface.
In a preferred embodiment, the execution package module comprises:
a file list generation module, configured to obtain the list of input files and generate a pending file list; and
a splitting module, configured to split the pending file list according to the number of mappers of the distributed computing platform, and to submit the split pending file lists to the distributed computing platform.
In a preferred embodiment, the execution package module further comprises:
a configuration information collection module, configured to collect the configuration information entered through command-line parameters; and
an encapsulation script generation module, configured to generate, according to the configuration information, a separate encapsulation script for the mapper and for the reducer in the distributed computing platform, and to submit the encapsulation scripts to the distributed computing platform.
In a preferred embodiment, the split pending file lists record the file paths of the pending files;
the execution package module further comprises:
a processing module, configured to obtain the split pending file lists through the distributed computing platform, process the pending files according to their file paths, and output the results.
With the above distributed computing method and system, a unified programming interface is built over the distributed computing platforms. The basic functions common to the platforms (which are usually also the most frequently used and most important functions) are added to the unified programming interface, and the concrete implementation of the unified programming interface is encapsulated separately for each distributed computing platform, so that the unified programming interface isolates the platform-specific implementations. By programming against the unified interface, developers need not concern themselves with the many details and dialects of each distributed computing platform. When the generated distributed application program is executed, the distribution script isolates the differences between platforms in how tasks are submitted. As a result, the generated distributed application program can run on different distributed computing platforms with little or no modification, which improves portability between different distributed computing platforms.
[Detailed Description of the Invention]
In one embodiment, as shown in Fig. 1, a distributed computing method comprises the following steps:
Step S102: based on the common functions of different distributed computing platforms, build a unified programming interface across multiple distributed computing platforms.
Step S104: according to the user's application requirements, program through the unified programming interface to build a distributed application program.
Step S106: call the distribution script to submit the distributed application program to a distributed computing platform, start the pending task, and execute the distributed application program on the distributed computing platform.
Different distributed computing platforms share certain commonalities: they all implement some common functions, and these common functions are also the most frequently used and most important basic functions. For example, distributed computing platforms that support MapReduce (a computation model for large-scale data processing) mostly have the following in common: there is a distributed file system, and each file system has a corresponding access interface; a MapReduce program processes input data and either outputs key-value pairs or writes its results directly to the distributed file system; configuration parameters of a task can be passed in from outside; statistics and status information related to the task are available; and when a task is submitted, a mapper (the user application that implements the Map step in MapReduce) and a reducer (the user application that implements the Reduce step in MapReduce) must be provided, and the number of mappers and reducers that execute in parallel can be specified.
Based on these commonalities, a unified programming interface across multiple distributed computing platforms is built, so that the programming interface provides the basic functions common to all the platforms (which are usually also the most frequently used and most important functions). As shown in Fig. 2, the left side represents the underlying distributed computing platforms and distributed file systems. On some distributed computing platforms the distributed file system is part of the platform itself; Fig. 2 shows only a division into functional modules, not a concrete system architecture, and this point is not repeated below.
As shown in Fig. 2, the right side shows the unified programming interface of the framework, which comprises the IContext interface, the ISolver interface, the I/OStream interface, and the IFile interface, where:
The IContext interface is the context interface; the HadoopContext class and the NativeContext class in Fig. 2 are both concrete implementations of this interface. The functions defined by the IContext interface include: initialization and de-initialization functions, for constructing and destructing its own data structures; a task status update function, for feeding back the execution status of the current task to the distributed file system; a counter update function, for gathering task statistics, for example how many records have been processed; an output collection function, for collecting the output key-value pairs; a configuration reading function, for reading run-time configuration information; a global configuration function, for obtaining the overall configuration of the task, for example the login machine, user name, and password; functions for reading and setting the current file path, which specify the full path of the input file currently being processed; and a function for opening other input/output streams, which opens an input/output stream at a path specified by the user.
The ISolver interface is the interface for parsing input data and generating output data. The functions it implements include: A. parsing the input byte stream; B. forcibly parsing the remaining byte stream; C. generating output. For example, LineSolver is a concrete implementation of the ISolver interface that converts the input byte stream into strings one line at a time (that is, into lines of text). Each line of text is an object with business meaning parsed from the byte stream, and it is stored in the instance that implements the ISolver interface. The input byte stream is passed to the ISolver interface, which parses it with function A and splits off lines of text. When no more input data can be read from the file, function B of the ISolver interface is triggered to turn the remaining bytes into a line of text. Each time a line of text is generated, function C of the ISolver interface is triggered to generate output from the current line; this output can be produced through the output collection function of the IContext interface and its function for opening other input/output streams. It should be pointed out that the ISolver interface can process data in any format required by the business and is not restricted to lines of text.
As shown in Fig. 3, the flow by which the ISolver interface carries out its functions is as follows:
Step S302: read the specified input file.
Step S304: determine whether the end of the file has been reached; if so, go to step S306, otherwise go to step S312.
Step S306: call function B of the ISolver interface.
Step S308: determine whether a new business object has been generated; if so, go to step S310, otherwise end. A business object is the object to be processed in the specific business, for example a line of text.
Step S310: call function C of the ISolver interface.
Step S312: call function A of the ISolver interface.
Step S314: determine whether a new business object has been generated; if so, go to step S316, otherwise go to step S318.
Step S316: call function C of the ISolver interface and return to step S302.
Step S318: the buffer is full; report an error.
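The flow of Fig. 3 can be illustrated with a minimal Python sketch. This is not the framework's actual code: the method names `parse`, `flush`, and `emit` (standing for functions A, B, and C), the `free_space` helper, the `capacity` parameter, and the `run` and `parse_lines` drivers are all assumptions made for illustration. The sketch also makes two simplifications: function A returns every line completed by the current read rather than one object at a time, and the buffer-full error of step S318 is raised only when the buffer is actually at capacity without containing a complete line.

```python
import io
from abc import ABC, abstractmethod


class ISolver(ABC):
    """Hypothetical Python rendering of the ISolver interface."""

    @abstractmethod
    def parse(self, chunk):
        """Function A: consume input bytes, return the completed business objects."""

    @abstractmethod
    def flush(self):
        """Function B: force-parse the leftover bytes, return an object or None."""

    @abstractmethod
    def emit(self, obj):
        """Function C: generate output for one business object."""


class LineSolver(ISolver):
    """Turns a byte stream into lines of text (assumed UTF-8, '\\n' separated)."""

    def __init__(self, capacity=4096):
        self.capacity = capacity  # maximum number of buffered bytes (assumption)
        self.buffer = b""
        self.lines = []           # stands in for the IContext output collection

    def free_space(self):
        return self.capacity - len(self.buffer)

    def parse(self, chunk):
        self.buffer += chunk
        # Everything before the last newline is complete; the tail stays buffered.
        *complete, self.buffer = self.buffer.split(b"\n")
        return [line.decode("utf-8") for line in complete]

    def flush(self):
        if not self.buffer:
            return None
        line, self.buffer = self.buffer.decode("utf-8"), b""
        return line

    def emit(self, obj):
        self.lines.append(obj)


def run(stream, solver):
    """Drive a solver over a binary stream, following the steps of Fig. 3."""
    while True:
        chunk = stream.read(solver.free_space())      # S302
        if not chunk:                                  # S304: end of file
            leftover = solver.flush()                  # S306: function B
            if leftover is not None:                   # S308
                solver.emit(leftover)                  # S310: function C
            return
        objects = solver.parse(chunk)                  # S312: function A
        if not objects and solver.free_space() == 0:   # S314 -> S318
            raise BufferError("buffer full before a complete line was found")
        for obj in objects:                            # S316: function C
            solver.emit(obj)


def parse_lines(data, capacity=4096):
    """Convenience wrapper: parse an in-memory byte string into lines."""
    solver = LineSolver(capacity)
    run(io.BytesIO(data), solver)
    return solver.lines
```

In this sketch a solver for another data format would only need to replace `parse` and `flush`; the driver loop stays the same, which mirrors the statement above that the ISolver interface is not restricted to lines of text.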
The IFile interface is the file access interface; the I/OStream interface is the input/output byte stream interface, which comprises the IInputStream interface (input byte stream interface) and the IOutputStream interface (output byte stream interface). As shown in Fig. 2, because the unified programming interface isolates the concrete implementation of each distributed computing platform, the user only needs to interact with these programming interfaces and does not need to be concerned with the details of the underlying distributed computing platform. When a distributed application needs to be ported from one platform to another, it is only necessary to select the implementation modules that correspond to the distributed computing platform on which the application will run. The unified programming interface thus isolates the differences between distributed computing platforms well: a distributed application program developed with these programming interfaces can run on different distributed computing platforms with little or no modification, which improves portability between different distributed computing platforms.
As shown in Fig. 2, the middle part shows the concrete implementations of the programming interface, written for particular types of distributed computing platform. Different distributed computing platforms differ in how they operate on files and in the context information available at run time, and the middle part of Fig. 2 provides concrete implementations of file operations and of the program's run-time context information. For example, the HadoopFile class encapsulates operations on files in HDFS (the distributed file system of the Hadoop platform), the XFSFile class encapsulates operations on files in XFS (the distributed file system of the TBorg platform), the NativeFile class encapsulates operations on local files, the HadoopContext class encapsulates the run-time context information of a Hadoop program, and the NativeContext class encapsulates the run-time context information of a local simulated distributed environment.
It should be noted that the middle part of Fig. 2 only shows, for one embodiment, the encapsulation for the Hadoop distributed computing platform; other distributed computing platforms have their own encapsulations, which follow the same principle and are not described again here. When a user develops a platform-independent application with the framework of this embodiment and targets a specific distributed computing platform, the corresponding implementation modules for file operations and run-time context (such as HadoopFile and HadoopContext) must be used. These implementation modules can be provided by the framework of this embodiment; if the framework does not provide them, the user needs to implement them through interfaces such as IFile and IContext.
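As a rough illustration of how application code can depend only on the context interface while the platform-specific part is swapped in, the following Python sketch defines a reduced context interface and one user-supplied implementation. The method names (`update_status`, `increment_counter`, `collect`, `get_config`), the `InMemoryContext` class, and the `count_words` function are hypothetical and chosen for this example; they cover only part of the functions listed for the IContext interface and are not the framework's real signatures.

```python
from abc import ABC, abstractmethod
from collections import Counter


class IContext(ABC):
    """Hypothetical, reduced rendering of the context interface."""

    @abstractmethod
    def update_status(self, message):
        """Report the execution status of the current task."""

    @abstractmethod
    def increment_counter(self, name, amount=1):
        """Update a named task counter, e.g. the number of records processed."""

    @abstractmethod
    def collect(self, key, value):
        """Collect one output key-value pair."""

    @abstractmethod
    def get_config(self, name, default=""):
        """Read one run-time configuration value."""


class InMemoryContext(IContext):
    """User-supplied implementation for local testing (an assumption of this sketch)."""

    def __init__(self, config=None):
        self.config = dict(config or {})
        self.status = ""
        self.counters = Counter()
        self.output = []

    def update_status(self, message):
        self.status = message

    def increment_counter(self, name, amount=1):
        self.counters[name] += amount

    def collect(self, key, value):
        self.output.append((key, value))

    def get_config(self, name, default=""):
        return self.config.get(name, default)


def count_words(lines, context):
    """Platform-independent application code: it only talks to IContext."""
    for line in lines:
        for word in line.split():
            context.collect(word, "1")
        context.increment_counter("records")
    context.update_status("done")
```

Porting `count_words` to another platform would then mean passing a different `IContext` implementation, without touching the function itself.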
In one embodiment, the generated distributed application program relies on the distribution script to be submitted to the distributed computing platform. The distribution script is provided by the scheme of this embodiment. It gives the user a unified, platform-independent way of submitting tasks and submits the distributed application program and its data to the specific distributed computing platform. The submission commands of different distributed computing platforms differ greatly; the distribution script accepts a set of options in a uniform format and translates them into the commands of the specific distributed computing platform for execution, thereby isolating the differences between platforms when tasks are submitted. The task submission functions defined by the distribution script are all basic, commonly used, and important; they amount to the intersection of the task submission functions of the individual distributed computing platforms.
When the distributed computing platform supports the MapReduce paradigm, the main options supported when calling the distribution script include the list of input files, the output directory, the mapper executable and its options, the reducer executable and its options, and so on; these options are generally applicable on the various MapReduce distributed computing platforms. For example, the syntax for calling the distribution script (this is only an example; the distribution script is not limited to the following form as long as it provides the same functions) is:
homework.py [OPTION] INPUTFILES... OUTPUT_DIR
Here, homework.py is the name of the distribution script; INPUTFILES gives the paths of the input files, of which several groups may be given, separated by spaces, and wildcards may be used; OUTPUT_DIR gives the path where the output is stored; [OPTION] may include the following options:
-m <mapper>,<num_of_key_fields>,<number of mappers>
Here, <mapper> specifies the program path of the mapper; <num_of_key_fields> is the number of fields occupied by the key part of the key-value pairs output by the mapper; <number of mappers> is the number of mappers that execute the task. The -m parameter is required.
-r <reducer>,<number of reducers>
Here, <reducer> specifies the program path of the reducer; <number of reducers> is the number of reducers that execute the task. The -r parameter is optional.
-n <job-name>
Here, <job-name> is the identifying name of the task, which the user may choose freely.
-o <gz|bz2>
Here, <gz|bz2> is the default compression format of the output data; if it is omitted, the output is not compressed.
-a <...>
This option carries other parameters; where necessary, the user can use it to pass in configuration information specific to the distributed computing platform manually.
For example, the call homework.py -m wc_map,1,10 -r wc_reduce,3 -o bz2 -n word_count input1/a* input2/b* output_dir starts a pending task named word_count, submits the executables wc_map and wc_reduce to the cluster to process the files whose names start with a in the input1 directory and the files whose names start with b in the input2 directory, stores the results in output_dir, and compresses them with bz2. The distributed computation uses 10 mappers and 3 reducers.
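The option syntax above could be parsed along the following lines. This is a sketch using Python's standard argparse module, not the actual homework.py; the function name `parse_args`, the keys of the returned dictionary, and the choice to let -a be repeated are assumptions made for illustration.

```python
import argparse


def parse_args(argv):
    """Parse a homework.py-style command line into a plain dictionary."""
    parser = argparse.ArgumentParser(prog="homework.py")
    parser.add_argument("-m", required=True,
                        metavar="MAPPER,NUM_OF_KEY_FIELDS,NUM_MAPPERS")
    parser.add_argument("-r", metavar="REDUCER,NUM_REDUCERS")
    parser.add_argument("-n", metavar="JOB_NAME")
    parser.add_argument("-o", choices=["gz", "bz2"])
    # Assumption: -a may be given several times, one platform setting each.
    parser.add_argument("-a", action="append", default=[], metavar="EXTRA")
    # INPUTFILES... followed by OUTPUT_DIR; the last path is the output directory.
    parser.add_argument("paths", nargs="+", metavar="PATH")
    args = parser.parse_args(argv)
    if len(args.paths) < 2:
        parser.error("at least one input path and OUTPUT_DIR are required")

    mapper, key_fields, num_mappers = args.m.split(",")
    job = {
        "mapper": mapper,
        "key_fields": int(key_fields),
        "num_mappers": int(num_mappers),
        "reducer": None,
        "num_reducers": 0,
        "name": args.n,
        "compress": args.o,        # None means no compression
        "extra": args.a,
        "inputs": args.paths[:-1],
        "output_dir": args.paths[-1],
    }
    if args.r:                     # -r is optional
        reducer, num_reducers = args.r.split(",")
        job["reducer"] = reducer
        job["num_reducers"] = int(num_reducers)
    return job
```

Note that when the script is run from a shell, wildcards such as input1/a* are normally expanded by the shell before the script sees them; a real distribution script would still need its own expansion for paths that live on a distributed file system.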
As shown in Fig. 4, in one embodiment, the step of calling the distribution script to submit the distributed application program to the distributed computing platform comprises the following process:
Step S402: obtain the list of input files and generate the pending file list.
Different distributed computing platforms require different commands to obtain the pending file list. The pending file list records the full path of each pending file and is saved as text in a local temporary file.
Step S404: split the pending file list according to the number of mappers in the distributed computing platform.
The number of mappers can be specified by the user. Every file in the pending file list generated in step S402 is passed to a mapper as input. Because the user can specify multiple mappers that process in parallel, the pending file list needs to be split so that the workload of the mappers is as even as possible.
In this embodiment, the list is split by number of files. For example, if the pending file list contains 10 pending files and the user has specified 3 mappers, the list can be split as (3, 3, 4): two mappers each process 3 pending files and one mapper processes 4 pending files. The split pending file lists correspond one-to-one to the mappers, so the number of split lists equals the number of mappers. Each split pending file list is a temporary file that records the paths of the pending files of one mapper.
Step S406: submit the split pending file lists to the distributed computing platform.
The split pending file lists are uploaded to a temporary path on the distributed computing platform so that the mappers can read them.
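The splitting of step S404 can be sketched as follows. The function name `split_file_list` is hypothetical, and the rule of giving the extra files to the last chunks is just one way to reproduce the (3, 3, 4) example; the embodiment only requires that the workload be as even as possible.

```python
def split_file_list(paths, num_mappers):
    """Split a list of file paths into num_mappers chunks of near-equal size."""
    if num_mappers <= 0:
        raise ValueError("num_mappers must be positive")
    base, extra = divmod(len(paths), num_mappers)
    chunks = []
    start = 0
    for index in range(num_mappers):
        # The last `extra` chunks each receive one additional file.
        size = base + (1 if index >= num_mappers - extra else 0)
        chunks.append(paths[start:start + size])
        start += size
    return chunks
```

With 10 paths and 3 mappers this yields chunks of 3, 3, and 4 paths. If there are fewer files than mappers, some chunks are empty; how a real distribution script would treat that case is not specified here.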
As shown in Fig. 5, in one embodiment, the step of calling the distribution script to submit the distributed application program to the distributed computing platform further comprises the following process:
Step S502: collect the configuration information entered through command-line parameters.
When the distribution script is called, it collects the configuration information entered by the user through command-line parameters. This configuration information includes the number of mappers specified by the user as described above, the directory of the input file list, the output directory, settings specific to the distributed computing platform, user-defined settings, and so on.
Step S504: according to the configuration information, generate a separate encapsulation script for the mapper and for the reducer in the distributed computing platform.
According to the collected configuration information, one encapsulation script is generated for the mapper and one for the reducer and saved in local temporary files. The encapsulation scripts pass the configuration information entered by the user to the mapper and the reducer in the form of environment variables. For example, under the Linux operating system, the generated encapsulation script inserts the configuration information into the environment variable list of the mapper and the reducer. Because applications on all mainstream operating systems (for example Windows, Linux, and Mac operating systems) support environment variables, passing the user's configuration information through environment variables is more general.
Step S506: submit the encapsulation scripts to the distributed computing platform.
The encapsulation scripts generated in step S504 are submitted, together with the other files required to run the pending task, to the temporary path on the distributed computing platform. The other files required to run the pending task include the local files that the user has specified for upload (which can be specified through parameters of the distribution script) and the underlying files related to the distributed computing platform.
It should be noted that the flow of Fig. 4 and the flow of Fig. 5 can be carried out simultaneously, or one can be carried out after the other has finished.
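The encapsulation script of step S504 might, on Linux, look like a small shell script that exports the configuration and then starts the user program. The following Python sketch generates such a script as text; the function name `make_wrapper_script`, the POSIX `sh` target, and the variable naming rule are assumptions made for this example and are not taken from the embodiment.

```python
import re
import shlex

_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def make_wrapper_script(program, config):
    """Return the text of a POSIX sh script that exports config and runs program."""
    lines = ["#!/bin/sh"]
    for name in sorted(config):
        if not _NAME_PATTERN.fullmatch(name):
            raise ValueError("not a valid environment variable name: %r" % name)
        # shlex.quote keeps values with spaces or shell metacharacters intact.
        lines.append("export %s=%s" % (name, shlex.quote(str(config[name]))))
    # exec replaces the shell, so the platform talks to the user program directly.
    lines.append('exec %s "$@"' % shlex.quote(program))
    return "\n".join(lines) + "\n"
```

The mapper or reducer can then read its settings with the ordinary environment variable facilities of its language, which is what makes this mechanism independent of the distributed computing platform.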
In one embodiment, the split pending file lists record the file paths of the pending files. In this embodiment, the step of executing the distributed application program on the distributed computing platform is specifically: obtaining the split pending file lists through the distributed computing platform, processing the pending files according to their file paths, and outputting the results.
Specifically, after the split pending file lists and the encapsulation scripts generated above have been submitted to the distributed computing platform, the distributed computing platform is notified to start the pending task. The split pending file lists are a number of text files in which each line records the path of a file that a mapper needs to process. A mapper in the distributed computing platform therefore receives lines of text as its input, each line being the path of a file assigned to that mapper, and the reducer receives the output data of the mappers. In this way, the input of a mapper is constrained to always be file paths, one per line; after obtaining a file path, the mapper reads the content of the corresponding file. Because different distributed computing platforms deliver input data in very different ways, this design, in contrast to the traditional approach of passing file content directly to the mapper, unifies the way input data is delivered. At the same time, the framework places no restriction on the actual format of the content of the pending files, which preserves as much flexibility in data processing as possible and thereby achieves cross-platform generality and usability.
After the distributed computing platform has processed the pending files, the results are output. After the task has finished, statistics, error messages, and the like can also be output.
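The mapper-side convention described above, in which the input is one file path per line and the mapper opens each file itself, can be sketched as follows. The names `run_mapper`, `read_local`, and the `process` callback are hypothetical; `read_file` stands in for whatever IFile implementation matches the platform's file system.

```python
def read_local(path):
    """Default reader for local files; a platform-specific reader could replace it."""
    with open(path, "rb") as handle:
        return handle.read()


def run_mapper(path_lines, process, read_file=read_local):
    """Yield the key-value pairs produced for every file path in path_lines.

    path_lines: iterable of text lines, one file path per line (e.g. sys.stdin).
    process:    callable (path, content_bytes) -> iterable of (key, value) pairs.
    read_file:  callable path -> bytes, the only file-system-specific part.
    """
    for line in path_lines:
        path = line.strip()
        if not path:               # skip blank lines in the file list
            continue
        yield from process(path, read_file(path))
```

Because the content is handed to `process` as raw bytes, the sketch places no restriction on the file format, in line with the point made above about keeping data processing flexible.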
In one embodiment, as shown in Fig. 6, a distributed computing system comprises a platform package module 102, an application package module 104, and an execution package module 106, where:
The platform package module 102 is configured to build a unified programming interface across multiple distributed computing platforms, based on the common functions of the different distributed computing platforms, and to interact with the application package module 104 through the unified programming interface.
In one embodiment, as shown in Fig. 2, the unified programming interface across multiple distributed computing platforms is built on the commonalities of the distributed computing platforms, so that the programming interface provides the basic functions common to all the platforms (which are usually also the most frequently used and most important functions). The programming interface comprises a context interface, an interface for parsing input data and generating output data, a file access interface, and an input/output byte stream interface. For a detailed description of the programming interface, refer to the description above; it is not repeated here.
The application package module 104 is configured to program through the unified programming interface according to the user's application requirements, to build a distributed application program.
In this embodiment, the application package module 104 calls the platform package module 102 to carry out the specific data processing business according to the user's specific application requirements, programs through the unified programming interface, and builds a complete distributed application program.
The execution package module 106 is configured to call the distribution script to submit the distributed application program to the distributed computing platform, start the pending task, and execute the distributed application program on the distributed computing platform.
In one embodiment, the generated distributed application program relies on the distribution script to be submitted to the distributed computing platform. The distribution script is provided by the scheme of this embodiment. It gives the user a unified, platform-independent way of submitting tasks and submits the distributed application program and its data to the specific distributed computing platform. The submission commands of different distributed computing platforms differ greatly; the distribution script accepts a set of options in a uniform format and translates them into the commands of the specific distributed computing platform for execution, thereby isolating the differences between platforms when tasks are submitted. The task submission functions defined by the distribution script are all basic, commonly used, and important; they amount to the intersection of the task submission functions of the individual distributed computing platforms.
When the distributed computing platform supports the MapReduce paradigm, the main options supported when calling the distribution script include the list of input files, the output directory, the mapper executable and its options, the reducer executable and its options, and so on; these options are generally applicable on the various MapReduce distributed computing platforms.
As shown in Fig. 7, in one embodiment, the execution package module 106 comprises a file list generation module 116, a splitting module 126, a configuration information collection module 136, an encapsulation script generation module 146, and a processing module 156, where:
The file list generation module 116 is configured to obtain the list of input files and generate the pending file list.
Different distributed computing platforms require different commands to obtain the pending file list. The pending file list records the full path of each pending file and is saved as text in a local temporary file.
The splitting module 126 is configured to split the pending file list according to the number of mappers of the distributed computing platform and to submit the split pending file lists to the distributed computing platform.
The number of mappers can be specified by the user. Every file in the generated pending file list is passed to a mapper as input. Because the user can specify multiple mappers that process in parallel, the pending file list needs to be split so that the workload of the mappers is as even as possible.
In this embodiment, the list is split by number of files. For example, if the pending file list contains 10 pending files and the user has specified 3 mappers, the list can be split as (3, 3, 4): two mappers each process 3 pending files and one mapper processes 4 pending files. The split pending file lists correspond one-to-one to the mappers, so the number of split lists equals the number of mappers. Each split pending file list is a temporary file that records the paths of the pending files of one mapper. The split pending file lists are uploaded to a temporary path on the distributed computing platform so that the mappers can read them.
The configuration information collection module 136 is configured to collect the configuration information entered through command-line parameters.
When the distribution script is called, it collects the configuration information entered by the user through command-line parameters. This configuration information includes the number of mappers specified by the user as described above, the directory of the input file list, the output directory, settings specific to the distributed computing platform, user-defined settings, and so on.
The encapsulation script generation module 146 is configured to generate, according to the configuration information, a separate encapsulation script for the mapper and for the reducer in the distributed computing platform, and to submit the encapsulation scripts to the distributed computing platform.
According to the collected configuration information, one encapsulation script is generated for the mapper and one for the reducer and saved in local temporary files. The encapsulation scripts pass the configuration information entered by the user to the mapper and the reducer in the form of environment variables. Because applications on all mainstream operating systems (for example Windows, Linux, and Mac operating systems) support environment variables, passing the user's configuration information through environment variables is more general. The generated encapsulation scripts are submitted, together with the other files required to run the pending task, to the temporary path on the distributed computing platform. The other files required to run the pending task include the local files that the user has specified for upload (which can be specified through parameters of the distribution script) and the underlying files related to the distributed computing platform.
The processing module 156 is configured to obtain the split pending file lists through the distributed computing platform, process the pending files according to their file paths, and output the results.
Specifically, after the split pending file lists and the encapsulation scripts generated above have been submitted to the distributed computing platform, the distributed computing platform is notified to start the pending task. The split pending file lists are a number of text files in which each line records the path of a file that a mapper needs to process. A mapper in the distributed computing platform therefore receives lines of text as its input, each line being the path of a file assigned to that mapper, and the reducer receives the output data of the mappers. In this way, the input of a mapper is constrained to always be file paths, one per line; after obtaining a file path, the mapper reads the content of the corresponding file. Because different distributed computing platforms deliver input data in very different ways, this design, in contrast to the traditional approach of passing file content directly to the mapper, unifies the way input data is delivered. At the same time, the framework places no restriction on the actual format of the content of the pending files, which preserves as much flexibility in data processing as possible and thereby achieves cross-platform generality and usability.
After the distributed computing platform has processed the pending files, the results are output. After the task has finished, statistics, error messages, and the like can also be output.
It should be noted that the distributed computing method and system provided by the present invention are particularly suitable for distributed computing platforms that support the MapReduce paradigm; for distributed computing platforms based on other paradigms, similar principles can also be applied. With the above distributed computing method and system, portability between different distributed computing platforms can be improved and platform-independent distributed computing can be achieved.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.