WO2008011012A3 - Recoverable error detection for concurrent computing programs - Google Patents

Publication number
WO2008011012A3
WO2008011012A3 PCT/US2007/016170 US2007016170W WO2008011012A3 WO 2008011012 A3 WO2008011012 A3 WO 2008011012A3 US 2007016170 W US2007016170 W US 2007016170W WO 2008011012 A3 WO2008011012 A3 WO 2008011012A3
Authority
WO
WIPO (PCT)
Prior art keywords
barrier synchronization
regions
node
concurrent computing
synchronization point
Prior art date
Application number
PCT/US2007/016170
Other languages
French (fr)
Other versions
WO2008011012A2 (en)
Inventor
Edric Ellis
Jocelyn Luke Martin
Original Assignee
Mathworks Inc
Edric Ellis
Jocelyn Luke Martin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2006-07-17
Filing date
2007-07-17
Publication date
2008-06-26
Application filed by Mathworks Inc, Edric Ellis, Jocelyn Luke Martin
Priority to EP07810520.2A (EP2049994B1)
Publication of WO2008011012A2
Publication of WO2008011012A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/522 Barrier synchronisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes

Abstract

The present invention provides a system and method for detecting communication errors among multiple nodes in a concurrent computing environment. Barrier synchronization points or regions are used to check for communication mismatches. A barrier synchronization point can be placed anywhere in a concurrent computing program. If a communication error occurred before the barrier synchronization point, it will be detected, at the latest, when a node enters the barrier synchronization point. Once a node has reached the barrier synchronization point, it is not allowed to communicate with another node regarding data that is needed to execute the concurrent computing program, even if the other node has not reached the barrier synchronization point. Regions can be used instead of barrier synchronization points to detect a communication mismatch. The concurrent program on each node is separated into one or more regions. Two nodes can communicate with each other only when their regions are compatible; if their regions are not compatible, there is a communication mismatch.
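The two checks described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `Node`, `communicate`, and `CommunicationMismatch` are hypothetical, regions are modeled as integer identifiers, and "compatible" is assumed to mean "equal", which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List


class CommunicationMismatch(Exception):
    """Raised when two nodes attempt to communicate in incompatible states."""


@dataclass
class Node:
    name: str
    region: int = 0           # hypothetical region identifier
    at_barrier: bool = False  # True once the node has reached the barrier
    inbox: List[str] = field(default_factory=list)

    def enter_region(self, region: int) -> None:
        """Move this node's program into a new region."""
        self.region = region

    def enter_barrier(self) -> None:
        """Mark the node as having reached the barrier synchronization point."""
        self.at_barrier = True

    def leave_barrier(self) -> None:
        """Resume normal execution after synchronization."""
        self.at_barrier = False


def communicate(src: Node, dst: Node, payload: str) -> None:
    """Deliver payload from src to dst, or raise on a detected mismatch."""
    # Rule 1: a node at the barrier may not exchange program data,
    # even if its peer has not reached the barrier yet.
    for node in (src, dst):
        if node.at_barrier:
            raise CommunicationMismatch(
                f"{node.name} is at the barrier; {src.name} -> {dst.name} rejected"
            )
    # Rule 2: regions must be compatible (assumed here to mean equal ids).
    if src.region != dst.region:
        raise CommunicationMismatch(
            f"region {src.region} on {src.name} is incompatible with "
            f"region {dst.region} on {dst.name}"
        )
    dst.inbox.append(payload)
```

In this sketch a mismatch surfaces as an exception that the caller can catch, which is one way to make the error recoverable: the program can report the mismatch and continue instead of hanging on a message that will never be matched.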

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07810520.2A EP2049994B1 (en) 2006-07-17 2007-07-17 Recoverable error detection for concurrent computing programs

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/488,432 2006-07-17
US11/488,432 US7925791B2 (en) 2006-07-17 2006-07-17 Recoverable error detection for concurrent computing programs

Publications (2)

Publication Number Publication Date
WO2008011012A2 WO2008011012A2 (en) 2008-01-24
WO2008011012A3 WO2008011012A3 (en) 2008-06-26

Family

ID=38950570

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/016170 WO2008011012A2 (en) 2006-07-17 2007-07-17 Recoverable error detection for concurrent computing programs

Country Status (3)

Country Link
US (2) US7925791B2 (en)
EP (1) EP2049994B1 (en)
WO (1) WO2008011012A2 (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7853639B2 (en) * 2006-09-12 2010-12-14 International Business Machines Corporation Performing process migration with allreduce operations
US7835284B2 (en) * 2006-10-06 2010-11-16 International Business Machines Corporation Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by routing through transporter nodes
US7680048B2 (en) * 2006-10-06 2010-03-16 International Business Machines Corporation Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by dynamically adjusting local routing strategies
US8031614B2 (en) * 2006-10-06 2011-10-04 International Business Machines Corporation Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by dynamic global mapping of contended links
US7839786B2 (en) * 2006-10-06 2010-11-23 International Business Machines Corporation Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by semi-randomly varying routing policies for different packets
US8423987B2 (en) * 2007-01-30 2013-04-16 International Business Machines Corporation Routing performance analysis and optimization within a massively parallel computer
US7706275B2 (en) * 2007-02-07 2010-04-27 International Business Machines Corporation Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by employing bandwidth shells at areas of overutilization
US8296430B2 (en) 2007-06-18 2012-10-23 International Business Machines Corporation Administering an epoch initiated for remote memory access
US8370844B2 (en) * 2007-09-12 2013-02-05 International Business Machines Corporation Mechanism for process migration on a massively parallel computer
US20090080442A1 (en) * 2007-09-26 2009-03-26 Narayan Ananth S Conserving power in a multi-node environment
US9065839B2 (en) * 2007-10-02 2015-06-23 International Business Machines Corporation Minimally buffered data transfers between nodes in a data communications network
US20090113308A1 (en) * 2007-10-26 2009-04-30 Gheorghe Almasi Administering Communications Schedules for Data Communications Among Compute Nodes in a Data Communications Network of a Parallel Computer
US8055879B2 (en) * 2007-12-13 2011-11-08 International Business Machines Corporation Tracking network contention
DE102007061986A1 (en) * 2007-12-21 2009-06-25 Bayerische Motoren Werke Aktiengesellschaft communication system
US9225545B2 (en) 2008-04-01 2015-12-29 International Business Machines Corporation Determining a path for network traffic between nodes in a parallel computer
US8140704B2 (en) * 2008-07-02 2012-03-20 International Business Machines Corporation Pacing network traffic among a plurality of compute nodes connected using a data communications network
US8375367B2 (en) * 2009-08-06 2013-02-12 International Business Machines Corporation Tracking database deadlock
US8365186B2 (en) 2010-04-14 2013-01-29 International Business Machines Corporation Runtime optimization of an application executing on a parallel computer
US20170004668A1 (en) 2010-06-30 2017-01-05 Microsafe Sa De Cv Cash container
US8595312B2 (en) * 2010-06-30 2013-11-26 Microsafe Sa De Cv Master device detecting or host computer detecting new device attempting to connect to controller area network (CAN)
US8782434B1 (en) 2010-07-15 2014-07-15 The Research Foundation For The State University Of New York System and method for validating program execution at run-time
US8504730B2 (en) 2010-07-30 2013-08-06 International Business Machines Corporation Administering connection identifiers for collective operations in a parallel computer
US8949453B2 (en) 2010-11-30 2015-02-03 International Business Machines Corporation Data communications in a parallel active messaging interface of a parallel computer
US8565120B2 (en) 2011-01-05 2013-10-22 International Business Machines Corporation Locality mapping in a distributed processing system
US9317637B2 (en) 2011-01-14 2016-04-19 International Business Machines Corporation Distributed hardware device simulation
US8949328B2 (en) 2011-07-13 2015-02-03 International Business Machines Corporation Performing collective operations in a distributed processing system
US8689228B2 (en) 2011-07-19 2014-04-01 International Business Machines Corporation Identifying data communications algorithms of all other tasks in a single collective operation in a distributed processing system
US9250948B2 (en) 2011-09-13 2016-02-02 International Business Machines Corporation Establishing a group of endpoints in a parallel computer
DE102011084569B4 (en) * 2011-10-14 2019-02-21 Continental Automotive Gmbh Method for operating an information technology system and information technology system
US8930962B2 (en) 2012-02-22 2015-01-06 International Business Machines Corporation Processing unexpected messages at a compute node of a parallel computer
US20130247069A1 (en) * 2012-03-15 2013-09-19 International Business Machines Corporation Creating A Checkpoint Of A Parallel Application Executing In A Parallel Computer That Supports Computer Hardware Accelerated Barrier Operations
US9122873B2 (en) 2012-09-14 2015-09-01 The Research Foundation For The State University Of New York Continuous run-time validation of program execution: a practical approach
US9069782B2 (en) 2012-10-01 2015-06-30 The Research Foundation For The State University Of New York System and method for security and privacy aware virtual machine checkpointing
JP5994601B2 (en) * 2012-11-27 2016-09-21 富士通株式会社 Parallel computer, parallel computer control program, and parallel computer control method
US9323714B2 (en) * 2012-12-06 2016-04-26 Coherent Logix, Incorporated Processing system with synchronization instruction
JP6335527B2 (en) * 2014-01-28 2018-05-30 キヤノン株式会社 System, system control method, and computer program
US9372766B2 (en) * 2014-02-11 2016-06-21 Saudi Arabian Oil Company Circumventing load imbalance in parallel simulations caused by faulty hardware nodes
EP3146426A4 (en) * 2014-05-21 2018-01-03 Georgia State University Research Foundation, Inc. High-performance computing framework for cloud computing environments
US9696927B2 (en) * 2014-06-19 2017-07-04 International Business Machines Corporation Memory transaction having implicit ordering effects
US9940242B2 (en) 2014-11-17 2018-04-10 International Business Machines Corporation Techniques for identifying instructions for decode-time instruction optimization grouping in view of cache boundaries
US9733940B2 (en) 2014-11-17 2017-08-15 International Business Machines Corporation Techniques for instruction group formation for decode-time instruction optimization based on feedback
FR3039347B1 (en) * 2015-07-20 2017-09-08 Bull Sas METHOD FOR SAVING THE WORKING ENVIRONMENT OF A USER SESSION ON A SERVER
US10783165B2 (en) * 2017-05-17 2020-09-22 International Business Machines Corporation Synchronizing multiple devices
US11175978B2 (en) * 2019-07-31 2021-11-16 International Business Machines Corporation Detection of an error generated within an electronic device
JP2021043737A (en) * 2019-09-11 2021-03-18 富士通株式会社 Barrier synchronization system, barrier synchronization method and parallel information processing device

Citations (1)

Publication number Priority date Publication date Assignee Title
US6216174B1 (en) * 1998-09-29 2001-04-10 Silicon Graphics, Inc. System and method for fast barrier synchronization

Family Cites Families (18)

Publication number Priority date Publication date Assignee Title
US4914657A (en) * 1987-04-15 1990-04-03 Allied-Signal Inc. Operations controller for a fault tolerant multiple node processing system
US6029205A (en) * 1994-12-22 2000-02-22 Unisys Corporation System architecture for improved message passing and process synchronization between concurrently executing processes
US6138140A (en) * 1995-07-14 2000-10-24 Sony Corporation Data processing method and device
US6336161B1 (en) * 1995-12-15 2002-01-01 Texas Instruments Incorporated Computer configuration system and method with state and restoration from non-volatile semiconductor memory
US5768538A (en) * 1996-04-30 1998-06-16 International Business Machines Corporation Barrier synchronization method wherein members dynamic voting controls the number of synchronization phases of protocols and progression to each new phase
US5987477A (en) * 1997-07-11 1999-11-16 International Business Machines Corporation Parallel file system and method for parallel write sharing
JP3571976B2 (en) * 1999-11-08 2004-09-29 富士通株式会社 Debugging apparatus and method, and program recording medium
US6651242B1 (en) * 1999-12-14 2003-11-18 Novell, Inc. High performance computing system for distributed applications over a computer
US6834358B2 (en) * 2001-03-28 2004-12-21 Ncr Corporation Restartable database loads using parallel data streams
US7117248B1 (en) * 2001-09-28 2006-10-03 Bellsouth Intellectual Property Corporation Text message delivery features for an interactive wireless network
US7305582B1 (en) * 2002-08-30 2007-12-04 Availigent, Inc. Consistent asynchronous checkpointing of multithreaded application programs based on active replication
JP4276028B2 (en) * 2003-08-25 2009-06-10 株式会社日立製作所 Multiprocessor system synchronization method
WO2006002076A2 (en) 2004-06-15 2006-01-05 Tekelec Methods, systems, and computer program products for content-based screening of messaging service messages
US7484160B2 (en) 2005-03-04 2009-01-27 Tellabs Operations, Inc. Systems and methods for delineating a cell in a communications network
US7779242B2 (en) * 2005-12-22 2010-08-17 International Business Machines Corporation Data processing system component startup mode controls
US20070174484A1 (en) 2006-01-23 2007-07-26 Stratus Technologies Bermuda Ltd. Apparatus and method for high performance checkpointing and rollback of network operations
US7796527B2 (en) 2006-04-13 2010-09-14 International Business Machines Corporation Computer hardware fault administration
US7610510B2 (en) * 2007-02-16 2009-10-27 Symantec Corporation Method and apparatus for transactional fault tolerance in a client-server system

Non-Patent Citations (2)

Title
JOHNSON T A ET AL: "Cyclical cascade chains: a dynamic barrier synchronization mechanism for multiprocessor systems", PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM., PROCEEDINGS 15TH INTERNATIONAL SAN FRANCISCO, CA, USA 23-27 APRIL 2001, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 23 April 2001 (2001-04-23), pages 2061 - 2068, XP010544629, ISBN: 0-7695-0990-8 *
KLAIBER A C ET AL: "A COMPARISON OF MESSAGE PASSING AND SHARED MEMORY ARCHITECTURES FOR DATA PARALLEL PROGRAMS", PROCEEDINGS OF THE ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE. CHICAGO, APRIL 18 - 21, 1994, LOS ALAMITOS, IEEE COMP. SOC. PRESS, US, vol. SYMP. 21, 18 April 1994 (1994-04-18), pages 94 - 105, XP000480428, ISBN: 0-8186-5510-0 *

Also Published As

Publication number Publication date
WO2008011012A2 (en) 2008-01-24
US7925791B2 (en) 2011-04-12
EP2049994B1 (en) 2016-03-30
US20090006621A1 (en) 2009-01-01
US8055940B2 (en) 2011-11-08
EP2049994A2 (en) 2009-04-22
US20080016249A1 (en) 2008-01-17

Similar Documents

Publication Publication Date Title
WO2008011012A3 (en) Recoverable error detection for concurrent computing programs
WO2007009009A3 (en) Systems and methods for identifying sources of malware
WO2007117567A3 (en) Malware detection system and method for limited access mobile platforms
WO2009038818A3 (en) System and method for providing network penetration testing
WO2014078585A3 (en) Methods, systems and computer readable media for detecting command injection attacks
WO2007005524A3 (en) Systems and methods for identifying malware distribution sites
WO2010118011A3 (en) Invariants-based learning method and system for failure diagnosis in large scale computing systems
WO2007015711A3 (en) Fault recovery for real-time, multi-tasking computer system
WO2009115957A3 (en) Distributed spectrum sensing
WO2017064560A8 (en) Centralized management of a software defined automation system
WO2014138205A3 (en) Methods, systems, and computer readable media for detecting a compromised computing host
WO2008092162A3 (en) Systems, methods, and media for recovering an application from a fault or attack
WO2006121990A3 (en) Fault tolerant computer system
WO2007127764A3 (en) Automated analysis of collected field data for error detection
WO2011056880A3 (en) Rollback feature
WO2009020837A3 (en) Synching data
NO20092482L (en) System analysis and handling
ATE555430T1 (en) SYSTEMS AND PROCEDURES FOR COMPUTER SECURITY
WO2008155188A3 (en) Firewall control using remote system information
WO2008103286A3 (en) Assessment and analysis of software security flaws
WO2008008765A3 (en) Role-based access in a multi-customer computing environment
WO2011100600A3 (en) Methods, systems and computer readable media for providing priority routing at a diameter node
WO2007008845A3 (en) Fault tolerant gaming systems
WO2010042452A3 (en) Machine learning for transliteration
WO2007101117A3 (en) Systems and methods of network monitoring

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07810520; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
NENP Non-entry into the national phase (Ref country code: RU)
WWE WIPO information: entry into national phase (Ref document number: 2007810520; Country of ref document: EP)
Country of ref document: EP