WO2008011012A3 - Recoverable error detection for concurrent computing programs - Google Patents
- Publication number
- WO2008011012A3 (application PCT/US2007/016170)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- barrier synchronization
- regions
- node
- concurrent computing
- synchronization point
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
- G06F9/522—Barrier synchronisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
Abstract
The present invention provides a system and method for detecting communication errors among multiple nodes in a concurrent computing environment. Barrier synchronization points or regions are used to check for communication mismatches. A barrier synchronization point can be placed anywhere in a concurrent computing program. If a communication error occurred before the barrier synchronization point, it is at least detected when a node enters the barrier synchronization point. Once a node has reached the barrier synchronization point, it is not allowed to communicate with another node regarding data needed to execute the concurrent computing program, even if the other node has not yet reached the barrier synchronization point. Regions can be used instead of barrier synchronization points to detect a communication mismatch. The concurrent program on each node is separated into one or more regions, and two nodes can communicate with each other only when their regions are compatible. If their regions are not compatible, there is a communication mismatch.
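The two checks described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, its `region` and `in_barrier` fields, the `CommunicationMismatch` exception, and the helpers `send` and `try_send` are all hypothetical names. The abstract does not define what makes two regions "compatible", so the sketch assumes they are compatible only when they are equal.

```python
from dataclasses import dataclass


class CommunicationMismatch(Exception):
    """Raised when an exchange violates the barrier or region check."""


@dataclass
class Node:
    name: str
    region: int = 0            # hypothetical region identifier
    in_barrier: bool = False   # True once the node has reached the barrier point

    def enter_region(self, region: int) -> None:
        self.region = region

    def enter_barrier(self) -> None:
        self.in_barrier = True


def regions_compatible(a: int, b: int) -> bool:
    # Assumption: compatible means identical; the source leaves this open.
    return a == b


def send(sender: Node, receiver: Node, payload: object) -> object:
    """Deliver payload only if both checks pass; otherwise report a mismatch."""
    # Barrier check: a node that has reached the barrier may not exchange
    # program data, even if its peer has not reached the barrier yet.
    for node in (sender, receiver):
        if node.in_barrier:
            raise CommunicationMismatch(f"{node.name} is at the barrier")
    # Region check: nodes may communicate only from compatible regions.
    if not regions_compatible(sender.region, receiver.region):
        raise CommunicationMismatch(
            f"{sender.name} in region {sender.region}, "
            f"{receiver.name} in region {receiver.region}"
        )
    return payload


def try_send(sender: Node, receiver: Node, payload: object) -> str:
    """Recoverable variant: report the mismatch instead of propagating it."""
    try:
        send(sender, receiver, payload)
        return "ok"
    except CommunicationMismatch as exc:
        return f"mismatch: {exc}"


if __name__ == "__main__":
    a, b = Node("a"), Node("b")
    print(try_send(a, b, "x"))   # same region, neither node at the barrier
    b.enter_region(1)
    print(try_send(a, b, "x"))   # regions differ
    a.enter_region(1)
    a.enter_barrier()
    print(try_send(a, b, "x"))   # sender has reached the barrier
```

Raising an exception rather than blocking is a design choice of the sketch: it lets the caller recover from the mismatch, which is the behavior the title suggests, though the abstract does not say how recovery is performed.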
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP07810520.2A EP2049994B1 (en) | 2006-07-17 | 2007-07-17 | Recoverable error detection for concurrent computing programs |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/488,432 | 2006-07-17 | ||
US11/488,432 US7925791B2 (en) | 2006-07-17 | 2006-07-17 | Recoverable error detection for concurrent computing programs |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2008011012A2 (en) | 2008-01-24 |
WO2008011012A3 (en) | 2008-06-26 |
Family
ID=38950570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/016170 WO2008011012A2 (en) | 2006-07-17 | 2007-07-17 | Recoverable error detection for concurrent computing programs |
Country Status (3)
Country | Link |
---|---|
US (2) | US7925791B2 (en) |
EP (1) | EP2049994B1 (en) |
WO (1) | WO2008011012A2 (en) |
Families Citing this family (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7853639B2 (en) * | 2006-09-12 | 2010-12-14 | International Business Machines Corporation | Performing process migration with allreduce operations |
US7835284B2 (en) * | 2006-10-06 | 2010-11-16 | International Business Machines Corporation | Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by routing through transporter nodes |
US7680048B2 (en) * | 2006-10-06 | 2010-03-16 | International Business Machines Corporation | Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by dynamically adjusting local routing strategies |
US8031614B2 (en) * | 2006-10-06 | 2011-10-04 | International Business Machines Corporation | Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by dynamic global mapping of contended links |
US7839786B2 (en) * | 2006-10-06 | 2010-11-23 | International Business Machines Corporation | Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by semi-randomly varying routing policies for different packets |
US8423987B2 (en) * | 2007-01-30 | 2013-04-16 | International Business Machines Corporation | Routing performance analysis and optimization within a massively parallel computer |
US7706275B2 (en) * | 2007-02-07 | 2010-04-27 | International Business Machines Corporation | Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by employing bandwidth shells at areas of overutilization |
US8296430B2 (en) | 2007-06-18 | 2012-10-23 | International Business Machines Corporation | Administering an epoch initiated for remote memory access |
US8370844B2 (en) * | 2007-09-12 | 2013-02-05 | International Business Machines Corporation | Mechanism for process migration on a massively parallel computer |
US20090080442A1 (en) * | 2007-09-26 | 2009-03-26 | Narayan Ananth S | Conserving power in a multi-node environment |
US9065839B2 (en) * | 2007-10-02 | 2015-06-23 | International Business Machines Corporation | Minimally buffered data transfers between nodes in a data communications network |
US20090113308A1 (en) * | 2007-10-26 | 2009-04-30 | Gheorghe Almasi | Administering Communications Schedules for Data Communications Among Compute Nodes in a Data Communications Network of a Parallel Computer |
US8055879B2 (en) * | 2007-12-13 | 2011-11-08 | International Business Machines Corporation | Tracking network contention |
DE102007061986A1 (en) * | 2007-12-21 | 2009-06-25 | Bayerische Motoren Werke Aktiengesellschaft | communication system |
US9225545B2 (en) | 2008-04-01 | 2015-12-29 | International Business Machines Corporation | Determining a path for network traffic between nodes in a parallel computer |
US8140704B2 (en) * | 2008-07-02 | 2012-03-20 | International Business Machines Corporation | Pacing network traffic among a plurality of compute nodes connected using a data communications network |
US8375367B2 (en) * | 2009-08-06 | 2013-02-12 | International Business Machines Corporation | Tracking database deadlock |
US8365186B2 (en) | 2010-04-14 | 2013-01-29 | International Business Machines Corporation | Runtime optimization of an application executing on a parallel computer |
US20170004668A1 (en) | 2010-06-30 | 2017-01-05 | Microsafe Sa De Cv | Cash container |
US8595312B2 (en) * | 2010-06-30 | 2013-11-26 | Microsafe Sa De Cv | Master device detecting or host computer detecting new device attempting to connect to controller area network (CAN) |
US8782434B1 (en) | 2010-07-15 | 2014-07-15 | The Research Foundation For The State University Of New York | System and method for validating program execution at run-time |
US8504730B2 (en) | 2010-07-30 | 2013-08-06 | International Business Machines Corporation | Administering connection identifiers for collective operations in a parallel computer |
US8949453B2 (en) | 2010-11-30 | 2015-02-03 | International Business Machines Corporation | Data communications in a parallel active messaging interface of a parallel computer |
US8565120B2 (en) | 2011-01-05 | 2013-10-22 | International Business Machines Corporation | Locality mapping in a distributed processing system |
US9317637B2 (en) | 2011-01-14 | 2016-04-19 | International Business Machines Corporation | Distributed hardware device simulation |
US8949328B2 (en) | 2011-07-13 | 2015-02-03 | International Business Machines Corporation | Performing collective operations in a distributed processing system |
US8689228B2 (en) | 2011-07-19 | 2014-04-01 | International Business Machines Corporation | Identifying data communications algorithms of all other tasks in a single collective operation in a distributed processing system |
US9250948B2 (en) | 2011-09-13 | 2016-02-02 | International Business Machines Corporation | Establishing a group of endpoints in a parallel computer |
DE102011084569B4 (en) * | 2011-10-14 | 2019-02-21 | Continental Automotive Gmbh | Method for operating an information technology system and information technology system |
US8930962B2 (en) | 2012-02-22 | 2015-01-06 | International Business Machines Corporation | Processing unexpected messages at a compute node of a parallel computer |
US20130247069A1 (en) * | 2012-03-15 | 2013-09-19 | International Business Machines Corporation | Creating A Checkpoint Of A Parallel Application Executing In A Parallel Computer That Supports Computer Hardware Accelerated Barrier Operations |
US9122873B2 (en) | 2012-09-14 | 2015-09-01 | The Research Foundation For The State University Of New York | Continuous run-time validation of program execution: a practical approach |
US9069782B2 (en) | 2012-10-01 | 2015-06-30 | The Research Foundation For The State University Of New York | System and method for security and privacy aware virtual machine checkpointing |
JP5994601B2 (en) * | 2012-11-27 | 2016-09-21 | 富士通株式会社 | Parallel computer, parallel computer control program, and parallel computer control method |
US9323714B2 (en) * | 2012-12-06 | 2016-04-26 | Coherent Logix, Incorporated | Processing system with synchronization instruction |
JP6335527B2 (en) * | 2014-01-28 | 2018-05-30 | キヤノン株式会社 | System, system control method, and computer program |
US9372766B2 (en) * | 2014-02-11 | 2016-06-21 | Saudi Arabian Oil Company | Circumventing load imbalance in parallel simulations caused by faulty hardware nodes |
EP3146426A4 (en) * | 2014-05-21 | 2018-01-03 | Georgia State University Research Foundation, Inc. | High-performance computing framework for cloud computing environments |
US9696927B2 (en) * | 2014-06-19 | 2017-07-04 | International Business Machines Corporation | Memory transaction having implicit ordering effects |
US9940242B2 (en) | 2014-11-17 | 2018-04-10 | International Business Machines Corporation | Techniques for identifying instructions for decode-time instruction optimization grouping in view of cache boundaries |
US9733940B2 (en) | 2014-11-17 | 2017-08-15 | International Business Machines Corporation | Techniques for instruction group formation for decode-time instruction optimization based on feedback |
FR3039347B1 (en) * | 2015-07-20 | 2017-09-08 | Bull Sas | METHOD FOR SAVING THE WORKING ENVIRONMENT OF A USER SESSION ON A SERVER |
US10783165B2 (en) * | 2017-05-17 | 2020-09-22 | International Business Machines Corporation | Synchronizing multiple devices |
US11175978B2 (en) * | 2019-07-31 | 2021-11-16 | International Business Machines Corporation | Detection of an error generated within an electronic device |
JP2021043737A (en) * | 2019-09-11 | 2021-03-18 | 富士通株式会社 | Barrier synchronization system, barrier synchronization method and parallel information processing device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6216174B1 (en) * | 1998-09-29 | 2001-04-10 | Silicon Graphics, Inc. | System and method for fast barrier synchronization |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4914657A (en) * | 1987-04-15 | 1990-04-03 | Allied-Signal Inc. | Operations controller for a fault tolerant multiple node processing system |
US6029205A (en) * | 1994-12-22 | 2000-02-22 | Unisys Corporation | System architecture for improved message passing and process synchronization between concurrently executing processes |
US6138140A (en) * | 1995-07-14 | 2000-10-24 | Sony Corporation | Data processing method and device |
US6336161B1 (en) * | 1995-12-15 | 2002-01-01 | Texas Instruments Incorporated | Computer configuration system and method with state and restoration from non-volatile semiconductor memory |
US5768538A (en) * | 1996-04-30 | 1998-06-16 | International Business Machines Corporation | Barrier synchronization method wherein members dynamic voting controls the number of synchronization phases of protocols and progression to each new phase |
US5987477A (en) * | 1997-07-11 | 1999-11-16 | International Business Machines Corporation | Parallel file system and method for parallel write sharing |
JP3571976B2 (en) * | 1999-11-08 | 2004-09-29 | 富士通株式会社 | Debugging apparatus and method, and program recording medium |
US6651242B1 (en) * | 1999-12-14 | 2003-11-18 | Novell, Inc. | High performance computing system for distributed applications over a computer |
US6834358B2 (en) * | 2001-03-28 | 2004-12-21 | Ncr Corporation | Restartable database loads using parallel data streams |
US7117248B1 (en) * | 2001-09-28 | 2006-10-03 | Bellsouth Intellectual Property Corporation | Text message delivery features for an interactive wireless network |
US7305582B1 (en) * | 2002-08-30 | 2007-12-04 | Availigent, Inc. | Consistent asynchronous checkpointing of multithreaded application programs based on active replication |
JP4276028B2 (en) * | 2003-08-25 | 2009-06-10 | 株式会社日立製作所 | Multiprocessor system synchronization method |
WO2006002076A2 (en) | 2004-06-15 | 2006-01-05 | Tekelec | Methods, systems, and computer program products for content-based screening of messaging service messages |
US7484160B2 (en) | 2005-03-04 | 2009-01-27 | Tellabs Operations, Inc. | Systems and methods for delineating a cell in a communications network |
US7779242B2 (en) * | 2005-12-22 | 2010-08-17 | International Business Machines Corporation | Data processing system component startup mode controls |
US20070174484A1 (en) | 2006-01-23 | 2007-07-26 | Stratus Technologies Bermuda Ltd. | Apparatus and method for high performance checkpointing and rollback of network operations |
US7796527B2 (en) | 2006-04-13 | 2010-09-14 | International Business Machines Corporation | Computer hardware fault administration |
US7610510B2 (en) * | 2007-02-16 | 2009-10-27 | Symantec Corporation | Method and apparatus for transactional fault tolerance in a client-server system |
2006
- 2006-07-17: US application US11/488,432, patent US7925791B2 (Active)
2007
- 2007-07-17: EP application EP07810520.2A, patent EP2049994B1 (Active)
- 2007-07-17: US application US11/879,383, patent US8055940B2 (Expired - Fee Related)
- 2007-07-17: WO application PCT/US2007/016170, published as WO2008011012A2 (Application Filing)
Non-Patent Citations (2)
Title |
---|
JOHNSON T A ET AL: "Cyclical cascade chains: a dynamic barrier synchronization mechanism for multiprocessor systems", PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM., PROCEEDINGS 15TH INTERNATIONAL SAN FRANCISCO, CA, USA 23-27 APRIL 2001, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 23 April 2001 (2001-04-23), pages 2061 - 2068, XP010544629, ISBN: 0-7695-0990-8 * |
KLAIBER A C ET AL: "A COMPARISON OF MESSAGE PASSING AND SHARED MEMORY ARCHITECTURES FOR DATA PARALLEL PROGRAMS", PROCEEDINGS OF THE ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE. CHICAGO, APRIL 18 - 21, 1994, LOS ALAMITOS, IEEE COMP. SOC. PRESS, US, vol. SYMP. 21, 18 April 1994 (1994-04-18), pages 94 - 105, XP000480428, ISBN: 0-8186-5510-0 * |
Also Published As
Publication number | Publication date |
---|---|
WO2008011012A2 (en) | 2008-01-24 |
US7925791B2 (en) | 2011-04-12 |
EP2049994B1 (en) | 2016-03-30 |
US20090006621A1 (en) | 2009-01-01 |
US8055940B2 (en) | 2011-11-08 |
EP2049994A2 (en) | 2009-04-22 |
US20080016249A1 (en) | 2008-01-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008011012A3 (en) | Recoverable error detection for concurrent computing programs | |
WO2007009009A3 (en) | Systems and methods for identifying sources of malware | |
WO2007117567A3 (en) | Malware detection system and method for limited access mobile platforms | |
WO2009038818A3 (en) | System and method for providing network penetration testing | |
WO2014078585A3 (en) | Methods, systems and computer readable media for detecting command injection attacks | |
WO2007005524A3 (en) | Systems and methods for identifying malware distribution sites | |
WO2010118011A3 (en) | Invariants-based learning method and system for failure diagnosis in large scale computing systems | |
WO2007015711A3 (en) | Fault recovery for real-time, multi-tasking computer system | |
WO2009115957A3 (en) | Distributed spectrum sensing | |
WO2017064560A8 (en) | Centralized management of a software defined automation system | |
WO2014138205A3 (en) | Methods, systems, and computer readable media for detecting a compromised computing host | |
WO2008092162A3 (en) | Systems, methods, and media for recovering an application from a fault or attack | |
WO2006121990A3 (en) | Fault tolerant computer system | |
WO2007127764A3 (en) | Automated analysis of collected field data for error detection | |
WO2011056880A3 (en) | Rollback feature | |
WO2009020837A3 (en) | Synching data | |
NO20092482L (en) | System analysis and handling | |
ATE555430T1 (en) | SYSTEMS AND PROCEDURES FOR COMPUTER SECURITY | |
WO2008155188A3 (en) | Firewall control using remote system information | |
WO2008103286A3 (en) | Assessment and analysis of software security flaws | |
WO2008008765A3 (en) | Role-based access in a multi-customer computing environment | |
WO2011100600A3 (en) | Methods, systems and computer readable media for providing priority routing at a diameter node | |
WO2007008845A3 (en) | Fault tolerant gaming systems | |
WO2010042452A3 (en) | Machine learning for transliteration | |
WO2007101117A3 (en) | Systems and methods of network monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 07810520; Country of ref document: EP; Kind code of ref document: A2 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | NENP | Non-entry into the national phase | Ref country code: RU |
 | WWE | WIPO information: entry into national phase | Ref document number: 2007810520; Country of ref document: EP |