HPC Wire

Since 1987 - Covering the Fastest Computers in the World and the People Who Run Them

BSC Participates in European Public Procurement of Innovation Solutions

Wed, 07/05/2017 - 08:30

BARCELONA, July 5, 2017 — Barcelona Supercomputing Center (BSC), together with four other leading European HPC centres, is participating for the first time in a market consultation for the purchase of HPC systems, as published today in the Official Journal of the European Union (OJEU). The Public Procurement of Innovative Solutions (PPI) involves four public procurers (BSC, CINECA, FZJ/JSC and GENCI) located in four different countries (Spain, Italy, Germany and France) working together in a joint procurement. This collaboration takes place within the Public Procurement of Innovative Solutions for High-Performance Computing (PPI4HPC) project, funded by the European Commission.

On 6 September 2017, an Open Dialogue Event (ODE) will take place in Brussels, with the aim of informing all interested suppliers about the procurers’ expectations and plans, as well as gathering feedback from the market. The PPI is an administrative action to foster innovation, geared towards enhancing the development of new innovative markets from the demand side through the instrument of public procurement.

BSC Operations Director and PPI4HPC project lead at BSC, Sergi Girona, points out that “for the first time in Europe, a joint European procurement of innovative HPC systems will organise a meeting with IT companies and purchase the HPC technologies of the future. BSC, as one of the main supercomputing centres in Europe, will be part of this innovative project”.

BSC solution focused on compute and storage infrastructure

BSC will acquire a compute and storage infrastructure for high performance data analytics (HPDA), to be installed during the first half of 2019. The BSC HPDA infrastructure will combine compute nodes with innovative storage technologies such as NVRAM in conjunction with storage technologies already in place at BSC, such as standard hard drives and tape infrastructure, configured in a tiered storage solution to store hundreds of petabytes of scientific data. This solution will allow data to be supplied for BSC’s HPC and HPDA resources in the near future, during the pre-exascale period.

As for the compute element, a high performance for data analytics infrastructure will be acquired to complement HPC production clusters currently at BSC, which will pre-process simulation data with new big data or analytics paradigms and algorithms.

The BSC solution is thus expected to provide innovative technologies for massive data storage and tiering. For further information about the PPI, please visit the PPI4HPC project’s website.


A group of leading European supercomputing centres decided to form a buyers’ group to execute a joint Public Procurement of Innovative Solutions (PPI) for the first time in the area of high-performance computing (HPC). The co-funding by the European Commission (EC) will allow for a significant enhancement of the planned pre-exascale HPC infrastructure from 2019 and pave the way for future joint investments in Europe. The total investment planned will be about €73 million. The HPC centres involved – BSC, CEA/GENCI, CINECA and JUELICH – have a strong track record in providing supercomputing resources at European level.


Source: BSC

The post BSC Participates in European Public Procurement of Innovation Solutions appeared first on HPCwire.

The Virtual Institute – High Productivity Supercomputing Celebrates 10th Anniversary

Wed, 07/05/2017 - 08:27

July 5, 2017 — The perpetual focus on hardware performance as a primary success metric in high-performance computing (HPC) often diverts attention from the role of people in the process of producing application output. But it is ultimately this output and the rate at which it can be delivered, in other words the productivity of HPC, which justifies the huge investments in this technology. However, the time needed to come up with a specific result, or the “time to solution” as it is often called, depends on many factors, including the speed and quality of software development. This is one of the solution steps where people play a major role. Obviously, their productivity can be enhanced with tools such as debuggers and performance profilers, which help them to find and eliminate errors or diagnose and improve performance.

Ten years ago, the Virtual Institute – High Productivity Supercomputing (VI-HPS) was created exactly with this goal in mind. Application developers should be able to focus on the science they aim to accomplish instead of having to spend major portions of their time solving problems related to their software. With initial funding from the Helmholtz Association, the umbrella organization of the major national research laboratories in Germany, the institute was founded on the initiative of Forschungszentrum Jülich together with RWTH Aachen University, Technische Universität Dresden, and the University of Tennessee.

Since then, the members of the institute have developed powerful programming tools, in particular for the purpose of analyzing HPC application correctness and performance, which are today used across the globe. Major emphasis was given to the definition of common interfaces and exchange formats between these tools to improve the interoperability between them and lower their development cost. A series of international tuning workshops and tutorials taught hundreds of application developers how to use them. Finally, the institute organized numerous academic workshops to foster the HPC tools community and offer especially young researchers a forum to present novel program analysis methods. Today, the institute encompasses twelve member organizations from five countries.

On June 23rd, 2017, the institute celebrated its 10th anniversary at a workshop held in Seeheim, Germany. Anshu Dubey from Argonne National Laboratory, one of the keynote speakers, explained that in HPC usually all parts of the software are under research, an important difference from software development in many other areas, leading to an economy of incentives where pure development is often not appropriately rewarded. In his historical review, Felix Wolf from TU Darmstadt, the spokesman of VI-HPS, looked back on important milestones such as the bylaws introduced to cover the rapid expansion of VI-HPS a few years ago. In another keynote, Satoshi Matsuoka from the Tokyo Institute of Technology/AIST, Japan, highlighted recent advances in artificial intelligence and Big Data analytics as well as the challenges these pose for the design of future HPC systems. Finally, all members of VI-HPS presented their latest productivity-related research and outlined their future strategies.

Workshop website:

Source: The Virtual Institute – High Productivity Supercomputing


Intersect360 Survey Shows Continued InfiniBand Dominance

Tue, 07/04/2017 - 08:48

There were few surprises in Intersect360 Research’s just released report on interconnect use in HPC. InfiniBand and Ethernet remain the dominant protocols across all segments (system, storage, LAN), and Mellanox and Cisco lead the supplier pack. The big question is when, or if, Intel’s Omni-Path fabric will break through. Less than one percent of the sites surveyed (system and storage interconnect) reported using Omni-Path.

“Although this share trails well behind Mellanox, Intel will move toward the integration of its interconnect technologies into its processor roadmap with the introduction of Omni-Path. This potentially changes the dynamics of the system fabric business considerably, as Intel may begin to market the network as a feature extension of the Intel processing environment,” according to the Intersect360 Research report.

“For its part, Mellanox has preemptively responded both strategically and technologically, surrounding itself with powerful partners in the OpenPOWER Foundation and coming to market with features such as multi-host technologies, which argue for keeping host-bus adapter technology off-chip.”

Indeed, Mellanox and Intel have waged a war of words over the past two years surrounding the direction of network technology and the merits of off-loading networking instructions from the host CPU and distributing more processing power throughout the network. Of course, these are still early days for Omni-Path. Battling benchmarks aside, InfiniBand remains firmly entrenched at the high end, although 100 Gigabit Ethernet is also gaining traction.

It is important to note that the Intersect360 data is from its 2016+ HPC site survey. This most recent, ninth consecutive Intersect360 survey was conducted in the second and third quarters of 2016 and received responses from 240 sites. Combined with entries from the prior two surveys, 487 HPC sites are represented in the 2016 Site Census reports. In total, 474 sites reported interconnect and network characteristics for 723 HPC systems, 633 storage systems, and 638 LANs. The next survey should provide stronger directional guidance for Omni-Path.

Among key highlights from the report are:

  • Over 30% of system interconnect and LAN installations reported using 1 Gigabit Ethernet. “We believe that these slower technologies are often used as secondary administrative connections on clusters, and as primary interconnect for small throughput-oriented clusters. In the LAN realm, we see Gigabit Ethernet as still in use for smaller organizations and/or for subnetworks supporting departments/workgroups within larger organizations. Still, the tenacity of this technology surprises us.”
  • In about 72% of mentions, Gigabit Ethernet was cited as a secondary interconnect, not the primary one. Gigabit Ethernet comes standard on many systems as a cluster interconnect, contributing to its high use in distributed-memory systems.
  • InfiniBand continues to be the preferred high-performance system interconnect. “If we exclude Ethernet 1G (dubbed high-performance interconnect), installations of InfiniBand are about two times the combined installations of Ethernet (10G, 40G, and 100G). Within InfiniBand installations, InfiniBand 40G continues to be the most installed. However, for systems acquired since 2014, InfiniBand 56G is the most popular choice for systems.”
  • Ten Gigabit Ethernet is used more for storage and LAN installations than any other protocol. Installations of 10 Gigabit Ethernet account for 35% of all storage networks reported and 35% of all LANs reported. InfiniBand has been gradually increasing its share of storage networks, rising to 34% from 31%, with almost all of this coming from InfiniBand 56G.
Figure 3 provides a visual of the transition to higher speeds for InfiniBand system interconnects for systems reported since our very first Site Census survey in 2008. For systems acquired in 2007, we saw about equal distribution between InfiniBand 10G and 20G. By 2009, more systems reported using InfiniBand 40G. InfiniBand 40G accounted for the majority share of systems until 2014, when InfiniBand 56G took over as the primary installed system interconnect. These transitions are fast, with the latest performance leader accounting for the majority of shipments within two to three years of availability.

Two main drivers of the overall market, reports Intersect360, are 1) the growth in data volume and the stress it puts on interconnects, and 2) a persistent “if it’s not broke, don’t fix it” attitude with regard to switching to new technologies. Ethernet is benefiting from the latter.

Parallelization of code is another major influence. “Architecting interconnects for parallel applications performance has long been a major concern for MPP systems which are built around proprietary interconnects, and supercomputer-class clusters which tend to use the fastest general-purpose network technology. We believe that the trend towards greater application parallelization at all levels will drive requirements for network performance down market into high-end and midrange computing configurations,” according to Intersect360.

The report is best read in full; here’s a brief excerpt from its conclusion:

“The transition to the latest or faster interconnect appears to be occurring at about the same rate as the life cycle of servers – every two to three years. With each system refresh, the latest or best price/performance interconnect is chosen. Ultimately, though, application needs drive what system performance requirements are needed. The cost of components limit the rate of adoption. Our data suggests most academic and government sites, along with some of the commercial sites, particularly energy, large manufacturing, and bio-science sites, value the performance of InfiniBand for system interconnects. Many of the applications in these areas support and users leverage multi-processing, GPUs, and multi-core architectures.”

Perhaps not surprisingly, Mellanox was the top supplier for system interconnects (42% of mentions) and storage networks (35% of mentions) – in fact, Intersect360 reports Mellanox gained market share in all segments compared with its 2015 showing. Cisco continues to be the leading supplier for the LAN market, with 46% of mentions, according to Intersect360.

Link to report summary: http://www.intersect360.com/industry/reports.php?id=149


NEC Accelerates Machine Learning for Vector Computers

Mon, 07/03/2017 - 14:39

TOKYO, July 3 — NEC Corporation today announced that it has developed data processing technology that accelerates the execution of machine learning on vector computers by more than 50 times in comparison to Spark technologies (*1).

This newly developed data processing technology utilizes computing and communications techniques that leverage “sparse matrix” data structures in order to significantly accelerate the performance of vector computers in machine learning.
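The announcement does not disclose NEC’s actual data structures, but the kind of sparse-matrix kernel it refers to can be illustrated with a standard compressed sparse row (CSR) matrix-vector product. The layout and names below are illustrative assumptions, not NEC’s implementation:

```c
#include <assert.h>

/* Illustrative sketch only: a compressed sparse row (CSR) matrix, a common
 * "sparse matrix" data structure. Only nonzero entries are stored, so the
 * matrix-vector product skips all the zeros a dense kernel would touch. */
typedef struct {
    int nrows;
    const int *row_ptr;   /* nrows+1 entries: start of each row in cols/vals */
    const int *cols;      /* column index of each stored nonzero */
    const double *vals;   /* value of each stored nonzero */
} csr_matrix;

/* y = A * x, visiting only the stored nonzeros of A */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->vals[k] * x[A->cols[k]];
        y[i] = sum;
    }
}
```

On a vector machine the inner gather over `cols` is exactly the kind of memory access pattern that specialized hardware and communication support can accelerate.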

Furthermore, NEC developed middleware that incorporates sparse matrix structures in order to simplify the use of machine learning. As a result, users are able to easily launch this middleware from Python or Spark infrastructures, which are commonly used for data analysis, without special programming.

“This technology enables users to quickly benefit from the results of machine learning, including the optimized placement of web advertisements, recommendations, and document analysis,” said Yuichi Nakamura, General Manager, System Platform Research Laboratories, NEC Corporation. “Furthermore, low-cost analysis using a small number of servers enables a wide range of users to take advantage of large-scale data analysis that was formerly only available to large companies.”

NEC’s next-generation vector computer (*2) is being developed to flexibly meet a wide range of price and performance needs. This data processing technology expands the capabilities of next-generation vector computers to include large-scale data analysis, such as machine learning, in addition to numerical calculation, the conventional specialty of vector computers.

NEC will introduce this technology on July 5 at the International Symposium on Parallel and Distributed Computing 2017 (ISPDC-2017), held in Innsbruck, Austria, from Monday, July 3 to Thursday, July 6. For more information on ISPDC-2017, please visit the following link: http://ispdc2017.dps.uibk.ac.at/


*1) Spark is an open-source distributed processing framework developed by the Apache Software Foundation, used in clusters connecting multiple servers.
*2) NEC begins developing next-generation vector supercomputer

Source: NEC


Atmospheric Data Solutions Taps PSSC Labs to Provide HPC Clusters for Weather Modeling

Mon, 07/03/2017 - 14:36

LAKE FOREST, Calif., July 3 — PSSC Labs, a developer of custom high performance computing (HPC) and big data computing solutions, today announced it is working with Atmospheric Data Solutions, LLC (ADS) to provide powerful, turn-key HPC Cluster solutions for its weather modeling products.

Atmospheric Data Solutions works with various public and private agencies, including major utility providers, to develop atmospheric science products that help mitigate and manage risk from severe weather and future climate change. The weather modeling solutions that ADS creates include high impact weather forecast guidance products, tailored regional wildfire forecast guidance products, and utility load and outage forecasts – all requiring analysis of large quantities of data, which demands high performance computing to maximize accuracy and the number of times models can be run daily.

PSSC Labs will work with ADS to provide powerful and customized supercomputing solutions for their weather modeling products, maximizing performance while staying within the budgetary constraints of each organization utilizing the end product. In addition to deploying PSSC Labs’ PowerWulf Clusters, ADS works with PSSC Labs to ensure the installation of custom modeling software on all HPC solutions, providing a truly turn-key solution that is delivered ready to use.

The PowerWulf Cluster consists of 768 Intel Xeon Processor Cores, 4 Nvidia Tesla GPU Adapters, 2.1 TB System Memory, and 40TB+ Storage, all connected via Mellanox InfiniBand Interconnects, with additional configurations available. The PowerWulf Cluster includes PSSC Labs’ CBeST Cluster Management Toolkit to simplify the management, monitoring and maintenance. PSSC Labs will continue to support the HPC Cluster by providing operating system upgrades and continued system maintenance.

“PSSC Labs was accommodating every step of the way, whether it was finding the best hardware configuration within our client’s budget or allowing our own engineers on site to work on the clusters before delivery,” said Scott Capps, Principal and Founder of ADS. “The result is that our clients can now run models four times a day, as opposed to only twice a day with previous HPC setups, with the results from the models delivered faster as well.”

PSSC Labs’ PowerWulf HPC Clusters offer a reliable, flexible, high performance computing platform for a variety of applications in the following verticals: Design & Engineering, Life Sciences, Physical Science, Financial Services and Machine/Deep Learning.

Every PowerWulf HPC Cluster includes a three-year unlimited phone/email support package (additional years of support available), with all support provided by their US-based team of experienced engineers. Prices for a custom-built PowerWulf HPC Cluster solution start at $20,000. For more information see http://www.pssclabs.com/solutions/hpc-cluster/

About PSSC Labs

For technology powered visionaries with a passion for challenging the status quo, PSSC Labs is the answer for hand-crafted HPC and Big Data computing solutions that deliver relentless performance with the absolute lowest total cost of ownership. All products are designed and built at the company’s headquarters in Lake Forest, California. For more information: 949-380-7288, www.pssclabs.com, sales@pssclabs.com.

Source: PSSC Labs


‘Qudits’ Join the Strange Zoo of Quantum Computing

Mon, 07/03/2017 - 12:57

By now the sheer repetition of the term qubit has made it seem comprehensible and quantum computing not so strange. Brace yourself. Here comes the ‘qudit’ – another form of quantum information but one that is able to assume very many values at once.

“Instead of creating quantum computers based on qubits that can each adopt only two possible options, scientists have now developed a microchip that can generate “qudits” that can each assume 10 or more states, potentially opening up a new way to creating incredibly powerful quantum computers, a new study finds,” writes Charles Choi for the IEEE Spectrum.

Choi’s article, ‘Qudits: The Real Future of Quantum Computing?’ was posted last Friday and briefly examines work published at the same time in Nature, ‘On-chip generation of high-dimensional entangled quantum states and their coherent control,’ suggesting a way to create these multi-dimensional qudits.

Here’s a brief excerpt from the IEEE Spectrum article:

“Now scientists have for the first time created a microchip that can generate two entangled qudits each with 10 states, for 100 dimensions total, more than what six entangled qubits could generate. “We have now achieved the compact and easy generation of high-dimensional quantum states,” says study co-lead author Michael Kues, a quantum optics researcher at Canada’s National Institute of Scientific Research, or INRS, its French acronym, in Varennes, Quebec.

“The researchers developed a photonic chip fabricated using techniques similar to ones used for integrated circuits. A laser fires pulses of light into a micro-ring resonator, a 270-micrometer-diameter circle etched onto silica glass, which in turn emits entangled pairs of photons. Each photon is in a superposition of 10 possible wavelengths or colors.

“For example, a high-dimensional photon can be red and yellow and green and blue, although the photons used here were in the infrared wavelength range,” Kues says. Specifically, one photon from each pair spanned wavelengths from 1534 to 1550 nanometers, while the other spanned from 1550 to 1566 nanometers.”

So just when your head stopped spinning at the sound of the word qubit, along comes the qudit. In fairness, the IEEE article points out scientists have long known about the possibility of using qudits and notes, “A quantum computer with 300 qubits could perform more calculations in an instant than there are atoms in the known universe, solving certain problems much faster than classical computers. In principle, a quantum computer with two 32-state qudits would be able to perform as many operations as 10 qubits while skipping the challenges inherent with working with 10 qubits together.”

The feature image is of the microchip fabricated by the researchers. Below is a diagram (Nature) of the work.

Researchers used the setup pictured above to create, manipulate, and detect qudits. The experiment starts when a laser fires pulses of light into a micro-ring resonator, which in turn emits entangled pairs of photons. Because the ring has multiple resonances, the photons have optical spectrums with a set of evenly spaced frequencies (red and blue peaks), a process known as spontaneous four-wave mixing (SFWM). The researchers were able to use each of the frequencies to encode information, which means the photons act as qudits. Each qudit is in a superposition of 10 possible states, extending the usual binary alphabet (0 and 1) of quantum bits. The researchers also showed they could perform basic gate operations on the qudits using optical filters and modulators, and then detect the results using single-photon counters.

Link to IEEE Spectrum article: http://spectrum.ieee.org/tech-talk/computing/hardware/qudits-the-real-future-of-quantum-computing

Link to Nature paper: https://www.nature.com/articles/nature22986.epdf?referrer_access_token=m2Cde8lf2Zh2R9vqdRitfdRgN0jAjWel9jnR3ZoTv0PJityhJkSWpq1THf-VSsArUhH5B2sAknySsan793cm3_eBBo9MOlyHeYxjGaqZnurhzcH7meLV3MMg5Q5-D4vlMlU-NCaRIE4XBnNREmU0z1WU8YYGcro3-m56ZnOv-djeJfdioz8743j4LAE5I8vkMm6oc8W8_hmdFSbxIjbVWNw4YvBWh0_Ct8hYflCuOY38KpBEFFTmoncxMDjN8a7vpt_r52ScoN43wj4CEhpr7A%3D%3D&tracking_referrer=spectrum.ieee.org


Optimizing Codes for Heterogeneous HPC Clusters Using OpenACC

Mon, 07/03/2017 - 07:00

Looking at the Top500 and Green500 ranks, one clearly realizes that most HPC systems are heterogeneous architectures using COTS (Commercial Off-The-Shelf) hardware, combining traditional multi-core CPUs with massively parallel accelerators, such as GPUs and MICs.

With processor frequencies now hitting a solid wall, the only truly open avenue for riding Moore’s law today is increasing hardware parallelism in several different ways: more computing nodes, more processors in each node, more cores within each processor, and longer vector instructions in each core. This trend means that applications must learn to use all these levels of hardware parallelism efficiently if we want to see performance measured at the application level grow consistently with hardware performance. Adding to this complexity, single computing nodes adopt different architectures, with multi-core CPUs supporting different instruction sets, vector lengths and cache organizations. GPUs provided by different vendors also have different architectures in terms of number of cores, cache organization, etc. For code developers, the current goal is to map all the parallelism available at the application level onto all hardware resources using architecture-oblivious approaches, targeting portability of both code and performance across different architectures.

Several programming languages and frameworks try to tackle the different levels of parallelism available in hardware systems, but most of them are not portable across different architectures. As an example, GPUs are widely used for scientific HPC applications because ad-hoc proprietary languages (e.g., CUDA for Nvidia GPUs) have made possible a reasonable compromise between ease of programming and performance, but these languages are by definition not portable to different accelerators. Several open-standard languages have tried to address this problem (e.g., OpenCL), targeting in principle multiple architectures, but the lack of support from various vendors has limited their usefulness.

The need to exploit the computing power of these systems, in conjunction with the lack of standardization in their hardware and/or programming frameworks, has raised new issues for software development, strongly impacting software maintainability, portability and performance. The use of proprietary languages targeting specific architectures, or open-standard languages not embraced by all vendors, has often led to multiple implementations of the same code to target different architectures. For this reason there are several implementations of various scientific codes, e.g., MPI plus OpenMP and C/C++ to target CPU-based clusters; MPI plus CUDA to target Nvidia GPU-based clusters; or MPI plus OpenCL for AMD GPU-based clusters.

The developers who pursued this strategy soon realized that maintaining multiple versions of the same code is very expensive. This is even worse for scientific software development, since it is often characterized by frequent code modifications, by the need for strong performance optimization, and by a long software lifetime, which may span tens of years. Ideally, a programming language for scientific HPC applications should be portable across most current architectures, allow applications to run efficiently, and moreover enable them to run on future architectures without requiring a complete code rewrite.

Directive-based programming models try to address exactly this problem, abstracting parallel programming to a descriptive level, where programmers help the compiler identify parallelism in the code, as opposed to a prescriptive level, where programmers must specify how the code should be mapped onto the hardware of the target machine.

OpenMP (Open Multi-Processing) is probably the most common of such programming models, already used by a wide scientific community, but it was not initially designed to support accelerators. To fill this gap, in November 2011 a new standard named OpenACC (Open Accelerators) was proposed by Cray, PGI, Nvidia, and CAPS. OpenACC is a programming standard for parallel computing allowing programmers to annotate C, C++ or Fortran codes to suggest to the compiler parallelizable regions to be offloaded to a generic accelerator.
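As a minimal sketch of this descriptive style (a generic toy kernel, not code from any project named in the article), a C loop can be annotated so that an OpenACC compiler may offload it, while any other compiler simply ignores the pragma and runs it as plain C:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal OpenACC sketch: the directive asserts that the iterations are
 * independent and suggests offloading them, together with the data
 * movement needed (copy x in, copy y in and back out). Built without an
 * OpenACC-capable compiler, the pragma is ignored and the loop runs
 * serially on the CPU, unchanged. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The same source therefore stays valid for CPU-only builds, which is the portability property the article emphasizes.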

Both OpenMP and OpenACC are based on directives: OpenMP was introduced to manage parallelism on traditional multi-core CPUs, while OpenACC was initially developed to fill the missing accelerator support in OpenMP. Today these two frameworks are converging and extending their scope to cover a large subset of HPC architectures: OpenMP version 4.0 has been designed to also support code offloading to accelerators, while compilers supporting OpenACC (such as PGI or GCC) are starting to use the same directives to also target multi-core CPUs.

“First as a member of the Cray technical staff and now as a member of the Nvidia technical staff, I am working to ensure that OpenMP and OpenACC move towards parity whenever possible,” said James Beyer, co-chair of the OpenMP accelerator sub-committee and the OpenACC technical committee.

Back in 2014, our research group at the University of Ferrara, in collaboration with the Theoretical Physics group of the University of Pisa, started the development of a Lattice QCD Monte Carlo application, aiming to make it portable to different heterogeneous HPC systems. From the computational point of view, this kind of simulation mainly executes stencil operations performing complex vector-matrix multiplications on a 4-dimensional lattice.

At the time we were using two different versions developed within the Pisa group: a C++ implementation targeting CPU based clusters and a C++/CUDA implementation targeting Nvidia GPU based clusters. Maintaining the two different versions was particularly expensive, so the availability of a language such as OpenACC offered the interesting possibility to move towards a single portable implementation. The main interest was towards GPU based clusters, but we also aimed to target other architectures like the Intel Knights Landing (KNL, not available yet at the time).

We started this project coming from an earlier experience of porting a similar application to OpenCL, which, although an open standard, later ceased to be supported on Nvidia GPUs, forcing us to completely rewrite the application. From this point of view, a directive-based OpenACC code provides some additional safeguard: when its directives are ignored, it is still a perfectly working plain C, C++ or Fortran code, which can be “easily” re-annotated using other directives and run on other architectures.

Although decorating a code with directives seems a straightforward operation requiring minimal programming efforts, this is often not enough if performance portability is required in addition to just code portability.

Just to mention one issue, memory data layout has a strong impact on performance across different architectures, and this design step is critical when implementing new codes, as changing data layout at a later stage is seldom a viable option. The two C++ and CUDA versions we were starting from diverged exactly in the data layout used to store the lattice: we had an AoS (Array of Structures) layout for the CPU-optimized version and an SoA (Structure of Arrays) layout for GPUs.
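The AoS/SoA distinction can be sketched in C as follows; the three-component “site” and all names here are illustrative assumptions, not the actual lattice data structures:

```c
#include <assert.h>

#define NSITES 4

/* Array of Structures (AoS): all components of one site are adjacent in
 * memory. Convenient for scalar code that works on one site at a time,
 * but consecutive sites' c0 components are strided, hindering wide loads. */
typedef struct { double c0, c1, c2; } site_aos;
site_aos lattice_aos[NSITES];

/* Structure of Arrays (SoA): component c0 of all sites is contiguous, so
 * a vector unit (or a GPU warp) can fetch many sites with one wide,
 * unit-stride access. */
typedef struct {
    double c0[NSITES];
    double c1[NSITES];
    double c2[NSITES];
} lattice_soa;
lattice_soa lattice;
```

The strides make the trade-off concrete: in AoS, moving to the next site’s `c0` jumps over the other components, while in SoA it is one `double` away.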

We started by porting the computationally most intensive kernel of the full code, the so-called Dirac operator, to plain C, annotating it with OpenACC directives, and developed a first benchmark. This benchmark was used to evaluate possible performance drawbacks associated with an architecture-agnostic implementation. It provided very useful information on the performance impact of different data layouts; we were happy to learn that the Structure of Arrays (SoA) memory data layout is preferred not only on GPUs but also on modern CPUs, if vectorization is enforced. This stems from the fact that the SoA format allows vector units to process many sites of the application domain (the lattice, in our case) in parallel, favoring architectures with long vector units (e.g., with wide SIMD instructions). Modern CPUs tend to have longer and longer vector units, and we expect this trend to continue in the future. For this reason, the data structures related to the lattice in our code were designed to follow the SoA paradigm.

Since at that time no OpenACC compiler targeting CPUs was able to exploit vector instructions, we replaced the OpenACC directives with OpenMP ones and compiled the code using the Intel compiler. Table 1 shows the results of this benchmark.

After this initial benchmark, further development iterations led to a full implementation of the complete Monte Carlo code, annotated with OpenACC directives and portable across several architectures. To give an idea of the level of performance portability, we report in Table 2 the execution times of the Dirac operator, compiled with the PGI 16.10 compiler (which now also targets multi-core CPUs), on a variety of architectures: Haswell and Broadwell Intel CPUs, the AMD W9100 GPU, and Kepler and Pascal Nvidia GPUs.

Concerning code portability, we have shown that the same user-grade code runs on an interesting variety of state-of-the-art architectures. Concerning performance portability, some issues are still present. The Dirac operator is strongly memory-bound, so both Intel CPUs should be roughly three times slower than Kepler GPUs, in line with their respective memory bandwidths (about 70 GB/s vs. 240 GB/s); what we measure instead is that performance is approximately 10 times worse on the Haswell CPU than on one K80 GPU. The Broadwell CPU runs approximately two times faster than the Haswell CPU, at least for some lattice sizes, but still does not reach the memory-bandwidth limit. We have identified two main reasons for this non-optimal behavior, both of which point to still-immature features of the PGI compiler when targeting x86 architectures:

  • Parallelization: when encountering nested loops, the compiler splits the outer loop across different threads, while inner loops are executed serially or vectorized within each thread. Thus, in this implementation, the 4 nested loops over the 4 lattice dimensions cannot be divided into a number of threads large enough to exploit all the available cores of modern CPUs.
  • Vectorization: as reported by the compilation logs, the compiler fails to vectorize the Dirac operator. To verify whether this is related to how we coded these functions, we translated the OpenACC directives into the corresponding OpenMP ones, without changing the C code, and compiled with the Intel compiler (version 17.0.1). In this case the compiler succeeds in vectorizing the function, which then runs a factor of 2 faster.

Concerning AMD GPUs, performance is also worse than expected and the compiler is not yet sufficiently stable (we experienced erratic compiler crashes). To make things worse, support for this architecture has been dropped by the PGI compiler (16.10 is the last version supporting AMD devices), so unless other compilers appear on the market, running OpenACC applications on AMD GPUs will not be easy in the future.

On Nvidia GPUs, on the other hand, performance is similar to that of our previous CUDA implementation, with a maximum performance drop of 25 percent for the full simulation code, and only under some particular simulation conditions.

In conclusion, a portable implementation of a full Monte Carlo LQCD simulation is now in production on CPU and GPU clusters. The code runs efficiently on Nvidia GPUs, while performance on Intel CPUs could still be improved; we are confident that future releases of the PGI compiler will close the gap. We are also able to run on AMD GPUs, but for this architecture compiler support is an open issue with little hope for the future. In the near term we look forward to testing our code on the Intel KNL, as soon as reasonably stable official PGI support for that processor becomes available. As a final remark, we have shown that translating OpenACC codes to OpenMP and vice versa is a reasonably easy task, so, whichever standard wins out, we see a bright future for our application.


Claudio Bonati, INFN and University of Pisa
Simone Coscetti, INFN Pisa
Massimo D’Elia, INFN and University of Pisa
Michele Mesiti, INFN and University of Pisa
Francesco Negro, INFN Pisa
Enrico Calore, INFN and University of Ferrara
Sebastiano Fabio Schifano, INFN and University of Ferrara
Giorgio Silvi, INFN and University of Ferrara
Raffaele Tripiccione, INFN and University of Ferrara

The post Optimizing Codes for Heterogeneous HPC Clusters Using OpenACC appeared first on HPCwire.

Baidu Announces Inaugural AI Developer Conference “Baidu Create”

Fri, 06/30/2017 - 08:14

BEIJING, June 27, 2017 — Baidu Inc. (NASDAQ:BIDU), the Chinese language Internet search provider, announced today that it will hold its first artificial intelligence (AI) developer conference, “Baidu Create,” at the China National Convention Center (CNCC) in Beijing on July 5. This inaugural event will convene Baidu executives and engineers, as well as developers and experts from across the AI industry.

The one-day event, expected to draw more than 4,000 participants, will include keynote speeches by Baidu’s executives and AI leadership, including Chairman and CEO Robin Li and Group President and COO Qi Lu, who will speak on corporate vision, technology breakthroughs, new partnerships, and the open platforms Baidu has available to developers.

Afternoon breakout sessions will feature Baidu’s AI leaders and scientists who will discuss the company’s latest technical progress across the following themes: AI technology and open platforms, conversational devices, intelligent driving, cloud, web ecosystems, and data centers.

At the conference, Baidu is expected to unveil several exciting announcements, including key partnerships for its recently announced open autonomous driving platform, Project Apollo, and a roadmap to open its technology and capabilities to partners and developers. With open-sourced platforms like Project Apollo, Baidu aims to build a collaborative ecosystem with the goal to accelerate the development and popularization of AI applications.

The livestream for the conference will be available with English audio beginning at 10:00 am Beijing Time on July 5 (7:00 pm PDT on July 4) on the event website. For more information on the event agenda and logistics, please visit http://create.baidu.com.

About Baidu

Baidu, Inc. is the leading Chinese language Internet search provider. Baidu aims to make a complicated world simpler for users and enterprises through technology. Baidu’s ADSs trade on the NASDAQ Global Select Market under the symbol “BIDU.” Currently, ten ADSs represent one Class A ordinary share.

Source: Baidu


Nallatech, A Molex Company, Joins Dell EMC Technology Partner Program

Fri, 06/30/2017 - 07:55

LISLE, Ill., June 30, 2017 – Nallatech, a Molex company, recently announced its official membership in the multi-tier Dell EMC Technology Partner Program that includes ISVs, IHVs and Solution Providers.  Nallatech provides hardware, software and design services to enable high performance computing (HPC), network processing, and real-time embedded computing in datacenters.  Dell EMC has approved and validated a range of Nallatech FPGA-based products designed for HPC applications for integration into several Dell Server platforms.

“The Dell EMC Technology Partner Program helps de-risk the qualification, deployment and operation of heterogeneous platforms featuring FPGA accelerators,” says Craig Petrie, VP Business Development FPGA Solutions, Nallatech. “For our customers, this translates into a simplified roll-out of advanced datacenter and private cloud platforms at a lower level of investment”.

The global Dell EMC Technology Partner Program builds innovative and competitive business solutions using Dell EMC platforms.  The program targets resources to help keep customer costs low and sustain competitiveness.  A structured and streamlined process combines technology and business strategies with Dell EMC Solution Center expertise to onboard and test partner products.  Rigorous testing helps to ensure that Nallatech solutions have met the technical requirements to optimize performance on Dell EMC platforms.

Nallatech FPGA Accelerated Compute Node datacenter solutions drive some of the most demanding HPC, data visualization and rendering workloads. The company has deployed several of the largest FPGA hybrid compute clusters.  In 2016, Nallatech released a range of Altera Arria 10 based boards for computational acceleration.  The new product family integrates a powerhouse of technologies designed to create a reliable, extremely fast and scalable server ideal for the HPC environment.  Assuring quick delivery and deployment, the optimized Nallatech 385A and 510T FPGA OpenCL Accelerators are among the products certified by the Dell EMC Technology Partner Program to run on Dell EMC platforms.

For more information about Nallatech and Dell, please visit http://www.nallatech.com/dell-technology-partner-program.

About Nallatech

Nallatech is a leading supplier of FPGA accelerated computing solutions. Since 1993, Nallatech has provided hardware, software and design services to enable customer’s success in applications including high performance computing, network processing, and real-time embedded computing.  For more information, please visit www.nallatech.com.

About Molex

Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, medical, industrial, automotive, and commercial vehicle. For more information, please visit www.molex.com.

Source: Molex


AI End Game: The Automation of All Work

Thu, 06/29/2017 - 09:02

Last week we reported from ISC on an emerging type of high performance system architecture that integrates HPC and HPA (High Performance Analytics) and incorporates, at its center, exabyte-scale memory capacity, surrounded by a variety of accelerated processors. Until the arrival of quantum computing or other new computing paradigm, this is the architecture that could enable the “scalable learning machine.” It will handle and be trained on, literally, decades of accumulated data and foster Advanced AI, the ultimate manifestation of the Big Data of today and the highest form of AI. It will co-exist with, augment, guide and, soon enough, outperform humans.

This week, we’re reporting on a startling, scholarly white paper recently issued by researchers from Yale, the Future of Humanity Institute at Oxford and the AI Impacts think tank that adumbrates the AI world to come.

The white paper – “When Will AI Exceed Human Performance?”, based on a global survey of 352 AI experts – reinforces the truism that technology is always at a primitive stage. Impressive as current Big Data and machine learning innovations are, they are embryonic compared with Advanced AI in the decades to come.

High-Level Machine Intelligence (HLMI) will transform life as we know it. According to the study, it’s not just conceivable but likely that all human work will be automated within 120 years, and many specific jobs much sooner.

Some of the study’s findings are not surprising. We know, for example, that jobs like truck driving are limited. But the study predicts that a surprising number of occupations considered to be “value add,” creative and nonfungible, such as producing a Top 40 pop song, will be done by machines within 10 years. (Actually, truck driving also will be automated within 10 years, the survey found, which may say less about truck driving than it does pop music.)

The study asked respondents to forecast automation milestones for 32 tasks and occupations, 20 of which, they predict, will happen within 10 years. Some of the more interesting findings: language translator: seven years; retail salesperson: 12 years; writing a New York Times bestseller and performing surgery: approximately 35 years; conducting math research: 45 years.

The researchers point to two watersheds in the AI revolution that will have profound impact. The first is the attainment of HLMI, “achieved when unaided machines can accomplish every task better and more cheaply than human workers.”

The researchers reported that the “aggregate forecast” gave a 50 percent chance for HLMI to occur within 45 years (and a 10 percent chance within eight years). Interestingly, respondents from Asia are more sanguine about the HLMI timeframe than those from other regions – Asian respondents expect HLMI within about 30 years, whereas North Americans expect it in 75 years.

AI research will come under the power of HLMI within 90 years, and this in turn could contribute to the second major watershed, what the AI community calls an “intelligence explosion.” This is defined as AI performing “vastly better than humans in all tasks,” a rapid acceleration in AI machine capabilities.

All of this, of course, has incredibly potent social and economic implications, which is why the researchers conducted the study.

“Advances in artificial intelligence (AI) will transform modern life by reshaping transportation, health, science, finance, and the military,” the study’s authors state. “To adapt public policy, we need to better anticipate these advances…

“Self-driving technology might replace millions of driving jobs over the coming decade,” they continue. “In addition to possible unemployment, the transition will bring new challenges, such as rebuilding infrastructure, protecting vehicle cyber-security, and adapting laws and regulations. New challenges, both for AI developers and policy-makers, will also arise from applications in law enforcement, military technology, and marketing.”

In preparing the survey, the authors targeted all researchers who published at the 2015 Neural Information Processing Systems (NIPS) and International Conference on Machine Learning (ICML) conferences, two prominent peer-reviewed machine learning research conclaves. Of the 1,634 authors contacted, 21 percent responded to the survey.

Among the issues explored by the study:

“Will progress in AI become explosively fast once AI research and development itself can be automated? How will HLMI affect economic growth? What are the chances this will lead to extreme outcomes (either positive or negative)? What should be done to help ensure AI progress is beneficial?”

While the authors state there will be “profound social consequences if all tasks were more cost effectively accomplished by machines,” the weight of survey sentiment about the ultimate impact of HLMI is optimistic. The median probability for a “good” or “extremely good” outcome was 45 percent, whereas the probability was 15 percent for a “bad” or “extremely bad” outcome (such as human extinction).

The study did not delve into details of what those outcomes – good or bad – will look like, but the authors stated that “Society should prioritize research aimed at minimizing the potential risks of AI.”


UCAR Deploys ADVA FSP 3000 CloudConnect in Supercomputing Network

Thu, 06/29/2017 - 08:58

BOULDER, Colo., June 29, 2017 — ADVA Optical Networking announced today that the University Corporation for Atmospheric Research (UCAR) has deployed its FSP 3000 CloudConnect data center interconnect (DCI) solution for ultra-high capacity connectivity to the Cheyenne supercomputer. The DCI technology is now being used to transport vital scientific data over two 200Gbit/s 16QAM connections between the NCAR-Wyoming Supercomputing Center in Cheyenne, Wyoming and the Front Range GigaPop in Denver, Colorado. With improved flexibility and increased capacity, the new network will help UCAR expand educational opportunities, enable collaboration and promote research excellence.

“This new network helps researchers to analyze large data sets and conduct complex experiments,” said Anke Kamrath, director, operations and services, NCAR’s Computational and Information Systems Laboratory, UCAR. “Cheyenne is one of the most powerful high-performance computers for the geosciences. It’s vital to advancing the understanding of weather, water and climate. This upgraded network will help us amplify the scientific work of more than 100 universities, educate the public and inspire the next generation of scientists.”

The ADVA FSP 3000 CloudConnect platform gives UCAR the ability to maximize the throughput of its optical layer and offers scalability for future growth. While incorporating advanced technology, it remains simple to operate, reducing UCAR’s operational complexity. The ADVA FSP 3000 CloudConnect solution is a truly open DCI platform with no lock-ins or restrictions. It also meets density, security and energy demands, helping UCAR to accelerate the pace of scientific discovery. As a preeminent institution for atmospheric research, UCAR will leverage the new capabilities and efficiencies to offer the scientific community even more access to computing and data analysis platforms. Over 100 universities and research centers in the UCAR consortium will benefit from improved access to Cheyenne, helping them to better understand the atmosphere in ways that can help safeguard society.

“We’ve developed a close relationship with UCAR over many years. That’s why it’s been an honor to help them unleash the full potential of their state-of-the-art supercomputer. And it’s exciting to achieve this with our FSP 3000 CloudConnect solution – the pinnacle of all our innovation and expertise,” commented John Scherzinger, senior VP, sales, North America, ADVA Optical Networking. “Our team is passionate about efficiency, especially when it comes to energy. The DCI solution we’ve created here will deliver UCAR significant savings in terms of price, power and space. Not only is that great news for UCAR’s operating costs – it also helps protect the environment.”

Watch this video for more information on the ADVA FSP 3000 CloudConnect: https://youtu.be/nyG4S-e0qgI.

Source: ADVA


UWaterloo, Ciena Research Drives Advancements in Internet Connectivity

Thu, 06/29/2017 - 08:56

WATERLOO, Ont., June 29, 2017 — Engineering researchers at the University of Waterloo are working with Ciena to find solutions to help network operators and Internet providers respond to the insatiable demand for faster and faster data transmission over the Internet.

A key area of Waterloo’s partnership with Ciena involves squeezing as much capacity as possible out of fibre optic cables that run under the world’s oceans and handle upwards of 95 per cent of all intercontinental communications, including $10 trillion a day in financial transactions.

The reliable, high-speed transmission of vast amounts of information via undersea cables is increasingly important in fields including healthcare and academic research, as well as for consumer demand for quality high-speed Internet service on cell phones.

The research relationship received funding support from the Natural Sciences and Engineering Research Council of Canada (NSERC).

“Waterloo’s strong ties to industry help drive innovation and fuel our economy,” said Feridun Hamdullahpur, president and vice-chancellor of Waterloo. “This partnership with Ciena, possible with support from NSERC, illustrates the tangible, significant results possible when you combine the brilliant minds of Waterloo researchers with the needs and resources of industry.” 

Amir Khandani, a professor of electrical and computer engineering at Waterloo, leads a team of post-doctoral fellows and graduate students developing algorithms to efficiently and rapidly correct errors – essentially lost or dropped bits of data — that occur during extremely high-speed, long-distance transmission.

“Professor Khandani and his team, along with their partners at Ciena, have formed an impressive platform for multi-year collaboration that will help tackle the challenges of the next generation of optical telecommunication networks,” said B. Mario Pinto, President, NSERC. “This Industrial Research Chair will also provide the next generation of top research talent with rich training opportunities in highly innovative environments.”

“What matters a great deal to us is knowing that what we do is being used and is benefitting people,” said Khandani, whose team at any given time includes about eight post-doctoral fellows and graduate students. “Seeing our work have a direct and immediate impact is very rewarding.”

Included on electronic chips that are built into equipment for receiving and transmitting data, the algorithms developed by the Waterloo team free up cable capacity while also enabling the correction of errors to keep pace with other technological advances.

“The thirst for additional bandwidth and capacity is unquenchable,” said Rodney Wilson, Senior Director for External Research at Ciena. “It’s a battle of tiny, tiny increments. When you add them all up, it creates market-leading innovations. Working with the University of Waterloo gives us a competitive edge.”

Under the three-year partnership, announced at an event at Waterloo today, Khandani holds the position of Ciena/NSERC Industrial Research Chair on Network Information Theory of Optical Channels. The relationship between Waterloo Engineering and Ciena has already produced seven U.S. patents, with more pending. Many of the student researchers now occupy full-time positions with the company.

 About the University of Waterloo

University of Waterloo is Canada’s top innovation university. With more than 36,000 students we are home to the world’s largest co-operative education system of its kind. Our unmatched entrepreneurial culture, combined with an intensive focus on research, powers one of the top innovation hubs in the world. http://www.uwaterloo.ca/

Source: University of Waterloo


Reinders: “AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable Processors

Thu, 06/29/2017 - 08:32

Imagine if we could use vector processing on something other than just floating point problems.  Today, GPUs and CPUs work tirelessly to accelerate algorithms based on floating point (FP) numbers. Algorithms can definitely benefit from basing their mathematics on bits and integers (bytes, words) if we could just accelerate them too. FPGAs can do this, but the hardware and software costs remain very high. GPUs aren’t designed to operate on non-FP data. Intel AVX introduced some support, and now Intel AVX-512 is bringing a great deal of flexibility to processors. I will share why I’m convinced that the “AVX512VL” capability in particular is a hidden gem that will let AVX-512 be much more useful for compilers and developers alike.

Fortunately for software developers, Intel has done a poor job keeping the “secret” that AVX-512 is coming to Intel’s recently announced Xeon Scalable processor line very soon. Amazon Web Services has publicly touted AVX-512 on Skylake as coming soon!

It is timely to examine the new AVX-512 capabilities and their potential impact beyond the usual HPC need for floating-point-only workloads. The hidden gem in all this, which makes the shift to AVX-512 easier, is the “VL” (vector length) extensions, which allow AVX-512 instructions to behave like SSE or AVX/AVX2 instructions when that suits us. This is a clever and powerful addition that enables adoption in a wider assortment of software more quickly. The VL extensions mean that programmers (and compilers) do not need to shift immediately from 256 bits (AVX/AVX2) to 512 bits to use the new bit/byte/word manipulations. This transitional benefit is useful not only for an interim period, but also for applications which find 256 bits more natural (perhaps a small, but important, subset of problems).

The “future Xeon processor” extensions for AVX-512 were first announced in mid-2014. Intel has done the right things to make software ready for it (for instance we can google for “gcc SKX” – where SKX is widely speculated to be a contraction of “Skylake” and “Xeon”).  For several years now, the gcc, Intel, Microsoft, clang and ispc compilers have supported these instructions. Also, some open source software such as embree and mkl-dnn include support. We can create programs today that use these new instructions, and run them easily using Intel’s Software Development Emulator (SDE). Finally, after years of “leaking via software patches,” in May Intel officially announced that AVX-512 will be in the highly anticipated Skylake/Purley platform as the first processor in the new Intel Xeon Processor Scalable Family (successor to the Xeon E5 and E7 product lines).

Vectorizing more than just Floating Point
The net effect: when Intel Xeon processors support AVX-512 we will have exciting new capabilities that extend the obvious use of AVX-512 for HPC and AI/ML/HPDA workloads to offer flexibility perfect for vectorization needs that include integer and bit-oriented data types as well as the strong floating-point support that first appeared with AVX-512 on Intel Xeon Phi processors. While most HPC and AI/ML/HPDA workloads lean on floating point today, there is plenty of reason to believe that algorithm innovations can benefit from integer and bit-oriented data types when hardware acceleration is available. This makes these AVX-512 developments very exciting!

AVX introduced some bit-manipulation and byte/word capabilities. AVX-512 expands greatly on these capabilities. It’s the “VL” (vector length) extensions that tie SSE/AVX/AVX2/AVX-512 together in a clever way, making these new capabilities easier to adopt for programmers, tools and compilers.

For those exploring new algorithms for AI, including machine learning and HPDA workloads, the inclusion of bit, byte, and word operations alongside floating point opens up exciting new possibilities. The vector length extensions make them immediately useful to SSE/AVX/AVX2 programmers without forcing an immediate shift to 512 bits.

Three Highlights in the extended AVX-512
Intel documentation and the CPUID enabling bits divide the “SKX” extensions into AVX512DQ, AVX512BW, and AVX512VL.

  • AVX512DQ is vector support for Double-word and Quad-word integers, also commonly thought of as int32/int and int64/long. In addition to integer arithmetic and bitwise operations, there are instructions for conversions to/from floating-point vectors. Masking is supported down to the byte level which offers amazing flexibility in using these instructions.
  • AVX512BW is vector support for Byte (half-words) and Words, also commonly thought of as char/int8 and short/int16. A rich set of instructions for integer arithmetic and bitwise operations are offered. Masking is supported down to the byte level with AVX512BW as well.
  • AVX512VL ups the ante enormously, by making almost all of the AVX512 instructions available as SSE and AVX instructions, but with a full 32 register capability (at least double the registers that SSE or AVX instructions have to offer). AVX512VL is not actually a set of new instructions. It is an orthogonal feature that applies to nearly all AVX-512 instructions (the exceptions make sense – a few AVX512F and AVX512DQ instructions with implicit 256 or 128 bit widths such as those explicitly working on 32×4 and 64×2 data).

The trend of vector instructions growing to longer and longer lengths has not been without disadvantages and difficulties for programmers. While longer vectors are often a great thing for many supercomputer applications, they are harder to use on compute problems that do not always have long vectors to process. The VL extensions bring a flexibility to Intel’s AVX-512 that broadens its applicability.

AVX-512 instructions offer a rich collection of operations, but they have always operated on the 512-bit (ZMM) registers. The downside of a 512-bit register is that using 512-bit instructions and registers for 256- or 128-bit operations wastes bandwidth and power. Well-optimized code (such as that emitted by compilers, or by experienced intrinsics or assembly programmers) would therefore need a careful mix of SSE (128-bit SIMD) and AVX (256-bit SIMD). Doing this is clumsy at best, and complicated by a performance penalty when mixing SSE and AVX code (fortunately there is no such penalty when mixing AVX and AVX-512 instructions).

The VL extension enables AVX-512 instructions to operate on XMM (128-bit) and YMM (256-bit) registers, rather than being limited to the full ZMM registers. This symmetry is definitely good news. AVX-512 with the VL extension seems well set to become the programming option of choice for compilers and hand-coders, because it unifies so many capabilities and gives access to 32 vector registers regardless of their size (XMM, YMM or ZMM).

More AVX-512 features after Skylake?
There is further evidence in the open source enabling work of additional instructions coming after Skylake. Intel has placed documentation of these into its Software Developer Guides, and has helped add support to open source projects. Again, this is all good news for software developers because it means software tools do not need to lag the hardware. The four categories of instructions that Intel has documented thus far, but which are not enabled by the compiler options for “KNL” (Intel Xeon Phi processors) or “SKX” (Skylake Xeon processors), are:

  • AVX512IFMA: 2 instructions for high/low result of Fused Multiply-Add for 2/4/8-element vectors of 52-bit integers stored in 64-bit fields of 128/256/512-bit vectors.
  • AVX512VBMI: Byte-level vector permute (on 128/256/512/1024-bit vectors) and a select+pack instruction — 4 instructions (with multiple argument types)
  • AVX512_4VNNIW: Vector instructions for deep learning enhanced word variable precision.
  • AVX512_4FMAPS: Vector instructions for deep learning floating-point single precision.

Speculation, supported by open source documentation, says that the first two will appear in a Xeon processor after Skylake, and the latter two will appear in a future Intel Xeon Phi processor (Knights Mill is commonly suggested).

Regardless of how perfect my speculations are on timing, it is clear that Intel is investing heavily in their expansion of vector capabilities to much more than floating-point operations.  This gives a much-expanded capability for algorithms to be developed with a much wider variety of arithmetic and bitwise operations than floating-point alone can offer.  Who will take advantage of these remains to be seen – but ML/AI and cryptographic programmers seem to be obvious candidates.

With Skylake-architecture based Intel Xeon processors coming soon, it is a great time for programmers to take a closer look.  The instructions are already well supported in tools, and the Intel Software Development Emulator (SDE) makes it easy to run the instructions today.

For More Information
I recommend the following sites for more detailed information:

  • Intel’s online guide to AVX-512 instructions as they are best accessed in C/C++ (intrinsics) has a detailed guide (click on instructions to expand) for AVX-512 instructions.
  • The Intel Software Development Emulator (SDE) allows us to run programs using these Intel AVX-512 instructions on our current x86 systems. The SDE runs code via emulation, which is accurate but slower than on hardware with support for the instructions built-in.
  • Intel documentation is massive. The complete list of “Intel® 64 and IA-32 Architectures Software Developer Manuals” covers everything; the specific documents for learning about AVX512VL are the “Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4” and the “Intel® architecture instruction set extensions programming reference.” Normally, information in the latter document migrates to the former when support appears in a shipping product, so eventually everything about VL will be in the first document. I did not link the specific documents because their links change; we can find them on the main documentation page and click there to get the latest versions. I scroll through them by searching for “AVX512VL” in my PDF viewer and moving from match to match.

About the Author
James Reinders likes fast computers and the software tools to make them speedy. Last year, James concluded a 10,001 day career at Intel where he contributed to projects including the world’s first TeraFLOPS supercomputer (ASCI Red), compilers and architecture work for a number of Intel processors and parallel systems. James is the founding editor of The Parallel Universe magazine and has been the driving force behind books on VTune (2005), TBB (2007), Structured Parallel Programming (2012), Intel Xeon Phi coprocessor programming (2013), Multithreading for Visual Effects (2014), High Performance Parallelism Pearls Volume One (2014) and Volume Two (2015), and Intel Xeon Phi processor (2016). James resides in Oregon, where he enjoys both gardening and HPC and HPDA consulting.

The post Reinders: “AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable Processors appeared first on HPCwire.

MareNostrum 4 Begins Operation

Thu, 06/29/2017 - 05:16

BARCELONA, June 29 — The MareNostrum supercomputer is beginning operation and will start executing applications for scientific research. MareNostrum 4, hosted by Barcelona Supercomputing Center, is entirely aimed at generating scientific knowledge and its computer architecture has been called ‘the most diverse and interesting in the world’ by international experts. The Spanish Ministry of Economy, Industry and Competitiveness has funded the purchase of the supercomputer, whose installation cost €34 million in total.

11.1 Petaflops of processing power

MareNostrum will provide 11.1 Petaflops of processing power – that is, the capacity to perform 11.1 x 10^15 operations per second – to scientific production. This is the capacity of the general-purpose cluster, the largest and most powerful part of the supercomputer, and it will be increased by the installation of three new, smaller-scale clusters featuring emerging technologies over the next few months. The capacity of 11.1 Petaflops is 10 times greater than that of MareNostrum 3, which was installed between 2012 and 2013.

According to the Top500 ranking published on 19 June, the MareNostrum 4 supercomputer’s general-purpose cluster is the third most powerful in Europe and the thirteenth in the world. The Top500 list ranks supercomputers by how quickly they execute the High Performance Linpack (HPL) benchmark.

A tool of great value for science

Supercomputers are used for basic and applied research thanks to their ability to perform large calculations, execute large simulations and analyse large amounts of data. Today, they are used in almost all scientific disciplines, from astrophysics and materials physics to biomedicine, and are used in engineering and industry.

During its first four months of operation, MareNostrum 4 will be used for research projects on climate change, gravitational waves, a vaccination against AIDS, new radiation treatments to fight cancer and simulations relating to the production of fusion energy, among other areas.

Access via scientific committees

MareNostrum 4 is available to all scientists in Europe via a selection process managed by scientific committees. For the chance to use the supercomputer, researchers must submit a request to the Spanish Supercomputing Network (RES, according to its initials in Spanish) – which provides access to 16% of the computing hours available on the machine – or to PRACE (the Partnership for Advanced Computing in Europe) – which manages access to 80% of its computing hours. The remaining 4% is reserved for use by BSC researchers. The MareNostrum 4 supercomputer is designated as a Special Scientific/Technical Infrastructure Facility by the Spanish Ministry of Economy, Industry and Competitiveness.
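
The published access split can be sketched as a simple allocation rule. A minimal illustration (the percentage shares come from the announcement; the total pool of hours and the function name are invented for the example):

```python
# Shares of MareNostrum 4 compute hours, per the announcement (percent).
SHARES = {"RES": 16, "PRACE": 80, "BSC": 4}

def allocate(total_hours):
    """Split a pool of compute hours according to the published shares."""
    assert sum(SHARES.values()) == 100  # the three shares cover the machine
    return {body: total_hours * pct // 100 for body, pct in SHARES.items()}

# A notional pool of 1,000,000 core-hours:
print(allocate(1_000_000))  # {'RES': 160000, 'PRACE': 800000, 'BSC': 40000}
```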

Barcelona Supercomputing Center

Barcelona Supercomputing Center is the leading supercomputing centre in Spain. It specialises in High Performance Computing and its mission is twofold: to offer supercomputing facilities and services to Spanish and European scientists, and to create knowledge and technology to be transferred to society.

Barcelona Supercomputing Center employs 500 staff, of whom 27 form part of the Operations Department, which manages the supercomputer, and 400 work in research across a wide range of areas. The Computer Sciences Department, which works to influence how future supercomputers will be built, programmed and used, is the centre’s largest department. Research is also carried out in the fields of personalised medicine and drug discovery, as well as climate change, air quality and engineering.

BSC is a Severo Ochoa Centre of Excellence and a leadership-level (Tier-0) member of the PRACE infrastructure, as well as managing the Spanish Supercomputing Network. It was created in 2005 and is a consortium formed by the Spanish Government Ministry of Economy, Industry and Competitiveness (60%), the Catalan Government Department of Enterprise and Knowledge (30%) and the Universitat Politècnica de Catalunya (UPC) (10%).

MareNostrum 4: technical summary

MareNostrum 4 has been dubbed the most interesting supercomputer in the world thanks to the heterogeneity of the architecture it will include once installation is complete. Its total speed will be 13.7 Petaflops. The supercomputer comprises two separate parts: a general-purpose block and a block featuring emerging technologies. It has 5 storage racks with the capacity to store 14 Petabytes (14 million Gigabytes) of data. A high-speed Omni-Path network connects all the components of the supercomputer to one another.

The general-purpose block has 48 racks with 3,456 nodes. Each node has two Intel Xeon Platinum chips, each with 24 cores, amounting to a total of 165,888 cores and a main memory of 390 Terabytes. Its peak performance is 11.15 Petaflops. While its performance is 10 times greater than that of its predecessor, MareNostrum 3, its energy consumption will only increase by 30%, to 1.3 MW.
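
The stated peak is consistent with these node counts. A quick back-of-the-envelope check (the 2.1 GHz clock and 32 double-precision FLOPs per core per cycle are assumptions typical of Skylake-generation Xeon Platinum chips with two AVX-512 FMA units, not figures from the announcement):

```python
# Core count from the node configuration described above.
nodes, chips_per_node, cores_per_chip = 3_456, 2, 24
cores = nodes * chips_per_node * cores_per_chip
assert cores == 165_888  # matches the stated total

# Assumed per-core figures (not from the announcement):
ghz = 2.1               # base clock of Skylake-era Xeon Platinum parts
flops_per_cycle = 32    # two AVX-512 FMA units, double precision
peak_pflops = cores * ghz * 1e9 * flops_per_cycle / 1e15
print(f"{peak_pflops:.2f} Pflop/s")  # ~11.15, the stated peak performance
```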

The block of emerging technologies is formed of clusters of three different technologies, which will be incorporated and updated as they become available on the market. These technologies are currently being developed in the United States and Japan to speed up the arrival of the new generation of pre-exascale supercomputers. They are as follows:

·     Cluster comprising IBM POWER9 and NVIDIA Volta GPUs, with a computational capacity of over 1.5 Petaflops. IBM and NVIDIA will use these processors for the Summit and Sierra supercomputers that the US Department of Energy has ordered for its Oak Ridge and Lawrence Livermore National Laboratories.

·     Cluster formed of Intel Knights Hill (KNH) processors, with a computational capacity of over 0.5 Petaflops. These are the processors planned for the Aurora supercomputer that the US Department of Energy has ordered for the Argonne National Laboratory.

·     Cluster composed of 64-bit ARMv8 processors in a prototype machine with a computational capacity of over 0.5 Petaflops. This cluster will use the cutting-edge technology of the Japanese supercomputer Post-K.

The aim of gradually incorporating these emerging technologies into MareNostrum 4 is to allow BSC to experiment with what are expected to be the most advanced technological developments over the next few years and evaluate their suitability for future iterations of MareNostrum.

MareNostrum 4 has a disk storage capacity of 14 Petabytes and is connected to BSC’s big data facilities, which have a total capacity of 24.6 Petabytes. Like its predecessors, MareNostrum 4 will also be connected to European research centres and universities via the RedIRIS and GÉANT networks.

Links to videos and photos (available until 10 July)

Link to MareNostrum 4: Replacing MareNostrum 3 with MareNostrum 4 timelapse video: bsc.es/MN4-timelapse

Link to MareNostrum 4, technical specification video: bsc.es/MN4-sketch

Photos of MareNostrum 4 in high and low resolution: bsc.es/MN4-fotos

Source: Barcelona Supercomputing Center

The post MareNostrum 4 Begins Operation appeared first on HPCwire.

SPEC/HPG Hardware Acceleration Benchmark Adds OpenMP Suite

Wed, 06/28/2017 - 14:24

GAINESVILLE, Va., June 28, 2017 – SPEC’s High-Performance Group (SPEC/HPG) has released a new version of its SPEC ACCEL software that adds a suite of OpenMP applications for measuring the performance of systems using hardware accelerator devices and supporting software. SPEC ACCEL also measures performance for computationally intensive parallel applications running under the OpenCL and OpenACC programming models.

Another major update in SPEC ACCEL 1.2 allows users to add and change directives within the OpenMP and OpenACC suites to expose more parallelism for peak performance testing. SPEC/HPG also revised applications in the OpenACC suite so that they compile successfully with GNU compilers, which makes results no longer comparable with those from versions 1.0 or 1.1.

Broader benchmarking scope

SPEC ACCEL 1.2 exercises the performance of the accelerator, host CPU, memory transfer between host and accelerator, support libraries and drivers, and compilers. The new OpenMP suite contains the same applications and datasets as the OpenACC suite, but results are not directly comparable, since the benchmarks use different reference systems and in some cases different parallelization constructs.
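
SPEC suite metrics are, broadly, geometric means of per-benchmark run-time ratios against a fixed reference system, which is why suites measured against different reference systems cannot be compared directly. A sketch of that calculation (the run times are invented, and this ignores SPEC's exact reportable-result rules):

```python
from math import prod

def suite_score(ref_seconds, measured_seconds):
    """Geometric mean of per-benchmark (reference / measured) run-time
    ratios, the general shape of a SPEC suite metric."""
    ratios = [r / m for r, m in zip(ref_seconds, measured_seconds)]
    return prod(ratios) ** (1 / len(ratios))

# Invented run times for a three-benchmark suite (seconds):
reference = [100.0, 200.0, 400.0]
measured = [50.0, 100.0, 100.0]   # this system is 2x, 2x, and 4x faster
print(round(suite_score(reference, measured), 2))  # prints 2.52
```

A different reference machine changes every ratio, so the same measured run times yield a different score.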

Vendors can use SPEC ACCEL to improve performance of systems that include accelerator devices. Users can employ the software to make buying and configuration decisions. Researchers can use it to assess the ramifications of new technologies on performance.

“The OpenMP application benchmarks are the first of their kind and now give our customers the opportunity to compare hardware configurations based on the most popular open-programming models,” says Guido Juckeland, SPEC/HPG vice chair. “We look forward to a wide variety of SPEC ACCEL result submissions on the SPEC website and a number of research papers comparing various optimization settings on multiple platforms.”

SPEC ACCEL 1.2 comprises 19 application benchmarks running under OpenCL and 15 each under OpenACC and OpenMP.  The OpenCL suite is derived from the Parboil benchmark developed by the IMPACT Research Group of the University of Illinois at Urbana-Champaign and the Rodinia benchmark from the University of Virginia. The OpenACC and OpenMP suites include tests from NAS Parallel Benchmarks (NPB), SPEC OMP2012, and others derived from high-performance computing (HPC) applications.

SPEC ACCEL 1.2 also contains the latest version of SPEC PTDaemon, which enables power measurements while the benchmark is running, providing a separate metric for energy efficiency.

SPEC/HPG members involved in SPEC ACCEL 1.2 development include AMD, HPE, IBM, Intel, Nvidia and Oracle. SPEC/HPG Associates include Argonne National Laboratory; Helmholtz-Zentrum Dresden-Rossendorf; Indiana University; Oak Ridge National Laboratory; RWTH Aachen University; Technische Universitat Dresden, ZIH; and University of Delaware.

Available immediately

SPEC ACCEL 1.2 is available for immediate download on the SPEC website.  The benchmark suite is $2,000 for non-members and $800 for qualified non-profit and not-for-profit organizations. Existing license holders receive a free upgrade. For more information, visit http://www.spec.org/accel/.

About SPEC

SPEC is a non-profit organization that establishes, maintains and endorses standardized benchmarks and tools to evaluate performance and energy consumption for the newest generation of computing systems. Its membership comprises more than 120 leading computer hardware and software vendors, educational institutions, research organizations, and government agencies worldwide.

Source: SPEC

The post SPEC/HPG Hardware Acceleration Benchmark Adds OpenMP Suite appeared first on HPCwire.

InfiniBand Continues to Lead TOP500 as Interconnect of Choice for HPC

Wed, 06/28/2017 - 14:21

BEAVERTON, Ore., June 28, 2017 – The InfiniBand Trade Association (IBTA), a global organization dedicated to maintaining and furthering the InfiniBand specification, today shared the latest TOP500 List results, which reveal that InfiniBand remains the most used High Performance Computing (HPC) interconnect. Additionally, the majority of newly listed TOP500 supercomputers are accelerated by InfiniBand technology. These results reflect continued industry demand for InfiniBand’s unparalleled combination of network bandwidth, low latency, scalability and efficiency.

InfiniBand connects 60 percent of the HPC systems on the list as well as 48 percent of the Petascale systems. Furthermore, over half of the new HPC systems that were added to the TOP500 List leverage InfiniBand, 2-3 times more compared to proprietary options. Systems featured in the list include the new Artificial Intelligence (AI) supercomputer from Facebook, which underscores the impact that InfiniBand is having on Deep Learning and AI applications that require high speed performance and compute efficiency.

InfiniBand EDR 100 Gb/s, the fastest interconnect technology currently available, grew 2.5 times compared to the November list published six months ago. One of these new systems achieved 94 percent efficiency, the highest among the newly added systems.

“As demonstrated on the June 2017 TOP500 supercomputer list, InfiniBand is the high-performance interconnect of choice for HPC and Deep Learning platforms,” said Bill Lee, IBTA Marketing Working Group Co-Chair. “The key capabilities of RDMA, software-defined architecture, and the smart accelerations that the InfiniBand providers have brought with their offering resulted in enabling world-leading performance and scalability for InfiniBand-connected supercomputers.”

As HPC applications continue to evolve, especially in the case of AI and deep learning, system architects can rely on InfiniBand for unmatched network capabilities – both today and well into the future as evident in the IBTA Roadmap.

The TOP500 List (www.top500.org) is published twice per year and ranks the top supercomputers worldwide based on the LINPACK benchmark, providing valuable statistics for tracking trends in system performance and architecture.

About the InfiniBand Trade Association

The InfiniBand Trade Association was founded in 1999 and is chartered with maintaining and furthering the InfiniBand and the RoCE specifications. The IBTA is led by a distinguished steering committee that includes Broadcom, Cray, HPE, IBM, Intel, Mellanox Technologies, Microsoft, Oracle and QLogic. Other members of the IBTA represent leading enterprise IT vendors who are actively contributing to the advancement of the InfiniBand and RoCE specifications. The IBTA markets and promotes InfiniBand and RoCE from an industry perspective through online, marketing and public relations engagements, and unites the industry through IBTA-sponsored technical events and resources. For more information on the IBTA, visit www.infinibandta.org.

Source: InfiniBand Trade Association

The post InfiniBand Continues to Lead TOP500 as Interconnect of Choice for HPC appeared first on HPCwire.

TACC Supercomputers Design, Test New Tools for Cancer Detection

Wed, 06/28/2017 - 14:06

AUSTIN, Texas, June 28, 2017 — An important factor in fighting cancer is the speed at which the disease can be identified, diagnosed and treated.

The current standard involves a patient feeling ill or a physician seeing signs of a tumor. These indicators lead to more precise diagnoses via blood tests, x-rays or MRI imaging. But once the disease is far enough along to be noticeable, the cancer has often spread.

In the future, though, it may be possible to diagnose cancer much earlier using more sensitive body scans, new types of biomarker tests, and even nano-sensors working in the bloodstream.

Experimenting with these techniques in cancer patients or healthy individuals is difficult and potentially unethical. But scientists can test these technologies virtually using supercomputers to simulate the dynamics of cells and tissues.

Building a Better Breast Cancer Early Detection System 

Manual breast exams and mammograms are currently the most effective and widely used techniques for early detection of breast cancer. Unfortunately, manual breast exams are limited in their ability to detect tumors since they only produce local information about the site where the force is applied.

Mammograms (breast x-rays), on the other hand, are more accurate, but expose patients to radiation. Importantly, they do not quantify tissue stiffness, an identifying characteristic of breast tumors. They also produce many false positives, resulting in painful biopsies.

System used to collect data from tissue phantoms for a new breast cancer diagnostic system. Data collected by the device is computationally modeled to identify possible tumors. [Courtesy: Lorraine Olson, Robert Throne, Adam Nolte, Rose-Hulman Institute of Technology]

Lorraine Olson, a professor of mechanical engineering at Rose-Hulman Institute of Technology, is collaborating with colleagues Robert Throne of Electrical and Computer Engineering and Adam Nolte of Chemical Engineering to develop an electro-mechanical device that gently indents breast tissue in various locations and records the tissue surface deflections. This data is then converted into detailed 3-D maps of breast tissue stiffness, which can then be used to identify suspicious (stiffer) sites for further testing.

“The research takes an approach to early detection of breast cancer that utilizes a fundamental mechanical difference between cancerous and noncancerous tissue,” Olson said. “Although this stiffness difference is the basis of manual breast exams, it has not been systematically investigated from an engineering point of view.”

Olson and her team’s approach to determining the relationship between stiffness and interior mapping involves a combination of finite element methods — a numerical method for solving problems in engineering and mathematical physics — and genetic algorithms — a method for solving optimization problems based on natural selection.

Paired together, they can map the distribution of stiffness in a given tissue and systematically use “guesses and checks” to find which tissue stiffness map best models the response they actually see in testing.

The process involves thousands of these “guesses” and therefore requires powerful supercomputers like Stampede at the Texas Advanced Computing Center (TACC), one of the fastest systems in the world.
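
The “guess and check” loop can be sketched in miniature. In this toy version, everything is invented for illustration: a single stiffness parameter, a trivial forward model standing in for the finite element solve, and a bare-bones genetic algorithm. The search recovers a known stiffness from a “measured” deflection:

```python
import random

def forward_model(stiffness):
    """Stand-in for the finite element solve: surface deflection under
    a fixed indentation force, modeled as force / stiffness."""
    force = 10.0
    return force / stiffness

def error(stiffness, measured_deflection):
    """Mismatch between the simulated and measured response."""
    return abs(forward_model(stiffness) - measured_deflection)

def genetic_search(measured_deflection, pop_size=50, generations=100):
    """Bare-bones genetic algorithm: keep the best half of the population
    each generation and refill it with mutated copies of the survivors."""
    random.seed(0)
    population = [random.uniform(0.1, 20.0) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: error(s, measured_deflection))
        survivors = population[: pop_size // 2]
        mutants = [max(0.1, s + random.gauss(0, 0.5)) for s in survivors]
        population = survivors + mutants
    return population[0]

# "Measurement" generated from a known stiffness of 5.0:
best = genetic_search(forward_model(5.0))
print(round(best, 1))  # should land near the true stiffness of 5.0
```

The real problem replaces the scalar with a full 3-D stiffness map and the toy model with finite element simulations, which is what makes supercomputing resources necessary.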

After numerous computer studies, the team has begun to experimentally validate this model using gelatin tissue phantoms (similar to Jell-O) with and without stiffer “tumors.” They have been running indentation experiments to measure surface displacements on the tissue and identify tumor locations. They presented their work, which is supported by the National Science Foundation, at the 2016 Inverse Problems Symposium.

“This system has the potential to significantly increase the early detection of breast cancer with no unnecessary radiation, essentially no risk, and with little additional cost,” Olson said.

Designing Nanoscale DNA-Readers

Olson, Throne and Nolte’s electromechanical technique works on the surface of the body, but an emerging class of nano-scale sensors aims to diagnose cancer from within the body.

Nanosensors must be small and sensitive, targeting specific biomarkers that may indicate the presence of cancer. They must also be able to communicate that information to an outside observer. Scientists and sci-fi authors have long predicted the rise of nanosensors, but only recently has it become feasible to engineer such technologies.

Molecular dynamics simulations on Stampede reproduced the capture of nanocarrier-DNA complex by a mutant alpha-hemolysin pore embedded in lipid bilayer. This video depicts ~35 nanoseconds of simulation time. The protein is shown in orange cutaway, the membrane in ochre lines and spheres, and the nanocarrier-DNA complex in licorice: DNA in red, proteo-nucleic acids (PNA) in teal, and polycationic peptide tag in blue and green. [Courtesy: Kai Tian, Karl Decker, Aleksei Aksimentiev, and Li-Qun Gu, University of Missouri, University of Illinois at Urbana-Champaign]

A number of scientists have been using TACC’s supercomputers to investigate aspects of this problem. One such researcher is Aleksei Aksimentiev, a professor of biological physics at the University of Illinois, Urbana-Champaign. Aksimentiev focuses on creating silicon nanopore devices that can sequence DNA inside the body to detect the telltale signs of cancer or other diseases.

A nanopore is essentially a tiny hole in a very thin membrane, through which an even smaller particle, like DNA, can pass. In addition to being precisely shaped, it must be able to attract the right molecules and induce them to pass through the pore so they can be genetically sequenced and identified.

Writing in ACS Nano in December 2016, Aksimentiev and bioengineering professor Li-Qun (Andrew) Gu from the University of Missouri’s Dalton Cardiovascular Research Center described efforts to detect genetic biomarkers using nanopores and synthetic nanocarriers. The nanocarriers selectively bind to target biomolecules, and increase their response to the electric field gradient generated by the nanopore, essentially forcing them through the hole.

The researchers showed that modestly charged nanocarriers can be used to detect and capture DNA or RNA molecules of any length or secondary structure. Such selective, molecular detection technologies would greatly improve the real-time analysis of complex clinical samples for cancer detection and other diseases.

Aksimentiev used TACC’s Stampede supercomputer, as well as Blue Waters at the National Center for Supercomputing Applications, to design and virtually test the behavior of these nanopore systems.

“In the development of nanosensors, such as the nanopore single-molecule sensor for genetic diagnosis of cancer, we can experimentally discover various clinically useful phenomena at the nanometer scale. But our collaborator, Dr. Aksimentiev can utilize their superior computational power to accurately dig out the molecular mechanisms behind these experimental observations,” said Gu. “These new nano-mechanisms can guide the design of a new generation of nanopore sensors for genetic marker-based cancer diagnostics, which we believe will play an important role in precision oncology.”

This work was supported by grants from the National Institutes of Health (R01-GM079613, R01-GM114204).

Read the full article at: https://www.tacc.utexas.edu/-/more-precise-diagnostics-for-better-cancer-outcomes 

Source: Aaron Dubrow, TACC

The post TACC Supercomputers Design, Test New Tools for Cancer Detection appeared first on HPCwire.

IBM Power Systems Streamlines CipherHealth Platform for Patient Care

Wed, 06/28/2017 - 13:39

ARMONK, N.Y., June 28, 2017 — IBM (NYSE: IBM) today announced that CipherHealth, a SaaS healthcare provider, has deployed IBM Power Systems infrastructure to run its technology platform that helps healthcare providers reduce re-admissions and improve the patient experience by providing effective patient engagement from pre-hospitalization through to post-discharge. The move to the new infrastructure has halved CipherHealth’s monthly infrastructure costs, and improved its data processing times by nearly 90 percent.

With the strong core performance of IBM’s POWER8 processor and the IBM Power Systems hypervisor, PowerVM, CipherHealth reports experiencing 2-3x the performance of its previous x86 solutions at a fraction of the cost. Offering 2x the performance per core, 4x the memory bandwidth, and 4x the number of hardware threads versus comparable x86 processors, IBM Power Systems are designed for cognitive and big data workloads.

“Our platform requires high reliability and accessibility at all times, which were hard to achieve through our existing x86-based hosted services,” said Zach Silverzweig, co-founder of CipherHealth. “Since deploying the IBM solution, we’ve experienced 2-3 times the performance at half the price. More importantly, our Power Systems infrastructure is enabling us to focus on delivering innovative products to millions of patients across the U.S.”

IBM Lab Services, which comprises professionals with extensive expertise to design, build, and deliver IT infrastructure for the cognitive era, worked with CipherHealth to develop a custom open source database (OSDB) solution based on MongoDB, PostgreSQL, Redis and NGINX built for Linux on Power Systems. The result enabled CipherHealth to implement its new private cloud on IBM Power System S824L servers, and migrate the vast majority of its client services to the new platform. Since CipherHealth was new to Power Systems, the IBM team also provided quick-start services with skills transfer to help CipherHealth get the most out of its new Power Systems environment.

“Companies can help minimize infrastructure cost by leveraging the performance and reliability of IBM Power Systems for data-rich workloads,” said Chuck Bryan, Linux and Open Source Offering Manager, IBM Systems.  “By moving to Power Systems, CipherHealth has reported that they gained lower total cost of ownership, better reliability for continuity of service to their end-clients, better performance, and the ability to shift developers’ time to innovation-focused projects.”

CipherHealth is also currently deploying an IBM Power System S822LC for Commercial Computing server to support its continuous integration testing environment. Rather than paying to scale up to 50 containers across multiple servers in the cloud, the company plans to run 150 containers on a single system, in order to reduce testing time and, ultimately, deliver new functionality to its end-users faster.

IBM Power Systems at Postgres Vision
On June 26th at Postgres Vision 2017 in Boston, Paul Zikopoulos, Vice President, Cognitive Big Data Systems at IBM, will present a keynote on ‘Exploring the Advantages of Open Technology Ecosystems in the Era of Artificial Intelligence’.

For more on IBM’s broad Linux portfolio based on open technology and an open ecosystem, visit: ibm.com/systems/power/software/linux/.

About CipherHealth
Since 2009, CipherHealth has been innovating and delivering patient engagement and care coordination solutions to help providers effectively and efficiently provide high quality care for their patients. By harnessing technology to improve patient outcomes and experiences, CipherHealth and its suite of products focus on the evolution of patient care.

Source: IBM

The post IBM Power Systems Streamlines CipherHealth Platform for Patient Care appeared first on HPCwire.

DoE Awards 24 ASCR Leadership Computing Challenge (ALCC) Projects

Wed, 06/28/2017 - 09:57

On Monday, the U.S. Department of Energy’s (DOE’s) ASCR Leadership Computing Challenge (ALCC) program awarded 24 projects a total of 2.1 billion core-hours at the Argonne Leadership Computing Facility (ALCF). The one-year awards are set to begin July 1. Several of the 2017-2018 ALCC projects will be the first to run on the ALCF’s new 9.65 petaflops Intel-Cray supercomputer, Theta, when it opens to the full user community July 1.

Theta, of course, is based on the second-generation Intel Xeon Phi processor (a more detailed system description appears at the end of this article). Projects in the Theta Early Science Program performed science simulations on the system, but those runs served a dual purpose of helping to stress-test and evaluate Theta’s capabilities. The new projects are focused on science research.

Each year, the ALCC program selects projects with an emphasis on high-risk, high-payoff simulations in areas directly related to the DOE mission and for broadening the community of researchers capable of using leadership computing resources. In 2017, the ALCC program awarded 40 projects totaling 4.1 billion core-hours across the three ASCR facilities. More 2017/2018 projects may be announced at a later date as ALCC proposals can be submitted throughout the year.

For one of the 2017-2018 ALCC projects, Argonne physicist Katrin Heitmann will use ALCF computing resources to continue work to build a suite of multi-wavelength, multi-cosmology synthetic sky maps. The left image (red) shows the baryonic density in a large cluster of galaxies, while the right image (blue) shows the dark matter content in the same cluster.

The 24 projects awarded time at the ALCF are noted below. Some projects received additional computing time at OLCF and/or NERSC.

  • Thomas Blum from University of Connecticut received 220 million core-hours for “Hadronic Light-by-Light Scattering and Vacuum Polarization Contributions to the Muon Anomalous Magnetic Moment from Lattice QCD with Chiral Fermions.”
  • Choong-Seock Chang from Princeton Plasma Physics Laboratory received 80 million core-hours for “High-Fidelity Gyrokinetic Study of Divertor Heat-Flux Width and Pedestal Structure.”
  • John T. Childers from Argonne National Laboratory received 58 million core-hours for “Simulating Particle Interactions and the Resulting Detector Response at the LHC and Fermilab.”
  • Frederico Fiuza from SLAC National Accelerator Laboratory received 50 million core-hours for “Studying Astrophysical Particle Acceleration in HED Plasmas.”
  • Marco Govoni from Argonne National Laboratory received 60 million core-hours for “Computational Engineering of Electron-Vibration Coupling Mechanisms.”
  • William Gustafson from Pacific Northwest National Laboratory received 74 million core-hours for “Large-Eddy Simulation Component of the Mesoscale Convective System Climate Model Development and Validation (CMDV-MCS) Project.”
  • Olle Heinonen from Argonne National Laboratory received 5 million core-hours for “Quantum Monte Carlo Computations of Chemical Systems.”
  • Katrin Heitmann from Argonne National Laboratory received 40 million core-hours for “Extreme-Scale Simulations for Multi-Wavelength Cosmology Investigations.”
  • Phay Ho from Argonne National Laboratory received 68 million core-hours for “Imaging Transient Structures in Heterogeneous Nanoclusters in Intense X-ray Pulses.”
  • George Karniadakis from Brown University received 20 million core-hours for “Multiscale Simulations of Hematological Disorders.”
  • Daniel Livescu from Los Alamos National Laboratory received 60 million core-hours for “Non-Boussinesq Effects on Buoyancy-Driven Variable Density Turbulence.”
  • Alessandro Lovato from Argonne National Laboratory received 35 million core-hours for “Nuclear Spectra with Chiral Forces.”
  • Elia Merzari from Argonne National Laboratory received 85 million core-hours for “High-Fidelity Numerical Simulation of Wire-Wrapped Fuel Assemblies.”
  • Paul Messina from Argonne National Laboratory received 530 million core-hours for “ECP Consortium for Exascale Computing.”
  • Aleksandr Obabko from Argonne National Laboratory received 50 million core-hours for “Numerical Simulation of Turbulent Flows in Advanced Steam Generators – Year 3.”
  • Mark Petersen from Los Alamos National Laboratory received 25 million core-hours for “Understanding the Role of Ice Shelf-Ocean Interactions in a Changing Global Climate.”
  • Benoit Roux from the University of Chicago received 80 million core-hours for “Protein-Protein Recognition and HPC Infrastructure.”
  • Emily Shemon from Argonne National Laboratory received 44 million core-hours for “Elimination of Modeling Uncertainties through High-Fidelity Multiphysics Simulation to Improve Nuclear Reactor Safety and Economics.”
  • Ilja Siepmann from University of Minnesota received 130 million core-hours for “Predictive Modeling of Functional Nanoporous Materials, Nanoparticle Assembly, and Reactive Systems.”
  • Tjerk Straatsma from Oak Ridge National Laboratory received 20 million core-hours for “Portable Application Development for Next-Generation Supercomputer Architectures.”
  • Sergey Syritsyn from RIKEN BNL Research Center received 135 million core-hours for “Nucleon Structure and Electric Dipole Moments with Physical Chirally-Symmetric Quarks.”
  • Sergey Varganov from University of Nevada, Reno received 42 million core-hours for “Spin-Forbidden Catalysis on Metal-Sulfur Proteins.”
  • Robert Voigt from Leidos received 110 million core-hours for “Demonstration of the Scalability of Programming Environments By Simulating Multi-Scale Applications.”
  • Brian Wirth from Oak Ridge National Laboratory received 98 million core-hours for “Modeling Helium-Hydrogen Plasma Mediated Tungsten Surface Response to Predict Fusion Plasma Facing Component Performance.”

Managed by the Advanced Scientific Computing Research (ASCR) program within DOE’s Office of Science, the ALCC program provides awards of computing time that range from a few million to several-hundred-million core-hours to researchers from industry, academia, and government agencies. These allocations support work at the ALCF, the Oak Ridge Leadership Computing Facility (OLCF), and the National Energy Research Scientific Computing Center (NERSC), all DOE Office of Science User Facilities.

Theta Description (from ALCF web site):
Designed in collaboration with Intel and Cray, Theta will serve as a stepping stone to the ALCF’s next leadership-class supercomputer, Aurora. Both Theta and Aurora will be massively parallel, many-core systems based on Intel processors and interconnect technology, a new memory architecture, and a Lustre-based parallel file system, all integrated by Cray’s HPC software stack.

Theta is equipped with 3,624 nodes, each containing a 64-core processor with 16 gigabytes (GB) of high-bandwidth in-package memory (MCDRAM), 192 GB of DDR4 RAM, and a 128 GB SSD. Theta’s initial parallel file system is 10 petabytes.

Theta has several features that will allow scientific codes to achieve higher performance, including:

  • High-bandwidth MCDRAM (300 – 450 GB/s depending on memory and cluster mode), with many applications running entirely in MCDRAM or using it effectively with DDR4 RAM
  • Improved single thread performance
  • Potentially much better vectorization with AVX-512
  • Large total memory per node (208 GB on Theta vs. 16 GB on Mira)
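On KNL-class systems such as Theta, how an application uses MCDRAM depends on the configured memory mode: in cache mode MCDRAM acts as a transparent last-level cache in front of DDR4, while in flat mode it is exposed as a separate CPU-less NUMA node that can be targeted with standard NUMA tools. A minimal command-line sketch, assuming flat mode, MCDRAM as NUMA node 1, and a hypothetical application binary `./app`:

```shell
# Flat mode: MCDRAM appears as its own NUMA node (node 1 on a single-socket KNL).
numactl --hardware          # inspect nodes; the MCDRAM node reports ~16 GB and no CPUs

# Bind all allocations to MCDRAM -- allocations fail if the 16 GB footprint is exceeded
numactl --membind=1 ./app

# Prefer MCDRAM, spilling over to DDR4 once it fills (safer for footprints above 16 GB)
numactl --preferred=1 ./app
```

Codes with footprints larger than 16 GB can alternatively place only their bandwidth-critical arrays in MCDRAM, e.g. via the memkind library’s `hbw_malloc` interface.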

Theta System Configuration

  • 20 racks
  • 3,624 nodes
  • 231,936 cores
  • 56 TB MCDRAM
  • 679 TB DDR4
  • 453 TB SSD
  • Aries interconnect with Dragonfly configuration
  • 10 PB Lustre file system
  • Peak performance of 9.65 petaflops
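The aggregate figures above follow directly from the per-node specifications (3,624 nodes, each with 64 cores, 16 GB MCDRAM, 192 GB DDR4, and a 128 GB SSD). A quick sanity check, assuming the memory totals are reported in binary terabytes (TiB) and, for the peak-performance figure, a KNL-class 1.3 GHz clock with 32 double-precision flops per core per cycle (an assumption not stated in the article):

```python
NODES = 3624
CORES_PER_NODE = 64
MCDRAM_GB, DDR4_GB, SSD_GB = 16, 192, 128  # per node

cores = NODES * CORES_PER_NODE             # total core count
mcdram_tib = NODES * MCDRAM_GB / 1024      # ~56.6  -> listed as "56 TB"
ddr4_tib = NODES * DDR4_GB / 1024          # ~679.5 -> listed as "679 TB"
ssd_tib = NODES * SSD_GB / 1024            # 453.0  -> listed as "453 TB"

# Peak flops: assumes 1.3 GHz and 32 DP flops/core/cycle (AVX-512 FMA on two VPUs)
peak_pflops = cores * 1.3e9 * 32 / 1e15    # ~9.65 petaflops

print(cores, round(mcdram_tib, 1), round(ddr4_tib, 1), ssd_tib, round(peak_pflops, 2))
```

The per-core clock and flops-per-cycle figures are illustrative; they reproduce the quoted 9.65 petaflops but are not confirmed by the article itself.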

Link to ALCF Article: http://www.alcf.anl.gov/articles/alcc-program-awards-alcf-computing-time-24-projects

Link to ASCR Leadership Computing Challenge: https://science.energy.gov/ascr/facilities/accessing-ascr-facilities/alcc/alcc-current-awards/

Source: Argonne Leadership Computing Facility

The post DoE Awards 24 ASCR Leadership Computing Challenge (ALCC) Projects appeared first on HPCwire.

Equus Compute Solutions Qualifies as 2017 Intel Platinum Technology Provider

Wed, 06/28/2017 - 09:05

June 28, 2017 — Equus Compute Solutions announced it has qualified as a 2017 Intel Platinum Technology Provider in both the HPC Data Center Specialist and Cloud Data Center Specialist categories. To receive these designations, Equus demonstrated commitment and excellence in deploying Intel-based data center solutions. Equus technical staff successfully completed a set of rigorous HPC and Cloud data center-focused training courses designed to build enhanced proficiency in delivering these leading technologies.

As an Intel Platinum Technology Provider, Equus has access to a number of value-added benefits. Access to Intel training and resources ensures Equus customers gain market-leading insights into the latest technologies and solutions. Collaboration with Intel cloud experts helps Equus deliver the right configuration, tailored specifically to customer requirements. The ability to leverage Intel test tools means Equus can accelerate solution schedules, ensure high quality, and offer customers the lowest total cost of ownership.

“Working closely with Intel at this Platinum level means Equus can help our customers deploy the most advanced software defined infrastructure solutions,” said Steve Grady, VP Customer Solutions. “We look forward to combining our Technology Provider program expertise with the Intel Builders Programs: Cloud, Storage and Network to create custom cost-effective solutions.”

More information on the Intel-powered Equus software defined infrastructure solutions is available at http://www.equuscs.com/servers .

About Equus

Equus Compute Solutions customizes white box servers and storage solutions to enable flexible software-defined infrastructures. Delivering low-cost solutions for the enterprise, software appliance vendors, and cloud providers, Equus is one of the leading white-box systems and solutions integrators. Over the last 28 years, we have delivered more than 3.5 million custom-configured servers, software appliances, desktops, and notebooks throughout the world. Our advanced systems support software-defined storage, networking, and virtualization that enable a new generation of hyper-converged scale-out applications and solutions. From components to complete servers purchased online through ServersDirect.com, to fully customized fixed-configurations, white box is our DNA. Custom cost-optimized compute solutions are what we do, and driving successful customer business outcomes is what we deliver. Find out how to enable your software-defined world with us at www.equuscs.com.

Source: Equus

The post Equus Compute Solutions Qualifies as 2017 Intel Platinum Technology Provider appeared first on HPCwire.