Sequential model-based optimization (also known as Bayesian optimization) is one of the most efficient methods (per function evaluation) of function minimization. This efficiency makes it appropriate for optimizing the hyperparameters of machine learning algorithms that are slow to train. The Hyperopt library provides algorithms and parallelization infrastructure for performing hyperparameter optimization (model selection) in Python. This paper presents an introductory tutorial on the usage of the Hyperopt library, including the description of search spaces, minimization (in serial and parallel), and the analysis of the results collected in the course of minimization. This paper also gives an overview of Hyperopt-Sklearn, a software project that provides automatic algorithm configuration of the Scikit-learn machine learning library. Following Auto-Weka, we take the view that the choice of classifier and even the choice of preprocessing module can be taken together to represent a single large hyperparameter optimization problem. We use Hyperopt to define a search space that encompasses many standard components (e.g. SVM, RF, KNN, PCA, TFIDF) and common patterns of composing them together. We demonstrate, using search algorithms in Hyperopt and standard benchmarking data sets (MNIST, 20-newsgroups, convex shapes), that searching this space is practical and effective. In particular, we improve on best-known scores for the model space for both MNIST and convex shapes. The paper closes with some discussion of ongoing and future work.
ISSN: 1749-4699
Computational Science & Discovery is an international, multidisciplinary journal focused on scientific advances and discovery through computational science in physics, chemistry, biology and applied science.
Computational Science & Discovery will cease publication with the 2015 volume. For further details on this decision, please visit https://ioppublishing.org/newsDetails/computational-science-and-discovery-to-cease-publication.
Open all abstracts, in this tab
James Bergstra et al 2015 Comput. Sci. Discov. 8 014008
Suraj S Deshpande et al 2013 Comput. Sci. Discov. 5 014016
The performance of the open source multiphase flow solver, interFoam, is evaluated in this work. The solver is based on a modified volume of fluid (VoF) approach, which incorporates an interfacial compression flux term to mitigate the effects of numerical smearing of the interface. It forms a part of the C + + libraries and utilities of OpenFOAM and is gaining popularity in the multiphase flow research community. However, to the best of our knowledge, the evaluation of this solver is confined to the validation tests of specific interest to the users of the code and the extent of its applicability to a wide range of multiphase flow situations remains to be explored. In this work, we have performed a thorough investigation of the solver performance using a variety of verification and validation test cases, which include (i) verification tests for pure advection (kinematics), (ii) dynamics in the high Weber number limit and (iii) dynamics of surface tension-dominated flows. With respect to (i), the kinematics tests show that the performance of interFoam is generally comparable with the recent algebraic VoF algorithms; however, it is noticeably worse than the geometric reconstruction schemes. For (ii), the simulations of inertia-dominated flows with large density ratios yielded excellent agreement with analytical and experimental results. In regime (iii), where surface tension is important, consistency of pressure–surface tension formulation and accuracy of curvature are important, as established by Francois et al (2006 J. Comput. Phys. 213 141–73). Several verification tests were performed along these lines and the main findings are: (a) the algorithm of interFoam ensures a consistent formulation of pressure and surface tension; (b) the curvatures computed by the solver converge to a value slightly (10%) different from the analytical value and a scope for improvement exists in this respect. To reduce the disruptive effects of spurious currents, we followed the analysis of Galusinski and Vigneaux (2008 J. Comput. Phys. 227 6140–64) and arrived at the following criterion for stable capillary simulations for interFoam: where . Finally, some capillary flows relevant to atomization were simulated, resulting in good agreement with the results from the literature.
Lion Krischer et al 2015 Comput. Sci. Discov. 8 014003
The Python libraries NumPy and SciPy are extremely powerful tools for numerical processing and analysis well suited to a large variety of applications. We developed ObsPy (http://obspy.org), a Python library for seismology intended to facilitate the development of seismological software packages and workflows, to utilize these abilities and provide a bridge for seismology into the larger scientific Python ecosystem. Scientists in many domains who wish to convert their existing tools and applications to take advantage of a platform like the one Python provides are confronted with several hurdles such as special file formats, unknown terminology, and no suitable replacement for a non-trivial piece of software. We present an approach to implement a domain-specific time series library on top of the scientific NumPy stack. In so doing, we show a realization of an abstract internal representation of time series data permitting I/O support for a diverse collection of file formats. Then we detail the integration and repurposing of well established legacy codes, enabling them to be used in modern workflows composed in Python. Finally we present a case study on how to integrate research code into ObsPy, opening it to the broader community. While the implementations presented in this work are specific to seismology, many of the described concepts and abstractions are directly applicable to other sciences, especially to those with an emphasis on time series analysis.
Geoffrey M Poore 2015 Comput. Sci. Discov. 8 014010
PythonTeX is a LaTeX package that allows Python code in LaTeX documents to be executed and provides access to the output. This makes possible reproducible documents that combine results with the code required to generate them. Calculations and figures may be next to the code that created them. Since code is adjacent to its output in the document, editing may be more efficient. Since code output may be accessed programmatically in the document, copy-and-paste errors are avoided and output is always guaranteed to be in sync with the code that generated it. This paper provides an introduction to PythonTeX and an overview of major features, including performance optimizations, debugging tools, and dependency tracking. Several complete examples are presented. Finally, advanced features are summarized. Though PythonTeX was designed for Python, it may be extended to support additional languages; support for the Ruby and Julia languages is already included. PythonTeX contains a utility for converting documents into plain LaTeX, suitable for format conversion, sharing, and journal submission.
The SunPy Community et al 2015 Comput. Sci. Discov. 8 014009
This paper presents SunPy (version 0.5), a community-developed Python package for solar physics. Python, a free, cross-platform, general-purpose, high-level programming language, has seen widespread adoption among the scientific community, resulting in the availability of a large number of software packages, from numerical computation (NumPy, SciPy) and machine learning (scikit-learn) to visualization and plotting (matplotlib). SunPy is a data-analysis environment specializing in providing the software necessary to analyse solar and heliospheric data in Python. SunPy is open-source software (BSD licence) and has an open and transparent development workflow that anyone can contribute to. SunPy provides access to solar data through integration with the Virtual Solar Observatory (VSO), the Heliophysics Event Knowledgebase (HEK), and the HELiophysics Integrated Observatory (HELIO) webservices. It currently supports image data from major solar missions (e.g., SDO, SOHO, STEREO, and IRIS), time-series data from missions such as GOES, SDO/EVE, and PROBA2/LYRA, and radio spectra from e-Callisto and STEREO/SWAVES. We describe SunPyʼs functionality, provide examples of solar data analysis in SunPy, and show how Python-based solar data-analysis can leverage the many existing tools already available in Python. We discuss the future goals of the project and encourage interested users to become involved in the planning and development of SunPy.
Michael Gittings et al 2008 Comput. Sci. Discov. 1 015005
We describe RAGE, the 'radiation adaptive grid Eulerian' radiation-hydrodynamics code, including its data structures, its parallelization strategy and performance, its hydrodynamic algorithm(s), its (gray) radiation diffusion algorithm, and some of the considerable amount of verification and validation efforts. The hydrodynamics is a basic Godunov solver, to which we have made significant improvements to increase the advection algorithm's robustness and to converge stiffnesses in the equation of state. Similarly, the radiation transport is a basic gray diffusion, but our treatment of the radiation–material coupling, wherein we converge nonlinearities in a novel manner to allow larger timesteps and more robust behavior, can be applied to any multi-group transport algorithm.
Jaydeep P Bardhan 2013 Comput. Sci. Discov. 5 013001
We review the mathematical and computational foundations for implicit-solvent models in theoretical chemistry and molecular biophysics. These models are valuable theoretical tools for studying the influence of a solvent, often water or an aqueous electrolyte, on a molecular solute such as a protein. Detailed chemical and physical aspects of implicit-solvent models have been addressed in numerous exhaustive reviews, as have numerical algorithms for simulating the most popular models. This work highlights several important conceptual developments, focusing on selected works that spotlight the need for research at the intersections between chemical, biological, mathematical, and computational physics. To introduce the field to computational scientists, we begin by describing the basic theoretical ideas of implicit-solvent models and numerical implementations. We then address practical and philosophical challenges in parameterization, and major advances that speed up calculations (covering continuum theories based on Poisson as well as faster approximate theories such as generalized Born). We briefly describe the main shortcomings of existing models, and survey promising developments that deliver improved realism in a computationally tractable way, i.e. without increasing simulation time significantly. The review concludes with a discussion of ongoing modeling challenges and relevant trends in high-performance computing and computational science.
Matthew G Reuter and Judith C Hill 2013 Comput. Sci. Discov. 5 014009
We present an algorithm for computing any block of the inverse of a block tridiagonal, nearly block Toeplitz matrix (defined as a block tridiagonal matrix with a small number of deviations from the purely block Toeplitz structure). By exploiting both the block tridiagonal and the nearly block Toeplitz structures, this method scales independently of the total number of blocks in the matrix and linearly with the number of deviations. Numerical studies demonstrate this scaling and the advantages of our method over alternatives.
J H Chen et al 2009 Comput. Sci. Discov. 2 015001
Computational science is paramount to the understanding of underlying processes in internal combustion engines of the future that will utilize non-petroleum-based alternative fuels, including carbon-neutral biofuels, and burn in new combustion regimes that will attain high efficiency while minimizing emissions of particulates and nitrogen oxides. Next-generation engines will likely operate at higher pressures, with greater amounts of dilution and utilize alternative fuels that exhibit a wide range of chemical and physical properties. Therefore, there is a significant role for high-fidelity simulations, direct numerical simulations (DNS), specifically designed to capture key turbulence-chemistry interactions in these relatively uncharted combustion regimes, and in particular, that can discriminate the effects of differences in fuel properties. In DNS, all of the relevant turbulence and flame scales are resolved numerically using high-order accurate numerical algorithms. As a consequence terascale DNS are computationally intensive, require massive amounts of computing power and generate tens of terabytes of data. Recent results from terascale DNS of turbulent flames are presented here, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame. Computing at this scale requires close collaborations between computer and combustion scientists to provide optimized scaleable algorithms and software for terascale simulations, efficient collective parallel I/O, tools for volume visualization of multiscale, multivariate data and automating the combustion workflow. The enabling computer science, applied to combustion science, is also required in many other terascale physics and engineering simulations. In particular, performance monitoring is used to identify the performance of key kernels in the DNS code, S3D and especially memory intensive loops in the code. Through the careful application of loop transformations, data reuse in cache is exploited thereby reducing memory bandwidth needs, and hence, improving S3D's nodal performance. To enhance collective parallel I/O in S3D, an MPI-I/O caching design is used to construct a two-stage write-behind method for improving the performance of write-only operations. The simulations generate tens of terabytes of data requiring analysis. Interactive exploration of the simulation data is enabled by multivariate time-varying volume visualization. The visualization highlights spatial and temporal correlations between multiple reactive scalar fields using an intuitive user interface based on parallel coordinates and time histogram. Finally, an automated combustion workflow is designed using Kepler to manage large-scale data movement, data morphing, and archival and to provide a graphical display of run-time diagnostics.
Serge Guelton et al 2015 Comput. Sci. Discov. 8 014001
Pythran is an open source static compiler that turns modules written in a subset of Python language into native ones. Assuming that scientific modules do not rely much on the dynamic features of the language, it trades them for powerful, possibly inter-procedural, optimizations. These optimizations include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical idioms such as expression templates. Unlike the Cython approach, Pythran input code remains compatible with the Python interpreter. Output code is generally as efficient as the annotated Cython equivalent, if not more, but without the backward compatibility loss.
Open all abstracts, in this tab
The SunPy Community et al 2015 Comput. Sci. Discov. 8 014009
This paper presents SunPy (version 0.5), a community-developed Python package for solar physics. Python, a free, cross-platform, general-purpose, high-level programming language, has seen widespread adoption among the scientific community, resulting in the availability of a large number of software packages, from numerical computation (NumPy, SciPy) and machine learning (scikit-learn) to visualization and plotting (matplotlib). SunPy is a data-analysis environment specializing in providing the software necessary to analyse solar and heliospheric data in Python. SunPy is open-source software (BSD licence) and has an open and transparent development workflow that anyone can contribute to. SunPy provides access to solar data through integration with the Virtual Solar Observatory (VSO), the Heliophysics Event Knowledgebase (HEK), and the HELiophysics Integrated Observatory (HELIO) webservices. It currently supports image data from major solar missions (e.g., SDO, SOHO, STEREO, and IRIS), time-series data from missions such as GOES, SDO/EVE, and PROBA2/LYRA, and radio spectra from e-Callisto and STEREO/SWAVES. We describe SunPyʼs functionality, provide examples of solar data analysis in SunPy, and show how Python-based solar data-analysis can leverage the many existing tools already available in Python. We discuss the future goals of the project and encourage interested users to become involved in the planning and development of SunPy.
Geoffrey M Poore 2015 Comput. Sci. Discov. 8 014010
PythonTeX is a LaTeX package that allows Python code in LaTeX documents to be executed and provides access to the output. This makes possible reproducible documents that combine results with the code required to generate them. Calculations and figures may be next to the code that created them. Since code is adjacent to its output in the document, editing may be more efficient. Since code output may be accessed programmatically in the document, copy-and-paste errors are avoided and output is always guaranteed to be in sync with the code that generated it. This paper provides an introduction to PythonTeX and an overview of major features, including performance optimizations, debugging tools, and dependency tracking. Several complete examples are presented. Finally, advanced features are summarized. Though PythonTeX was designed for Python, it may be extended to support additional languages; support for the Ruby and Julia languages is already included. PythonTeX contains a utility for converting documents into plain LaTeX, suitable for format conversion, sharing, and journal submission.
James Bergstra et al 2015 Comput. Sci. Discov. 8 014007
Machine learning benchmark data sets come in all shapes and sizes, whereas classification algorithms assume sanitized input, such as (x, y) pairs with vector-valued input x and integer class label y. Researchers and practitioners know all too well how tedious it can be to get from the URL of a new data set to a NumPy ndarray suitable for e.g. pandas or sklearn. The SkData library handles that work for a growing number of benchmark data sets (small and large) so that one-off in-house scripts for downloading and parsing data sets can be replaced with library code that is reliable, community-tested, and documented. The SkData library also introduces an open-ended formalization of training and testing protocols that facilitates direct comparison with published research. This paper describes the usage and architecture of the SkData library.
James Bergstra et al 2015 Comput. Sci. Discov. 8 014008
Sequential model-based optimization (also known as Bayesian optimization) is one of the most efficient methods (per function evaluation) of function minimization. This efficiency makes it appropriate for optimizing the hyperparameters of machine learning algorithms that are slow to train. The Hyperopt library provides algorithms and parallelization infrastructure for performing hyperparameter optimization (model selection) in Python. This paper presents an introductory tutorial on the usage of the Hyperopt library, including the description of search spaces, minimization (in serial and parallel), and the analysis of the results collected in the course of minimization. This paper also gives an overview of Hyperopt-Sklearn, a software project that provides automatic algorithm configuration of the Scikit-learn machine learning library. Following Auto-Weka, we take the view that the choice of classifier and even the choice of preprocessing module can be taken together to represent a single large hyperparameter optimization problem. We use Hyperopt to define a search space that encompasses many standard components (e.g. SVM, RF, KNN, PCA, TFIDF) and common patterns of composing them together. We demonstrate, using search algorithms in Hyperopt and standard benchmarking data sets (MNIST, 20-newsgroups, convex shapes), that searching this space is practical and effective. In particular, we improve on best-known scores for the model space for both MNIST and convex shapes. The paper closes with some discussion of ongoing and future work.
Ursula Iturrarán-Viveros and Miguel Molero-Armenta 2015 Comput. Sci. Discov. 8 014006
Graphics processing units (GPUs) have become increasingly powerful in recent years. Programs exploring the advantages of this architecture could achieve large performance gains and this is the aim of new initiatives in high performance computing. The objective of this work is to develop an efficient tool to model 2D elastic wave propagation on parallel computing devices. To this end, we implement the elastodynamic finite integration technique, using the industry open standard open computing language (OpenCL) for cross-platform, parallel programming of modern processors, and an open-source toolkit called [Py]OpenCL. The code written with [Py]OpenCL can run on a wide variety of platforms; it can be used on AMD or NVIDIA GPUs as well as classical multicore CPUs, adapting to the underlying architecture. Our main contribution is its implementation with local and global memory and the performance analysis using five different computing devices (including Kepler, one of the fastest and most efficient high performance computing technologies) with various operating systems.
Open all abstracts, in this tab
The SunPy Community et al 2015 Comput. Sci. Discov. 8 014009
This paper presents SunPy (version 0.5), a community-developed Python package for solar physics. Python, a free, cross-platform, general-purpose, high-level programming language, has seen widespread adoption among the scientific community, resulting in the availability of a large number of software packages, from numerical computation (NumPy, SciPy) and machine learning (scikit-learn) to visualization and plotting (matplotlib). SunPy is a data-analysis environment specializing in providing the software necessary to analyse solar and heliospheric data in Python. SunPy is open-source software (BSD licence) and has an open and transparent development workflow that anyone can contribute to. SunPy provides access to solar data through integration with the Virtual Solar Observatory (VSO), the Heliophysics Event Knowledgebase (HEK), and the HELiophysics Integrated Observatory (HELIO) webservices. It currently supports image data from major solar missions (e.g., SDO, SOHO, STEREO, and IRIS), time-series data from missions such as GOES, SDO/EVE, and PROBA2/LYRA, and radio spectra from e-Callisto and STEREO/SWAVES. We describe SunPyʼs functionality, provide examples of solar data analysis in SunPy, and show how Python-based solar data-analysis can leverage the many existing tools already available in Python. We discuss the future goals of the project and encourage interested users to become involved in the planning and development of SunPy.
Jaydeep P Bardhan 2013 Comput. Sci. Discov. 5 013001
We review the mathematical and computational foundations for implicit-solvent models in theoretical chemistry and molecular biophysics. These models are valuable theoretical tools for studying the influence of a solvent, often water or an aqueous electrolyte, on a molecular solute such as a protein. Detailed chemical and physical aspects of implicit-solvent models have been addressed in numerous exhaustive reviews, as have numerical algorithms for simulating the most popular models. This work highlights several important conceptual developments, focusing on selected works that spotlight the need for research at the intersections between chemical, biological, mathematical, and computational physics. To introduce the field to computational scientists, we begin by describing the basic theoretical ideas of implicit-solvent models and numerical implementations. We then address practical and philosophical challenges in parameterization, and major advances that speed up calculations (covering continuum theories based on Poisson as well as faster approximate theories such as generalized Born). We briefly describe the main shortcomings of existing models, and survey promising developments that deliver improved realism in a computationally tractable way, i.e. without increasing simulation time significantly. The review concludes with a discussion of ongoing modeling challenges and relevant trends in high-performance computing and computational science.
Tarek Echekki 2009 Comput. Sci. Discov. 2 013001
A principal challenge in modeling turbulent combustion flows is associated with their complex, multiscale nature. Traditional paradigms in the modeling of these flows have attempted to address this nature through different strategies, including exploiting the separation of turbulence and combustion scales and a reduced description of the composition space. The resulting moment-based methods often yield reasonable predictions of flow and reactive scalars' statistics under certain conditions. However, these methods must constantly evolve to address combustion at different regimes, modes or with dominant chemistries. In recent years, alternative multiscale strategies have emerged, which although in part inspired by the traditional approaches, also draw upon basic tools from computational science, applied mathematics and the increasing availability of powerful computational resources. This review presents a general overview of different strategies adopted for multiscale solutions of turbulent combustion flows. Within these strategies, some specific models are discussed or outlined to illustrate their capabilities and underlying assumptions. These strategies may be classified under four different classes, including (i) closure models for atomistic processes, (ii) multigrid and multiresolution strategies, (iii) flame-embedding strategies and (iv) hybrid large-eddy simulation–low-dimensional strategies. A combination of these strategies and models can potentially represent a robust alternative strategy to moment-based models; but a significant challenge remains in the development of computational frameworks for these approaches as well as their underlying theories.