Introduction to high performance computing: what it is, how to use it, and when to use what. It provides a detailed checklist for building pipelines, along with tips to optimize cluster usage and reduce time spent waiting in the queue, plus a quick overview of the resources available in Compute Canada.
High performance computing tutorial, with checklist and tips to optimize cluster usage
1. Introduction to high performance computing: what, when and how? Pradeep Reddy Raamana crossinvalidation.com
2. Raamana • Self-explanatory! • process a batch of jobs, in sequence! • non-interactive, to reduce idle time. • let’s face it: humans are slow!! • Reduces startup & shutdown times, when run separately. • Efficient use of resources (run when systems are idle) Batch processing 2
3. Raamana What is [not] HPC? ✓ Simply a multi-user, shared and smart batch processing system ✓ Improves the scale & size of processing significantly ✓ With raw power & parallelization ✓ Thanks to rapid advances in low cost micro-processors, high-speed networks and optimized software ✓ Imagine a big bulldozer! ✘ does not write your code! ✘ does not debug your code! ✘ does not speed up your code! ✘ does not think for you, or write your paper!
4. Raamana Components of HPC cluster • Just 3 things! • Headnode (login node / scheduler) • Cluster • Filesystem (parallel file system) [Diagram: your terminal submits jobs to the login node/scheduler, which dispatches them to the cluster; data flows to and from the parallel file system.]
5. Raamana Is HPC a supercomputer? • No and Yes • Supercomputers —> a single very-super-large task • HPC —> many small tasks • by “high”, we typically mean “large amount” of performance [Figure: a huge problem drawn as a mountain, alongside a supercomputer and a rock.]
6. Raamana Benefits of HPC cluster • Cost-effective • Much cheaper than a super-computer with the same amount of computing power! • When the supercomputer crashes, everything crashes! • When a single/few nodes in HPC fail, cluster continues to function. • Highly scalable • Multi-user shared environment: not everyone needs all the computing power all the time. • higher utilization: can accommodate variety of workloads (#CPUs, memory etc), at the same time. • Can be expanded or shrunk, as needed. 6
11. Raamana Scheduler • Allocates jobs to nodes (i.e. time on resources available) • Applies priorities to jobs, according to policies and usage • Enforces limits on usage (restricts jobs to its spec) • Coordinates with the resource manager (accounting etc) • Manages different queues within a cluster • customized with different limits on memory, CPU speed and number of parallel tasks etc. • Manages dependencies (different “steps” within the same job)! 11
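For the dependency management mentioned above, a minimal SLURM sketch (the script names are hypothetical): submit a step, capture its job ID, and make the next step wait for it.
> jobid=$(sbatch --parsable preprocess.sh)         # --parsable prints just the job ID
> sbatch --dependency=afterok:$jobid analysis.sh   # runs only if the first step completed successfully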
12. Raamana Cluster or farm • Receive jobs from master … • Serve the master, obediently • Take a break, once in a while. [Figure: compute nodes grouped into queues, e.g. “General” and “Heavy”.]
14. Raamana File-system • Major roles: • reduce latency in read/write • perform regular backup • Enable concurrent access • to all nodes • to all users • Amazing engineering behind! • Typical areas: /home /scratch /work
15. Raamana When to use HPC? • When the task is too big (memory) to fit on your own desktop computer! • When you have many small tasks with different parameters! • same task, many different subjects or conditions etc. • same pattern of computing, if not same task. • When your jobs ran too long (months!) • Need > 1 terabyte of disk space • High-speed data access - really high! • No downsides in using it in most cases!!
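For the “same task, many subjects” case above, a SLURM job array is one natural fit; a minimal sketch, assuming a plain-text subjects.txt with one ID per line (the file name, resources and the script's --subject flag are illustrative):
#!/bin/bash
#SBATCH --job-name=per_subject
#SBATCH --array=1-50        # 50 independent tasks, one per subject
#SBATCH -n 1
#SBATCH --mem=4G
#SBATCH -t 1:00:00
# pick the subject ID for this array task from the list
subject=$(sed -n "${SLURM_ARRAY_TASK_ID}p" subjects.txt)
python /home/quark/script.py --subject "$subject"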
16. Raamana Should I use HPC? [Figure: laptop vs. HPC compared on total time, development effort, reliability, and ease of use.] Judge it over the full project timeline, through publication and revisions, not just until the first result!
17. Raamana When to avoid HPC? • When interaction is a big part! • When visualization is a big part! • When you are still “improving” the algorithm • Debugging, profiling and optimizing code • BUT, sometimes you need to deploy on big nodes to test them. Then it's necessary. • PS: interaction and visualization are both possible - they just need more effort to set up. [Illustration: “> interact” … still waiting; “> visualize” … display missing, error!]
18. Raamana Types of schedulers (*Raamana's personal opinion)
Variables: SGE easy; Torque/PBS hard; SLURM easier
Features: SGE reasonable; Torque/PBS reasonable; SLURM many
Support: SGE low; Torque/PBS low; SLURM well-supported
Administration: SGE not hard; Torque/PBS hard; SLURM easy
Scalability: SGE medium; Torque/PBS low; SLURM high (millions of jobs)
Popularity: SGE okay; Torque/PBS not very; SLURM highly popular
19. Raamana Job Priority
First-come first-serve: jobs waiting longer get priority
Size of resources requested: smaller requests get priority; typically “whole node” jobs are preferred (over partial use)
Fair-share: groups/users with lesser usage get priority
Resource allocation: users or groups with prior allocations get higher priority; must be set up in advance
*Terms and conditions apply, e.g. dependencies, availability of resources, type of queue requested etc.
https://docs.computecanada.ca/wiki/Job_scheduling_policies
20. Raamana Resource specification (SLURM vs SGE)
number of nodes: SLURM -N [min[-max]]; SGE n/a
number of CPUs: SLURM -n [count]; SGE -pe [PE] [count]
memory (RAM): SLURM --mem [size[units]]; SGE -l mem_free=[size[units]]
total time (wall clock limit): SLURM -t [days-hh:mm:ss] or -t [min]; SGE -l h_rt=[seconds]
export user environment: SLURM --export=[ALL | NONE | variables]; SGE -V
naming a job (important): SLURM --job-name=[name]; SGE -N [name]
output log (stdout): SLURM -o [file_name]; SGE -o [file_name]
error log (stderr): SLURM -e [file_name]; SGE -e [file_name]
join stdout and stderr: SLURM by default, unless -e is specified; SGE -j yes
queue / partition: SLURM -p [queue]; SGE -q [queue]
script directive (inside script): SLURM #SBATCH; SGE #$
job notification via email: SLURM --mail-type=[events]; SGE -m abe
email address for notifications: SLURM --mail-user=[address]; SGE -M [address]
Useful glossary: https://www.computecanada.ca/research-portal/accessing-resources/glossary/
21. Raamana Node specification (SLURM)
restrict to particular nodes: --nodelist=intel[1-5]
exclude certain nodes: --exclude=amd[6-9]
based on features (tags): --constraint=“intel&gpu”
to a specific partition or queue: --partition intel_gpu
based on number of cores/threads: --extra-node-info=<sockets[:cores[:threads]]>
type of computation: --hint=[compute_bound,memory_bound,multithread]
contiguous nodes: --contiguous
CPU frequency: --cpu-freq=[Performance,Conservative,PowerSave]
Useful glossary: https://www.computecanada.ca/research-portal/accessing-resources/glossary/
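Putting a few of these flags together on the command line, for example (the node names, constraint tag, partition and job script are illustrative; what actually exists depends on the cluster):
> sbatch --partition=intel_gpu --constraint="intel&gpu" --exclude='amd[6-9]' -n 4 --mem=8G -t 0-4:00 my_job.sh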
22. Raamana Making a job from a script
#!/bin/bash
#SBATCH -p general        # which partition/queue
#SBATCH -N 1              # number of nodes
#SBATCH -n 1              # number of cores
#SBATCH --mem=4G          # total memory
#SBATCH -t 0-2:00         # time (D-HH:MM)
#SBATCH -o my_output.txt  # output log
# command to invoke your script, e.g.:
python /home/quark/script.py
R CMD BATCH stats.R
matlab -nodesktop -r matrix.m
Recommended: specify the resources when submitting, e.g.
> sbatch -n 1 --mem=4G -t 2:00 -o my_output.txt python /home/quark/script.py
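For comparison, a rough SGE equivalent of the header above, using the directives from the resource-specification table (the queue name and the “smp” parallel environment are placeholders; both vary by site):
#!/bin/bash
#$ -q general              # queue (placeholder)
#$ -pe smp 1               # number of cores, via a parallel environment (site-specific)
#$ -l mem_free=4G          # memory
#$ -l h_rt=02:00:00        # wall-clock limit
#$ -o my_output.txt        # output log
#$ -j y                    # join stdout and stderr
python /home/quark/script.py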
23. Raamana Invoking a script from the shell
shell script: bash script.sh
python: python script.py
matlab: matlab -nodesktop -r script.m
R: R CMD BATCH script.R
24. Raamana Being precise and smaller is wiser! (backfill policy) • With backfill, the scheduler can look ahead and fill otherwise-unused gaps in the cluster with jobs waiting in the queue. • Without backfill, queue/priority order would be strict, jobs would wait until resources free up, and the cluster would be under-utilized. • Backfill might not always be enabled, and it assumes users specified their resources. • Hence, an army of small jobs is better than a few big jobs! [Figure: CPUs (1-32) vs. time, showing jobs from the tail of the queue slotted into CPUs that would otherwise sit idle while the head of the queue runs.]
26. Raamana Splitting your workflow • Some tasks are highly parallel • painting different walls • Some tasks have to wait for others • Installing the roof needs all walls built first.
27. Raamana Slicing total processing: data parallelism vs. task parallelism • you can do both, although with varying returns! • Note: these are mostly embarrassingly parallel!
28. Raamana Where to speed up? The whole task has a parallelizable part (green) and a sequential part (red) that cannot be parallelized. [Figure: bar chart, time in minutes from 0 to 100, comparing the fully sequential run against making part A 5x faster vs. making part B 5x faster.]
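This slide is Amdahl's law in picture form: if a fraction p of the run time is parallelizable and only that part is made s times faster, the overall speedup is 1 / ((1 - p) + p/s). As an illustrative example (numbers assumed, not taken from the slide): with p = 0.75 and s = 5, the speedup is 1 / (0.25 + 0.15) = 2.5x, so the sequential part quickly becomes the bottleneck no matter how much you parallelize the rest.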
32. Raamana Available Resources in Compute Canada https://www.computecanada.ca/research-portal/accessing-resources/available-resources/ • Arbutus (cloud): more fine-grained control over the software stack, OS, size etc.; web portal; like Amazon EC2; suited to long-running processing rather than batch processing. • Compute Ontario (some systems will be decommissioned soon)
33. Raamana Compute Canada: Available Resources https://www.computecanada.ca/research-portal/accessing-resources/available-resources/ • Cedar • Graham • Niagara (parallel cluster; will become operational in 2018)
34. Raamana Centre for Advanced Computing (Frontenac) • Brand-new! • Excellent support/admins. • SLURM scheduler, not SGE. • Info: https://cac.queensu.ca • Available software: https://cac.queensu.ca/wiki/index.php/Software:Frontenac
Path / Area | Quota | Backup? | Purged?
/global/home | 1 TB | Yes | No
/global/scratch | 5 TB | No | Yes
/global/project | 3 TB | Yes | No
/local | 1 TB | No | Yes
35. Raamana Rotman Grid (G4) • Convenient, if the data is already stored here at Rotman • 200+ cores, >1.2T RAM, >100T disk space • Fewer users, less competition, sometimes. • Queues from 4 CPUs/6GB to 8 CPUs/16GB • Other configurations up to 60 CPUs/360GB can be easily defined. • Admins are highly reachable. • This will get bigger, talk to Tony and Alain. • Gateway or headnode IP: 172.24.4.65
Path / Area | Quota | Backup?
/home | 1 TB | Yes
/scratch | 5 TB | No
/dataX | X TB | No
36. Raamana Checklist: before you submit
• Test and debug your code locally: start with each of the small parts of the pipeline, check whether they are integrated well, reduce redundancy, choose the right output formats etc. Sloppy testing and debugging could cost you a lot later on!
• Test your environment: run the job locally on the headnode or login node; if that is not possible, you can request an interactive job.
• Do I have enough disk space?
• Chalk out job requirements in speed, walltime, RAM, number of jobs etc., to reduce the total processing time (at the level of the dataset and experiment). You may need to select an appropriate queue or partition to match your needs and specifications (otherwise you might wait in line forever).
• Decide whether to insert checkpointing logic and code.
• Decide whether to insert profiling code (to measure effective speed in different parts of the pipeline).
• Decide whether to retain intermediate or scratch outputs.
37. Raamana Checklist: before you submit (contd.)
• Always try to specify resources! Defaults are not necessarily the best for your needs.
• The job gets scheduled more quickly when you choose the right queue/specs, and it reduces the trial and error needed to land on nodes with the right resources.
• It also reduces wastage: don't take up 8 CPUs and 32GB to print("Hello, World!").
• “Know your job” well (profiling!).
• Save the job specifications to a file (do not rely on shell history).
• Estimate requirements precisely, but be conservative in requesting: add 10-20%. If your matrix needs 2.5GB, specify 4GB for the job.
• Remember that the OS on the nodes takes up some RAM: if a node physically has 32GB, it needs 2-3GB to run and stay alive, so only jobs requiring less than ~30GB will be scheduled to it; jobs requiring exactly 32GB will be sent to nodes with more RAM (64 or 128GB).
38. Raamana Checklist: profile your job • Many tools are available in Linux to “profile” the memory usage and time usage for different parts of your program. • top / htop • free / vmstat • time • Plugins to IDEs for your language • Explicit profiling is typically not necessary! • you know your job during development. • Check the file sizes created during “trial” runs • Keep only what is necessary, after testing! • You don’t need to specify disk space, but need to ensure you won’t create more files exceeding your quota (aggregate over all jobs) • tools: quota, du or df -h • If you are unable to profile on your desktop, • request an interactive job! • Once obtained, acts like your desktop! • Need to think about whether you need a display, when you run jobs! 38
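A couple of concrete ways to do the checks above (the paths are placeholders; /usr/bin/time is GNU time, which may not be installed everywhere):
> /usr/bin/time -v python /home/quark/script.py   # reports "Maximum resident set size" and elapsed wall-clock time
> du -sh /scratch/myproject                        # how much space did the trial run actually create?
> salloc -n 1 --mem=8G -t 2:00:00                  # request an interactive SLURM job; once granted, it acts like your desktop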
39. Raamana Checklist: during execution • Regularly check on job status • because jobs fail! Many reasons! It sucks. It hurts. No matter how well you tested your code! • Some factors (like network, file system and weather) are not in your control. • Better to accept failures, and reduce the time to resubmit them. • Hence checkpointing is important! • so you reuse what was finished already before failure! • You may need to write scripts to get an accurate estimate of the status of processing! • as your pipeline can be complicated • rely on files written to disk, rather than text output in a log • unless you designed it that way
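Two SLURM commands cover most of this routine checking (the job ID is a placeholder):
> squeue -u $USER   # what is still pending (PD) or running (R), and why
> scancel 123456    # kill a clearly mis-specified job, fix it, and resubmit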
40. Raamana Checklist: after execution • Check various things!! • Check file sizes (a file being there doesn't mean it has data) • Visualize them (data being present doesn't mean it's accurate) • Sweep across all jobs! • Check disk usage. • Track usage: • memory, walltime, disk I/O etc. • to optimize job specs next time • as it's never a one-time thing! • Again, scripts can help • automating this process • mad shell skills also help.
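SLURM's accounting records make this usage tracking straightforward; a sketch (the job ID is a placeholder):
> sacct -j 123456 --format=JobID,Elapsed,MaxRSS,ReqMem,State
# compare MaxRSS (memory actually used) and Elapsed against what you requested, to tighten the next submission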
41. Raamana Checklist
Before: Test and profile code, locally! Run a test job to test the environment & config. Chalk out requirements.
During: Look for any failures! Monitor usage! Resubmit, correcting any simple mistakes.
After: Check logs & outputs. Assume the worst! Visualize and verify, do not assume!
Do: Automate checks when possible! Identify areas for optimization (repkg). Regular cleanups (shared file systems).
Avoid: Don't create too many small files! Avoid ASCII (text) format for large files. Avoid relative paths (use absolute paths). Don't use MS Word (hidden characters): use a text editor or vi.
43. Raamana Data transfer tools
download: recommended wget URL; alternatives: browser and scp
synchronize: recommended rsync -av /src server:/dest; alternatives: scp (batch, simplest), sftp (interactive), bbcp (parallel; large sizes)
reduce size: recommended tar -cvf (create/zip) and tar -xvf (extract/unzip); alternatives: zip, unzip
software: recommended FileZilla (desktop to cluster) and Globus (between clusters); alternatives: WinSCP (Windows), FireFTP (cross-platform), Transmit, Fugu (Mac)
https://docs.computecanada.ca/wiki/Globus
When in doubt, don't delete data. When in doubt, email the admins!
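A typical round trip with the recommended tools above (host names, paths and file names are placeholders):
> tar -cvf results.tar results/                    # bundle many small files into one archive first
> rsync -av results.tar user@cluster:/scratch/me/  # copy it to the cluster; rerunning skips files already transferred
> wget https://example.org/atlas.nii.gz            # pull a public file straight onto the cluster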
44. Raamana Data management plan! • Calculate size of data you’ll produce from test runs. • What do you need to “keep”, and how long? • When is data “final” and needs to be backed up? • What is scratch and deletable, and what is not? • Is the “intermediate” data easy & quick to regenerate? • If so, should you even store it? 44
45. Raamana Should I build a pipeline? • Can a few parts of my project be automated? Together? • Do I repeat this processing/analysis? More than twice? • Even if they are repeated in a different manner, can I capture the variations in logic? • Are there delays due to human involvement? • Is it difficult to redo this on a different dataset or by others? • Are there concerns about reproducibility in my analysis? If you answer Yes to 4 of these 6, build one. My thesis: “Most things can be automated!”
46. Raamana Building pipelines • We usually need to stitch together a diverse array of tools (AFNI, FSL, Python, R etc) to achieve a larger goal (build a pipeline) • They are often written in different programming languages (Matlab, C++, Python, R etc) • Mostly compiled, and no APIs • To reduce your pain, you can use bash or Python to develop a pipeline (a minimal sketch follows below). • If it's neuroimaging-specific, check nipy also • So, learning a bit of bash/Python really helps! • be warned, bash is not super easy, but very helpful for relatively straightforward pipelines! [Callout with tool logos: “no heavy logic? Use: …” / “When in doubt, use: …”]
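A minimal sketch of such a bash driver, assuming one hypothetical job script per stage and a plain-text subjects.txt: submit preprocessing per subject, then one group-level job that waits for all of them.
#!/bin/bash
ids=""
while read -r subject; do
    # submit one preprocessing job per subject and collect its job ID
    jid=$(sbatch --parsable -n 1 --mem=4G -t 2:00:00 preprocess.sh "$subject")
    ids="${ids}:${jid}"
done < subjects.txt
# the group-level step starts only after every per-subject job finished successfully
sbatch --dependency=afterok${ids} -n 4 --mem=16G -t 4:00:00 group_stats.sh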
47. Raamana HPC Skills • Learning Linux goes a long way. • Most HPC clusters are in Linux! • It is reliable and free. • Great to build pipelines • Understanding of scheduling • Command-line skills • batch processing is king!! • human interaction is slow! • Scripting in bash/python • to stitch together routine or repetitive tasks into a pipeline! 47
48. Raamana Thesis/papers over pipeline! • Remember, • you want to do this to solve your own problem(s), and save time now! • and tomorrow, as you reuse it! • as well as for others later on. • However, • no immediate reward • not in academia! • not for pure programming. • If you enjoy it, • no reason not to improve the world!