Big Data Tech Warsaw 2017 Conference

The international edition of a well-known independent Polish conference on big data technologies. The event is devoted to big data technologies in Poland; there are no marketing or sales presentations, and the programme is built around the needs of the participants – the people responsible for big data projects and technology deployments.

  • Expected number of participants: 300–400 people
  • The conference brings together database administrators, architects and developers, data engineers and analysts, product owners and IT managers – from companies that build and use data-intensive solutions

THE ONLY CONFERENCE IN POLAND FOR PROFESSIONALS WORKING WITH BIG DATA, DEVOTED TO TECHNICAL ASPECTS OF IMPLEMENTING, EXPANDING AND USING BIG DATA SOLUTIONS.

Big Data Tech Warsaw is an exciting one-day conference with purely technical talks in the field of Big Data analysis, scalability, storage and search. In our agenda you will mostly find practical and technical talks given by practitioners working at top data-driven companies who share their tools, models, successes and failures.

We don’t accept marketing and sales presentations and our agenda is not influenced by any single large vendor. The audience – packed full of technical folks! – is our main focus!

The conference has become a source of knowledge about Big Data solutions and their practical use. We gather a carefully selected group of speakers consisting of practitioners – including real pioneers of Big Data projects.

Interest in the two previous editions exceeded all expectations. That is why we continue in that direction and are making Big Data Technology Warsaw Summit a truly European event. Progress in this field is immense, and our conference is aimed mainly at presenting the most current and vital aspects of Big Data technologies.

It will be a real meeting of professionals. We look forward to seeing you there!

Why attend?

  • Many outstanding speakers who work with Big Data at top data-driven companies like Uber, Alibaba, Spotify and Zalando.
  • Four technical tracks that cover the most important and up-to-date aspects of Big Data, including deep learning, real-time stream processing and cloud.
  • Purely technical and independent content – no single large vendor acts as the main sponsor or organizer and selects or curates the talks
  • 400 participants from a number of European countries and companies that use Big Data in production use-cases.
  • Three technical, hands-on workshops the day before the conference
  • The two previous editions of the conference, in 2015 and 2016, were a great success.

Content

There will be a plenary session with the best keynotes and four parallel tracks:

  • Operations & Deployment – dedicated to system administrators and people with DevOps skills who are interested in technologies and best practices for planning, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premise and the cloud.
  • Hadoop Application Development – for developers to learn about tools, techniques and innovative solutions to collect and process large volumes of data. It covers topics like data ingestion, ETL, process scheduling, metadata and schema management, distributed datastores and more.
  • Analytics & Data Science – real case-studies demonstrating how Big Data is used to address a wide range of business problems. You can find here talks about large-scale Machine Learning, A/B tests, visualizing data as well as various analysis that enable making data-driven decisions and feed personalized features of data-driven products.
  • Real-Time Processing – technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.

Networking opportunities

  • Big Data solutions developers and architects
  • Hadoop infrastructure and implementation experts
  • Data scientists and Big Data analysts

The previous two editions of the conference were attended by over 500 people.

They included representatives of various industries, particularly IT, banking, finance, telecommunications, energy and media.

This year's edition will feature roundtable sessions addressing the participants' needs and interests. There will also be an evening meeting for participants.

Big Data Technology Warsaw Summit 2017 – agenda

Conference opening – Przemysław Gamdzyk, CEO & Meeting Designer, Evention; Adam Kawa, Data Engineer and Founder, GetInData

Machine Learning the Product – Boxun Zhang, Data Scientist, Spotify

  • A/B testing is a popular method for learning about your product. However, with traditional A/B testing techniques, we can only learn from an A/B test in a rather superficial way – we can measure the size of an effect but often don't know its cause. In this presentation, I will introduce a different, machine learning-based approach used at Spotify for analyzing A/B tests, aiming to reveal the cause of the effect and maximize learning.
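For illustration only – this is one generic way to go beyond a single aggregate effect size, not necessarily Spotify's method: train a classifier to distinguish the test groups from post-experiment behaviour and inspect which features carry the separation. A minimal sketch, assuming a pandas DataFrame with hypothetical column names:

```python
# Illustrative sketch only -- a generic way to dig into *where* an A/B effect
# shows up, not Spotify's actual method. Column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ab_test_users.csv")            # one row per user in the test
behaviour = ["sessions_per_week", "skips_per_session", "playlist_adds"]

X = df[behaviour]
y = (df["group"] == "B").astype(int)             # treatment vs control label

# If post-experiment behaviour lets a model tell the groups apart better than
# chance, the treatment changed behaviour; feature importances hint at where.
model = RandomForestClassifier(n_estimators=100, random_state=0)
print("group separability (AUC):",
      cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

model.fit(X, y)
for name, importance in zip(behaviour, model.feature_importances_):
    print(name, round(importance, 3))
```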

Simultaneous sessions

Operations & Deployment

This track is dedicated to system administrators and people with DevOps skills who are interested in technologies and best practices for planning, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premises and in the cloud.

Creating Redundancy for Big Hadoop Clusters is Hard – Stuart Pook, Senior DevOps Engineer, Criteo

  • Criteo had a Hadoop cluster with 39 PB of raw storage, 13,404 CPUs, 105 TB of RAM, 40 TB of data imported per day and over 100,000 jobs per day. This cluster was critical for both storage and compute but had no backups. After much effort to increase our redundancy, we now have two clusters that, combined, have more than 2,000 nodes, 130 PB, two different versions of Hadoop and 200,000 jobs per day, but these clusters do not yet provide a redundant solution to all our storage and compute needs. This talk discusses the choices we made and the issues we solved in creating a 1,200-node cluster with new hardware in a new data centre. Some of the challenges involved in running two different clusters in parallel will be presented. We will also analyse what went right (and wrong) in our attempt to achieve redundancy and our plans to improve our capacity to handle the loss of a data centre.

Spotify's Event Delivery – Nelson Arapé, Backend Developer, Spotify

  • Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. Over the last few years we have seen phenomenal growth, which has now pushed our backend infrastructure out of our data centers and into the cloud. Earlier this year we announced that we are transitioning all of our backend into Google Cloud Platform (GCP).
  • Our event delivery system is a key component in our data infrastructure that delivers billions of events per day with predictable latency and a well-defined interface for our developers. This data is used to produce Discover Weekly, Spotify Party, Year in Music and many more Spotify features. This talk will focus on the evolution of the event delivery service, the lessons learned, and the design of our new system based on Google Cloud Platform technologies.

Key challenges in building large distributed full-text search systems based on Apache Solr and Elasticsearch – Tomasz Sobczak, Senior Consultant, Findwise

  • Large distributed search platforms are built on the two most popular search engines: Apache Solr and Elasticsearch. For a long time these two technologies have been able to do much more than full-text search. They are scalable and highly capable NoSQL (document-oriented) databases, able to store massive amounts of data and serve vast numbers of requests, which is why we can discuss Solr and Elasticsearch in the context of big data projects. Let's discuss the challenges connected with indexing and searching data, configuring clusters, and scaling and distributing them between data centers. The presentation will give an overview of available features and issues, but it won't be another comparison of Solr and Elasticsearch. Both technologies are well-proven software, and instead of favoring one of them I would like to present all their possibilities.
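To make the "document-oriented full-text search" part concrete, here is a minimal sketch using the official elasticsearch Python client (elasticsearch-py) against a local single-node cluster, with the index/query calls as they looked in the Elasticsearch 2.x/5.x era; the index name and documents are made up.

```python
# Minimal indexing/search example with the official Python client.
# Assumes a local Elasticsearch node on localhost:9200; index name is made up.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

es.index(index="talks", doc_type="talk", id=1,
         body={"title": "Key challenges in large distributed search systems",
               "speaker": "Tomasz Sobczak"})
es.indices.refresh(index="talks")  # make the document searchable immediately

result = es.search(index="talks",
                   body={"query": {"match": {"title": "distributed search"}}})
print(result["hits"]["total"], "hit(s)")
```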

ING CoreIntel – collect and process network logs across data centers in near real time – Krzysztof Adamski, Solutions Architect (Big Data), ING Services Poland; Krzysztof Żmij, Expert IT / Hadoop, ING Services Poland

Data Application Development

This track is the place for developers to learn about tools, techniques and innovative solutions to collect and process large volumes of data. It covers topics like data ingestion, ETL, process scheduling, metadata and schema management, distributed datastores and more.

DataOps or how I learned to love production – Michael Hausenblas, Developer Advocate, Mesosphere

  • A plethora of data processing tools, most of them open source, is available to us. But who actually runs data pipelines? What about dynamically allocating resources to data pipeline components? In this talk we will discuss options to operate elastic data pipelines with modern, cloud native platforms such as DC/OS with Apache Mesos, Kubernetes and Docker Swarm. We will review good practices, from containerizing workloads to making things resilient and show elastic data pipelines in action.

Bryan Dove, SVP Engineering, Skyscanner

One Jupyter to rule them all – Mariusz Strzelecki, Senior Data Engineer, Allegro Group

  • If you tell your colleagues you develop Hadoop applications, they probably think you are a geek who knows Java, MapReduce, Scala and a lot of APIs for submitting, scheduling and monitoring jobs – and of course is a Kerberos expert. That might have been true a few years ago, but nowadays the Big Data ecosystem contains many tools that make Big Data accessible to everyone, including non-technical people. At Allegro we simplified the way applications that extract value from datasets are created. See how we maintain the full development process from the very first line of code to production deployment, in particular how we: develop and maintain code inside Jupyter using pySpark as the Big Data framework; store the codebase in git repositories and run a code-review process; create and maintain unit tests and integration tests for pySpark applications; and schedule and monitor these processes on a Hadoop cluster – and why using the CLI for Big Data is pretty obsolete.
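One of the points above – unit tests for pySpark applications – needs nothing more than a local SparkSession and pytest. A minimal sketch; the transformation under test and the column names are made up for illustration.

```python
# Minimal pytest-style unit test for a pySpark transformation, run against a
# local SparkSession -- no cluster needed. The transformation itself is made up.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_revenue(df):
    """Example transformation under test: price * quantity per row."""
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()


def test_add_revenue(spark):
    df = spark.createDataFrame([(2.0, 3), (5.0, 1)], ["price", "quantity"])
    result = add_revenue(df).collect()
    assert [row["revenue"] for row in result] == [6.0, 5.0]
```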

Orchestrating Big Data pipelines @ Fandom – Krystian Mistrzak, Data Engineer, Fandom Powered by Wikia; Thejas Murthy, Data Engineer, Fandom Powered by Wikia

  • Fandom is the largest entertainment fan site in the world. With more than 360,000 fan communities and a global audience of over 190 million monthly uniques, we are the fan's voice in entertainment. Being the largest entertainment site, Wikia generates massive volumes of data, varying from clickstream, user activities, API requests, ad delivery and A/B testing to much more. The big challenge is not just the volume but the orchestration involved in combining various sources of data with different periodicities and volumes, and making sure the processed data is available to consumers within the expected time – so the right insights arrive at the right time. A conscious decision was made to choose the right open source tool to solve the orchestration problem; after evaluating various tools we decided to use Apache Airflow. This presentation will give an overview of the tools we compared and emphasize why we chose Airflow, and how Airflow is used to create a stable, reliable orchestration platform that lets non data engineers seamlessly access data by democratizing it. We will focus on some tricks and best practices for developing workflows with Airflow and show how we use some of its features.
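For readers unfamiliar with Airflow, a workflow is declared as a DAG of operators in plain Python. A minimal sketch; the DAG id, task ids and schedule below are made up for illustration.

```python
# Minimal Airflow DAG sketch: two dependent daily tasks. DAG id, task ids and
# schedule are made up for illustration.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="clickstream_daily",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(
    task_id="extract_clickstream",
    bash_command="echo 'pull raw events for {{ ds }}'",
    dag=dag,
)

aggregate = BashOperator(
    task_id="aggregate_sessions",
    bash_command="echo 'build session aggregates for {{ ds }}'",
    dag=dag,
)

extract >> aggregate  # aggregate runs only after extract succeeds
```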

Mark Pybus, Head of Data Engineering, Sky Betting & Gaming

  • Sky Bet is one of the largest UK online bookmakers and introduced a Hadoop platform 4 years ago. This session explains how the platform addresses two common problems in the gambling industry – knowing your current liability position and helping potentially irresponsible gamblers before they identify themselves. These use cases are linked by a common need for data from the same source systems and highlight the different uses of the data that can co-exist on a shared Hadoop cluster. The journey of replacing a traditional data warehouse with the promised land of Hadoop will be explained, without forgetting the mis-turns and slips made along the way – this is no idealistic proof-of-concept talk; real-world implementations are difficult. The journey starts with the first use case, meeting the needs of sportsbook traders to manage liabilities in a competitive and high-frequency environment, and how that led, years later, to completely decommissioning the legacy data warehouse. The platform has evolved to support a Data Science team and the ability to create predictive models that warn of potentially irresponsible gamblers. This more recent use case illustrates a completely different way of using the same data and how the engineering approach accommodates it. There's no code in the talk; the aim is to explain how a real-world system delivered real-world use cases and the teams needed to deliver them.

Analytics & Data Science

This track includes real case studies demonstrating how Big Data is used to address a wide range of business problems. Here you can find talks about large-scale Machine Learning, A/B tests and data visualization, as well as various analyses that enable data-driven decisions and feed personalized features of data-driven products.

Meta-Experimentation at Etsy – Emily Sommer, Software Engineer, Etsy

  • Experimentation abounds, but how do we test our tests? I'll share some ways we at Etsy proved our experimentation methods broken, and the approach we took to fixing them. I'll discuss multiple ways of running A/A tests (as opposed to A/B tests), and a statistical method called bootstrapping, which we used to remedy our experiment analysis.
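Bootstrapping, mentioned above, is easy to sketch: resample the observed data with replacement many times and look at the distribution of the metric difference. A minimal illustration with NumPy, using synthetic data rather than Etsy's.

```python
# Bootstrap sketch for an A/A (or A/B) comparison: resample each group with
# replacement and look at the distribution of the difference in means.
# Data here is synthetic, purely for illustration.
import numpy as np

rng = np.random.RandomState(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=1000)   # e.g. metric per visitor
group_b = rng.normal(loc=10.0, scale=2.0, size=1000)   # A/A: same distribution

diffs = []
for _ in range(10000):
    resample_a = rng.choice(group_a, size=len(group_a), replace=True)
    resample_b = rng.choice(group_b, size=len(group_b), replace=True)
    diffs.append(resample_b.mean() - resample_a.mean())

low, high = np.percentile(diffs, [2.5, 97.5])
print("95% bootstrap interval for the difference:", low, high)
# In an A/A test this interval should contain 0 about 95% of the time.
```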

H2O Deep Water – Making Deep Learning Accessible to Everyone – Jo-fai Chow, Data Scientist, H2O.ai

  • Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models, with or without programming experience, using H2O's R/Python/Flow (web) interfaces.
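For orientation, the basic H2O workflow from Python looks like the sketch below. It uses H2O's standard deep learning estimator as an illustration (Deep Water plugs GPU backends such as TensorFlow or MXNet into the same train/predict workflow); the file name and column names are made up.

```python
# Minimal H2O workflow from Python. Uses the standard H2ODeepLearningEstimator;
# Deep Water adds GPU backends (TensorFlow, MXNet, Caffe) behind the same flow.
# File name and column names are made up.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()                                   # start or connect to a local H2O node
frame = h2o.import_file("churn.csv")         # hypothetical dataset
frame["churned"] = frame["churned"].asfactor()

model = H2ODeepLearningEstimator(hidden=[64, 64], epochs=10)
model.train(x=["tenure", "monthly_spend"], y="churned", training_frame=frame)

print(model.model_performance(frame).auc())  # evaluate on the training frame
predictions = model.predict(frame)           # H2OFrame of class probabilities
```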

Large scale data analytics and Machine Learning at Google – Jarosław Kuśmierek, Engineering Manager: Cloud Platform & Warsaw Eng Site Lead, Google

  • Google has been tackling large-scale big data problems for more than 15 years. Experience with the systems we built to do so has led us to develop a new set of tools for large-scale analysis and queries, including streaming. More recent research has also allowed us to build a number of Machine Learning tools, helping us find patterns and answers to questions we couldn't otherwise answer. Over the last few years we have focused on exposing those tools to external users – to address growing needs in the industry. The presentation will give an overview of some of the systems we've built and examples of how they can be applied to solve big problems.

Big data in genomics – Marek Wiewiórka, Solution Architect, GetInData

  • Genomic population studies have the storage, analysis and interpretation of various kinds of genomic variants as their central issue. When the exomes and genomes of thousands of patients are being sequenced, there is a growing need for efficient database storage systems, querying engines and powerful tools for statistical analyses. Scalable big data solutions such as Apache Impala, Apache Kudu, Apache Phoenix or Apache Kylin can address many of the challenges in large-scale genomic analyses. The presentation will cover some of the lessons learned from a project aiming to create a data warehousing solution for storing and analyzing genomic variant information at the Department of Medical Genetics, Warsaw Medical University. An overview of existing big data projects for analyzing data from next-generation sequencing will be given as well. The presentation will conclude with a brief summary and a discussion of future directions.

Real-Time Processing

This track covers technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.

Real-Time Data Processing at RTB House – Architecture & Lessons Learned – Bartosz Łoś, Software Developer, RTB House

  • Our platform, which purchases and runs advertisements in the Real-Time Bidding model, processes 250K bid requests and generates 20K events every second, which gives 3 TB of data every day. For machine learning, system monitoring and financial settlements we need to filter, store, aggregate and join these events together. As a result, processed events and aggregated statistics are available in Hadoop, Google BigQuery and Postgres. The most demanding business requirements are: events that should be joined together can appear up to 30 days apart, we are not allowed to create any duplicates, we have to minimize possible data losses, and there must be no differences between generated data outputs. We have designed and implemented a solution which has reduced the delay in availability of this data from 1 day to 15 seconds.
  • We will present: 1. our first approach to the problem (end-of-day batch jobs) and the final solution (real-time stream processing), 2. a detailed description of the current architecture, 3. how we tested the new data flow before it was deployed and how it is monitored now, 4. our one-click deployment process, 5. the decisions we made, with their advantages and disadvantages, and our future plans to improve the current solution.
  • We would like to share our experience with scaling the solution over clusters of computers in several data centers. We will focus on the current architecture, but also on testing and monitoring issues and our deployment process. Finally, we would like to provide an overview of the projects involved, such as Kafka, MirrorMaker, Storm, Aerospike, Flume and Docker. We will describe what we have achieved with this open source software and some problems we have come across.
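To make the "no duplicates, long join window" requirement above concrete, here is a deliberately simplified, single-process sketch of keyed deduplication with a retention window. This is only an illustration of the requirement, not RTB House's implementation, which does this in a distributed fashion on top of Kafka, Storm and a persistent store such as Aerospike.

```python
# Deliberately simplified, single-process illustration of the requirement
# described above: drop duplicate events while keeping ids long enough
# (e.g. 30 days) to recognise late-arriving duplicates. Not RTB House's code.
import time

RETENTION_SECONDS = 30 * 24 * 3600          # keep ids for 30 days
seen = {}                                   # event_id -> first-seen timestamp


def accept(event_id, now=None):
    """Return True the first time an event id is seen within the window."""
    now = now or time.time()
    # Evict ids older than the retention window (linear scan for clarity only).
    expired = [eid for eid, ts in seen.items() if now - ts > RETENTION_SECONDS]
    for eid in expired:
        del seen[eid]
    if event_id in seen:
        return False                        # duplicate -> drop
    seen[event_id] = now
    return True


assert accept("bid-123") is True
assert accept("bid-123") is False           # the duplicate is rejected
```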

Use-cases where Flink is better than technologies like Hive, Spark, Spark Streaming and why – Adam Kawa, Data Engineer and Founder, GetInData

  • While there are many popular open-source technologies for processing large datasets, Apache Flink is the one that excites me the most. Not because it provides sub-second latency at scale, exactly-once semantics or a single solution for batch and stream processing, but because … it lets you accurately process your data with little effort – something that is hard or usually ignored with Spark, Storm, Hive or Scalding. In this talk I will explain unique capabilities, ideas and design patterns of Flink & Kafka for accurate and simplified stream processing in batch and real-time.

Stream Analytics with SQL on Apache Flink – Fabian Hueske, Software Engineer, data Artisans

  • SQL is undoubtedly the most widely used language for data analytics, for many good reasons. It is declarative, many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and last but not least it is the standard that everybody knows and uses. With stream processing technology becoming mainstream, a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap. One approach is to support standard SQL, which is known by users and tools but comes at the cost of cumbersome workarounds for many common streaming computations. Other approaches are to design custom SQL-inspired stream analytics languages or to extend SQL with streaming-specific keywords. While such solutions tend to result in more intuitive syntax, they suffer from not being established standards and thereby exclude many users and tools.
  • Apache Flink is a distributed stream processing system with very good support for streaming analytics. Flink features two relational APIs, the Table API and SQL. The Table API is a language-integrated relational API with stream-specific features. Flink’s SQL interface implements the plain SQL standard. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite.
  • In this talk we present the future of Apache Flink’s relational APIs for stream analytics, discuss their conceptual model, and showcase their usage. The central concept of these APIs is dynamic tables. We explain how streams are converted into dynamic tables and vice versa without losing information, due to the stream-table duality. Relational queries on dynamic tables behave similarly to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams or are written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and are updated in place with low latency. We conclude our talk by demonstrating the power and expressiveness of Flink’s relational APIs, presenting how common stream analytics use cases can be realized.
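The dynamic-table idea can be illustrated outside Flink in a few lines: a changelog stream of upserts and deletes, applied in order, materializes into a table, and every table change can be emitted back as a changelog. A conceptual sketch in Python, not Flink's actual API:

```python
# Conceptual illustration of the stream-table duality described above,
# independent of Flink's actual API: applying a changelog stream of upserts
# and deletes materializes a table; table changes can be re-emitted as a stream.
changelog = [
    ("upsert", "user_1", {"clicks": 1}),
    ("upsert", "user_2", {"clicks": 5}),
    ("upsert", "user_1", {"clicks": 2}),   # update overwrites the previous row
    ("delete", "user_2", None),
]

table = {}            # the "dynamic table": current state per key
emitted = []          # the derived changelog other systems could consume

for op, key, row in changelog:
    if op == "upsert":
        table[key] = row
    elif op == "delete":
        table.pop(key, None)
    emitted.append((op, key, row))

print(table)    # {'user_1': {'clicks': 2}}  -- the materialized view
print(emitted)  # the same changes, ready to write to e.g. Kafka or Cassandra
```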

Blink – Alibaba's Improvements to Flink – Xiaowei Jiang, Director of Search Division, Alibaba

  • A large portion of transactions on Alibaba’s e-commerce Taobao platform is initiated through its Alibaba Search engine. Real-time stream processing is one of the cornerstones of Alibaba’s search infrastructure. Among all the streaming solutions, Flink is the closest to meeting our requirements. In this talk, we present the design and implementation of Blink, an improved runtime engine for Flink that is better integrated with YARN. It also addresses various scale and reliability issues we encountered in production. Since the changes are at the runtime layer, Blink is fully compatible with the Flink API and its machine learning libraries. We will also share our experience running Blink in production on a Hadoop cluster of more than one thousand servers in Alibaba Search. We are actively working with the community to contribute the changes back to Apache Flink.

RealTime AdTech reporting & targeting with Apache Apex – Ashish Tadose, Senior Data Architect, PubMatic

  • AdTech companies need to cope with data growing at breakneck speed along with customer demands for insights and analytical reports. At PubMatic we receive billions of events and several TBs of data per day from various geographic regions. This high-volume data needs to be processed in real time to derive actionable insights such as campaign decisions and audience targeting, and to provide a feedback loop to the AdServer for making efficient ad serving decisions. In this talk we will share how we designed and implemented these scalable, low-latency, real-time data processing solutions for our use cases using Apache Apex.

Hopsworks: Secure Streaming-as-a-Service with Kafka/Flink/Spark – Jim Dowling, Associate Prof, KTH Royal Institute of Technology

  • Since June 2016, Kafka, Spark and Flink-as-a-service have been available to researchers and companies in Sweden from the Swedish ICT SICS Data Center at www.hops.site using the HopsWorks platform (www.hops.io). Flink and Spark applications are run within a project on a YARN cluster with the novel property that applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics; that is, Kafka topics are protected from access by users who are not members of the project. In this talk we will discuss the challenges in building multi-tenant streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running streaming applications, how we use Grafana and Graphite for monitoring streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Oct 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications. Hopsworks is entirely UI-driven, with an Apache v2 open source license.

Big Data Technology Warsaw Summit 2017 – speakers

  • Nelson Arapé, Backend Developer, Spotify
  • Jo-fai Chow, Data Scientist, H2O.ai
  • Bryan Dove, SVP Engineering, Skyscanner
  • Xiaowei Jiang, Director of Search Division, Alibaba
  • Emily Sommer, Software Engineer, Etsy
  • Boxun Zhang, Data Scientist, Spotify

Sheraton Warsaw Hotel, B. Prusa 2, Warszawa
