Scale & optimize your data science investment through cloud migration

Authors: Anat Fraenkel, Ryan Gross, Josh Bae, and Jack Heath | Credera

The rise of data science and artificial intelligence (AI) has transformed the way organizations conduct business, drive innovation, and create unparalleled value. Research predicts that AI will generate $15 trillion in value across the global economy by 2030. This staggering potential has led organizations to invest heavily in data science, weaving it into the very fabric of their business operations.

ORGANIZATIONS STRUGGLE TO REALIZE VALUE

Initial investments in data science often focus on hiring top-notch data scientists who bring statistical rigor to the table. The team often sits within the business, delivering significant bottom-line results with its first few high-value predictive models.

As these teams grow, experts are increasingly bolstered by a growing crop of young data science graduates, as evidenced by the rapid expansion of university programs specializing in the field. However, these data science teams often outgrow the organization’s data engineering investments, building their own infrastructure to continue delivering value. This leads to outdated on-premises or infrastructure-as-a-service (IaaS) environments and “wild-west” development practices that create maintainability and scalability bottlenecks. As a result, despite the influx of data science expertise, many organizations struggle to realize the scale of value they expected.

TEAMS FAIL TO SCALE AND OPTIMIZE DATA SCIENCE PROCESSES

While data scientists excel in math and statistics, few have software and systems engineering backgrounds. This disconnect leads to challenges in scaling and optimizing data science processes. Inconsistencies in coding style and methodological approach lead to inefficient development and slow production timelines. For example, migrating to an Apache Spark-based platform can significantly improve scalability, but many data science teams lack the experience or know-how to make this transition. Additionally, rigorous automated unit testing, a cornerstone of modern software development, is often missing from data science workflows.

Organizations must optimize their data science investments to continue scaling

To address these challenges and continue scaling, organizations must optimize their data science investments. Through our client experience, we see three elements that drive value for organizations.

1. Restructure teams

One key aspect of this optimization involves restructuring teams. Data science teams often start within individual business units, later centralize into a center of excellence, and then decentralize back into the business as the operating model matures. Striking the right balance between centralization and decentralization can help organizations maintain agility while ensuring a consistent and scalable approach to data science.

2. Unify data engineering and machine learning with data science teams

Another essential component of optimization is the unification of data engineering and machine learning development with data science teams. By fostering close collaboration between these groups, organizations can ensure that machine learning (ML) models are production-ready from the outset, reducing time to market and improving overall efficiency.

3. Migrate to a cloud-based platform

Finally, the scalability of cloud-based platforms is crucial for handling the massive data volumes needed to run transformative workloads. By migrating to a cloud-based platform, organizations can leverage the inherent scalability and flexibility of the cloud to handle increasing data volumes and workloads, driving innovation and business value. This migration is typically foundational to other organizational and process-based improvements, and this article will focus on the mindset and approach to succeed in a migration.

Migration Mindset

Migrating data science workloads to the cloud is quite distinct from migrating applications. Even for organizations with experience migrating data warehouse workloads, data science workloads introduce additional complexity in the following four ways.

  1. Infrastructure complexity: Data science teams typically enjoy more freedom and access to specialized hardware compared to application teams. This enables them to explore, experiment, and iterate on their models and algorithms more rapidly. Despite these differences, data science environments are generally more homogeneous than enterprise application portfolios. There are usually only a few profiles of development environments used by data science teams, making it easier to standardize and streamline the migration process.

  2. Diverse technology stacks: Data science workloads often involve a wide range of tools and frameworks, such as Jupyter Notebooks, TensorFlow, PyTorch, and scikit-learn. The semantics of DataFrame libraries like pandas or R data frames also make code translation more complex than translating SQL. These diverse technology stacks increase the complexity of migration, as each component may require different configurations or adjustments to work seamlessly on the cloud platform.

  3. Code optimization: Data science code may not be optimized for performance or scalability due to the focus on experimentation and rapid iteration during model development. Migrating to a cloud platform may necessitate re-architecting or refactoring parts of the code to ensure efficient resource utilization and full use of the cloud’s capabilities.

  4. Reproducibility: Data science workloads often involve complex machine learning models and algorithms, making it crucial to maintain reproducibility during migration. Validation can be challenging because even slight changes in data, code, or environment can lead to vastly different inference results, making it difficult to confirm that migrated workloads perform as expected (a minimal reproducibility sketch follows this list).
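
As an illustration of what reproducibility bookkeeping can look like in practice, here is a minimal Python sketch. It is not from any particular migration playbook; the function names, package list, and sample data are our own placeholders, and real validation would also capture model artifacts and full data snapshots.

  import hashlib
  import importlib.metadata
  import random

  import numpy as np
  import pandas as pd

  # Pin every source of randomness so the same code and data produce the same
  # results before and after migration (hypothetical seed value).
  SEED = 42
  random.seed(SEED)
  np.random.seed(SEED)

  def environment_fingerprint(packages):
      """Record the library versions a model run depends on."""
      return {pkg: importlib.metadata.version(pkg) for pkg in packages}

  def data_fingerprint(df):
      """Hash the input data so both environments can confirm they scored the same snapshot."""
      row_hashes = pd.util.hash_pandas_object(df, index=True).values
      return hashlib.sha256(row_hashes.tobytes()).hexdigest()

  # Capture a small manifest alongside each training or inference run.
  df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [0, 1, 0]})
  run_manifest = {
      "seed": SEED,
      "environment": environment_fingerprint(["numpy", "pandas"]),
      "data_sha256": data_fingerprint(df),
  }
  print(run_manifest)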

Despite these differences, there are some core approaches developed for cloud application and warehouse migrations that can be adapted and applied to data science migrations for a smoother and more efficient transition. For example, teams can adapt the 6Rs migration framework to suit the specific needs of data science products. This framework provides a structured way to assess and prioritize workloads for migration, ensuring a systematic and efficient transition to the cloud.

Additionally, leveraging automated data migration and automated code transformation tools can help data science teams migrate their workloads more quickly and seamlessly. By automating the transformation of code to cloud-native frameworks like Apache Spark, teams can significantly reduce manual effort and ensure optimal performance on the cloud platform.
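
To make the target of such a transformation concrete, the sketch below shows a small pandas aggregation alongside an equivalent PySpark version. The column names and logic are hypothetical placeholders, and automated translation tools handle far more than this single pattern.

  # Before: a typical pandas aggregation from a legacy notebook (hypothetical columns).
  import pandas as pd

  def daily_usage_pandas(df: pd.DataFrame) -> pd.DataFrame:
      positive = df[df["usage_kwh"] > 0]
      return positive.groupby(["customer_id", "usage_date"], as_index=False)["usage_kwh"].sum()

  # After: the same logic against a Spark DataFrame, so it scales out on the
  # cloud platform instead of a single machine.
  from pyspark.sql import DataFrame, SparkSession
  import pyspark.sql.functions as F

  def daily_usage_spark(df: DataFrame) -> DataFrame:
      return (
          df.filter(F.col("usage_kwh") > 0)
          .groupBy("customer_id", "usage_date")
          .agg(F.sum("usage_kwh").alias("usage_kwh"))
      )

  if __name__ == "__main__":
      spark = SparkSession.builder.appName("migration-sketch").getOrCreate()
      readings = spark.createDataFrame(
          [("c1", "2024-01-01", 3.2), ("c1", "2024-01-01", 1.1), ("c2", "2024-01-01", 0.0)],
          ["customer_id", "usage_date", "usage_kwh"],
      )
      daily_usage_spark(readings).show()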

There are six ways to migrate a data product

The 6Rs approach, popularized by AWS, is a framework that helps organizations plan and execute migrations of applications or data products to the cloud or modernized platforms. The 6Rs stand for Rehost, Replatform, Refactor, Rebuild/Repurchase, Retire, and Retain. Here’s an explanation of each approach in the context of a data science team:

1. Rehost: Also known as “lift-and-shift,” this approach involves moving your ML models, notebooks, and data pipelines as-is from the current environment to the new platform. Minimal changes are made to the codebase, which might involve adjusting configurations or dependencies to ensure compatibility. This is the quickest way to migrate but may not take full advantage of the features or optimizations available in the new environment.

2. Replatform: This is a “lift-tinker-and-shift” approach that involves making slight modifications to the existing ML models or data pipelines to optimize them for the new platform. Examples include adjusting the code to use a different database, changing the data storage format, or modifying the model training process to leverage the new platform’s capabilities (a small storage-format sketch follows this list). Replatforming typically results in improved performance, scalability, and maintainability without a complete overhaul of the codebase.

3. Refactor: In this approach, the data science team re-architects or rewrites the ML models, notebooks, and data pipelines to take full advantage of the new platform’s features and capabilities. This might involve adopting new programming paradigms, utilizing different ML frameworks, or redesigning data pipelines to leverage cloud-native services. Refactoring requires significant effort but can result in substantial improvements in performance, maintainability, and scalability.

4. Rebuild / Repurchase: This approach involves starting over with the same goal but an entirely new implementation. Rebuilding means leveraging the full feature set of the cloud platform to build an ML model that significantly outperforms the old model. A special case of a rebuild is a “repurchase”, which involves replacing the current ML models or data pipelines with commercially available off-the-shelf solutions on the new platform.

This approach might be suitable for a data science team if their existing models or pipelines have become too complex or difficult to maintain, or if better alternatives exist on the market. Repurchasing requires careful evaluation of the available options and potential trade-offs in functionality and customization.

5. Retire: The retire approach involves identifying and decommissioning ML models, notebooks, or data pipelines that are no longer needed, have been replaced, or provide little value. Retiring these assets can help simplify the migration process, reduce maintenance overhead, and focus resources on higher-impact initiatives.

6. Retain: Retaining involves keeping the existing ML models, notebooks, or data pipelines in their current environment, either because they are still functional or due to other constraints such as regulatory requirements or lack of resources. In this case, the data science team may choose to revisit the migration decision later or maintain a hybrid approach, where some assets are migrated and others are retained in the current environment.

By applying the 6Rs framework to their migration strategy, data science teams can determine the most suitable approach for each ML model or data pipeline, ensuring a smooth transition to the new platform with minimal disruption.
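
As a concrete, deliberately tiny example of the replatform approach described above, the sketch below keeps the pandas code intact but moves intermediate data from local CSV files to Parquet on object storage. The bucket, paths, and columns are placeholders, and reading Parquet from S3 with pandas assumes pyarrow and s3fs are installed.

  import pandas as pd

  # Before (as-is): intermediate features live as CSV on a shared local file system.
  def load_features_csv(path="/mnt/shared/features.csv"):
      return pd.read_csv(path)

  # After (replatform): the same pandas code, but intermediate data moves to
  # columnar Parquet on object storage, which cloud engines read natively.
  PARQUET_PATH = "s3://example-bucket/features/features.parquet"

  def save_features_parquet(df, path=PARQUET_PATH):
      df.to_parquet(path, index=False)

  def load_features_parquet(path=PARQUET_PATH):
      return pd.read_parquet(path)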

Our recommended approach for data science migrations

We recommend a three-step roadmap for accelerating data science migration: Assess and Plan, Mobilize, and Migrate.

PHASE 1 ASSESS AND PLAN: CATALOG AND TRIAGE ML MODELS AND OTHER DATA PRODUCT PIPELINES

The planning phase is crucial to the success of any data science migration. A successful migration begins by cataloging and triaging all data product pipelines, such as those dedicated to ML models, dashboards, tables, and A/B test support. Using guidelines from AWS, teams define the 6Rs approaches and create a decision tree to help triage data products into a migration approach. They then develop an initial playbook for migration execution, including accelerators such as automated code translators and data migration tools, and create a regression testing and validation plan for migrations, incorporating additional automation where necessary.
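
The decision tree will be specific to each organization’s portfolio; purely as an illustration, a simplified triage rule set might look like the following sketch, where the attributes and their ordering are hypothetical.

  from dataclasses import dataclass

  # A deliberately simplified rule set; real decision trees weigh many more factors
  # (regulatory constraints, data gravity, team capacity, roadmap timing).
  @dataclass
  class DataProduct:
      name: str
      still_used: bool
      must_stay_on_prem: bool
      off_the_shelf_alternative: bool
      needs_rearchitecture: bool
      cloud_ready: bool

  def triage(product: DataProduct) -> str:
      if not product.still_used:
          return "Retire"
      if product.must_stay_on_prem:
          return "Retain"
      if product.off_the_shelf_alternative:
          return "Rebuild/Repurchase"
      if product.needs_rearchitecture:
          return "Refactor"
      if product.cloud_ready:
          return "Rehost"
      return "Replatform"

  churn_model = DataProduct("churn_model", True, False, False, True, False)
  print(triage(churn_model))  # -> Refactor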

PHASE 2 MOBILIZE: EXECUTE PILOT MIGRATIONS AND TRAIN DATA SCIENTISTS

During the mobilization phase, the migration team selects a sample of migrations using a framework that maximizes coverage across the 6Rs migration categories, model/pipeline complexity, and business stakeholders and teams. The migration team then:

  • Executes pilot migrations and updates the playbook details.

  • Engages expertise from their organization or external partners and leverages new platform features to build reference implementations.

  • Implements a validation plan to ensure the quality of data and models is maintained and defines the future-state code repository structure.

  • Documents lessons learned in the migration playbook as best practices for future migrations to follow.

  • Implements new features on the future-state platform, such as experiment tracking, a model registry, continuous integration and continuous delivery (CI/CD) deployment, and end-to-end data pipeline observability (a minimal experiment-tracking sketch follows this list).
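
As one example of what the experiment-tracking and model-registry piece can look like, here is a minimal MLflow sketch. MLflow is only one of several options (it is also the tool used in the case study below); the experiment name, model name, and toy dataset are placeholders, and registering a model assumes a tracking server with a model registry backend is configured.

  import mlflow
  import mlflow.sklearn
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split

  mlflow.set_experiment("churn-model-migration")  # placeholder experiment name

  X, y = make_classification(n_samples=500, n_features=10, random_state=42)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

  with mlflow.start_run(run_name="baseline-rf"):
      params = {"n_estimators": 100, "max_depth": 5}
      model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

      mlflow.log_params(params)
      mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

      # Registering the model makes it available to downstream CI/CD deployment jobs.
      mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")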

Training data scientists on the new approach is essential for successful migration. Pair them with experienced practitioners and supplement that pairing with formal training on the new ways of working and on utilizing the features of the new platform.

PHASE 3 MIGRATE: DISTRIBUTE WORK TO EXECUTE MIGRATIONS AND RUN MIGRATED MODELS IN PARALLEL

In the final migration phase, the migration team often expands, using the information learned during the pilot migrations to update the playbook and the 6Rs triage. The expanded team leverages the playbook developed during the mobilize phase as a guide for best practices. By this time, the patterns should be well known, so the migration program can distribute work across the platform, data engineering, and data science teams. Each migration should leverage frequent, small commits to break refactoring into manageable chunks, and use the pull request review process to ensure that proper automated testing is in place to validate results.

Finally, the validation step involves running migrated models in parallel with the originals for at least two inference cycles (e.g., if a model runs monthly, run it in parallel for two months) to ensure a seamless transition and maintain quality.
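
A parallel run ultimately comes down to joining the two sets of outputs and flagging divergence beyond an agreed tolerance. The sketch below shows one way to do that in pandas; the key, score column, and tolerance are placeholders to be replaced with each model’s actual output contract.

  import pandas as pd

  def compare_parallel_runs(legacy, migrated, key="customer_id", value="score", tolerance=1e-6):
      """Join one inference cycle of legacy and migrated outputs and return any rows
      whose scores diverge by more than the agreed tolerance."""
      merged = legacy.merge(migrated, on=key, suffixes=("_legacy", "_migrated"), validate="one_to_one")
      merged["abs_diff"] = (merged[f"{value}_legacy"] - merged[f"{value}_migrated"]).abs()
      return merged[merged["abs_diff"] > tolerance]

  legacy_scores = pd.DataFrame({"customer_id": [1, 2, 3], "score": [0.10, 0.52, 0.91]})
  migrated_scores = pd.DataFrame({"customer_id": [1, 2, 3], "score": [0.10, 0.52, 0.93]})
  print(compare_parallel_runs(legacy_scores, migrated_scores))  # rows that need investigation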

Scaling your investment

By following this three-step roadmap, businesses can accelerate their data science migration, reducing the time and complexity involved in the process. At Credera, we’ve seen this approach deliver significant value for organizations, including the major energy provider described in the case study below. We’d love to start a conversation about what is next for your organization’s data investment. Reach out to us at [email protected].

Case Study: A Large Retail Energy Provider Scales & Accelerates Data Science

In a rapidly evolving energy landscape, a major energy provider embarked on a transformative journey to scale and accelerate their data science capabilities and move data teams across the enterprise into a hub-and-spoke model. Through a collaborative partnership, Credera successfully delivered a future-state operating model that enables unified ways of working and consolidated technology platforms across a team of more than 30 data scientists, supporting over 60 data products that power the company’s core processes.

CHALLENGE

The client faced several significant challenges in their data-driven transformation. Their data teams worked on vastly different platforms in silos, hindering effective collaboration and knowledge sharing. While the client had established an enterprise data and analytics platform and team, they had yet to consolidate data platforms across business units. Their largest data science team had built up their own data platform based on Kubernetes, with each model running in its own Docker container and connecting to an on-premises data warehouse for data access and storage. The data science team relied on tribal knowledge silos for how each model operated. In addition, intermediate data files were frequently stored as Python pickle objects on a shared file system, with others stored in various formats in an S3 bucket, adding further complexity.

SOLUTION

Our solution began by defining new ways of working across hub-and-spoke teams and building a compelling business case for migration. To address the gaps in data ingestion, observability, and machine learning operations (MLOps) in the core data platform, we recommended and implemented a comprehensive approach that streamlined these processes. Data ingestion leveraged Qlik Replicate to pull raw data from Oracle databases into an S3 data lake. The data transformation layer was based on Databricks on AWS, with data observability provided through Monte Carlo, experiment tracking implemented with MLflow, and a model monitoring reference implementation using AWS SageMaker. These new features provide end-to-end visibility into the ML model training and inference pipelines, allowing the team to detect issues as they occur.

Utilizing the 6Rs approach, we triaged data products and created a playbook for each type of migration, ensuring smooth transitions. To provide real-world insight for the overall migration plan, we transitioned 11 workloads from the legacy Kubernetes environment as reference implementations, paving the way for the full migration. These pilots demonstrated how to leverage a medallion architecture with structured data storage in Delta Lake on S3, a model and experiment registry to track ML training, data observability instrumentation for pipelines, and SageMaker Model Monitor to test and track ML results.
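
For context on what the medallion pattern looks like in code, here is a minimal bronze-to-silver sketch in PySpark with Delta Lake. The paths, column names, and cleansing rules are placeholders rather than the client’s actual pipeline.

  from pyspark.sql import SparkSession
  import pyspark.sql.functions as F

  # Paths and column names are placeholders; on Databricks, Delta is the default table format.
  BRONZE_PATH = "s3://example-lake/bronze/meter_readings"
  SILVER_PATH = "s3://example-lake/silver/meter_readings"

  spark = SparkSession.builder.getOrCreate()

  # Bronze: raw records landed by the ingestion tool, stored as-is in Delta.
  bronze = spark.read.format("delta").load(BRONZE_PATH)

  # Silver: cleansed, de-duplicated records that downstream ML features build on.
  silver = (
      bronze.dropDuplicates(["meter_id", "reading_ts"])
      .filter(F.col("usage_kwh").isNotNull())
      .withColumn("reading_date", F.to_date("reading_ts"))
  )

  silver.write.format("delta").mode("overwrite").partitionBy("reading_date").save(SILVER_PATH)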

RESULTS

This collaborative effort resulted in several key achievements. First, it gained buy-in for an expedited migration from a traditionally siloed (and comfortably so) business unit, demonstrating the value of the new approach. Furthermore, the new data science platform offers much-improved reliability for the end-to-end data products that support the business.

“The Credera team’s technical expertise, deep understanding of our business, well-thought-out methodology, and focus on change management were all critical to this project’s success. We ultimately achieved amazing breakthroughs in our technical infrastructure and business outcomes thanks to the team’s tremendous accomplishments. They were just what we needed to overcome the considerable hurdles that we faced.” – Senior Director Data Governance and Engineering at NRG

The client’s journey highlights the power of collaboration and a well-designed roadmap in driving data science transformation. By addressing the challenges head-on and crafting tailored solutions, we enabled the client to scale and accelerate their data science capabilities, laying the foundation for a data-driven future. Our approach offers a compelling case for other businesses looking to unlock the full potential of their data and stay competitive in a rapidly evolving industry.