10 Tips for Recession-Proofing Your Data Platform
The looming recession has forced many businesses to re-evaluate their technology spending, and big data processing is an area that is often overlooked.
Startups and enterprises alike naturally cut back on technology spending when a recession hits, which puts continuous innovation at risk. Yet there is no better time than now to build your organization’s competitive advantage with data. Data technology costs can pile up quickly, though, so here are some tricks, drawn from hands-on experience with big data platforms, for recession-proofing your data tech stack.
Tip #1: Save on data storage
Take advantage of cloud storage features that let you compress files or save them in a more efficient format. This can help you save space and reduce your data costs.
Big data file compression techniques are becoming increasingly popular as data sets continue to grow. There are several methods for compressing big data files, but some of the most popular are gzip, bzip2, and LZMA.
Gzip is a popular, fast, and effective compression method used across a variety of file types; it can reduce the size of a big data file by up to around 70%. File formats such as Parquet and Avro support it as a built-in codec.
Bzip2 is another popular method, often used for large files. It is slower than gzip but can achieve better compression ratios; Avro supports it as a codec.
LZMA is a newer method that offers very high compression ratios, but it is much slower than gzip or bzip2. Avro supports it through its xz codec, and plain-text formats such as CSV can be compressed with LZMA-based tools like xz.
Whichever compression method you choose, it will save storage space and make big data files easier to transfer. And don’t forget to delete old files you no longer need; it is one of the easiest ways to reduce your storage footprint and save money.
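To make this concrete, here is a minimal sketch of writing the same dataset as uncompressed and gzip-compressed Parquet and comparing the file sizes. It assumes pandas and pyarrow are installed; the column names and row count are made up for illustration, and the actual savings depend entirely on how repetitive your data is.

```python
import os
import numpy as np
import pandas as pd

# Generate a toy dataset: ~1M rows of repetitive event data, which compresses well.
df = pd.DataFrame({
    "event_id": np.arange(1_000_000),
    "event_type": np.random.choice(["click", "view", "purchase"], 1_000_000),
    "value": np.random.rand(1_000_000),
})

# Write the same data with and without compression.
df.to_parquet("events_uncompressed.parquet", compression=None)
df.to_parquet("events_gzip.parquet", compression="gzip")

# Compare the on-disk sizes.
for path in ("events_uncompressed.parquet", "events_gzip.parquet"):
    print(path, f"{os.path.getsize(path) / 1_000_000:.1f} MB")
```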
Tip #2: Use free tiers!
There are a ton of great big data technologies out there, and many of them offer free tiers that you can use to get started.
First, let’s take a look at the different types of free tiers that are available. There are two main types:
Free trials — Free trials usually last for a set period of time, after which you’ll need to pay for the service.
Free forever — Free forever plans, on the other hand, are just that — free forever as long as you stay within the free tier limits.
The most important thing to remember when using free tiers is to cancel any services you are not actually using, so you are not inadvertently charged for them. Now that we’ve covered the basics, let’s look at some of the major cloud providers and how to make use of their free tiers.
Amazon Web Services (AWS) offers a ton of great big data services, and many of them have free tiers. The AWS free tier includes 750 hours of service per month, but that is an aggregate across all of your instances in all regions; if you go above 750 hours, you are simply billed for the overage. For example, you can use Amazon EMR (Elastic MapReduce) for free for up to 16 hours. The Amazon EC2 free tier gives you access to a free EC2 instance, and the Amazon S3 free tier gives you a free S3 bucket.
Google Cloud also offers a free tier: new customers get $300 of free credits for 90 days, plus free tier limits for products such as Compute Engine, Cloud Storage, and BigQuery. As long as you stay within those free tier limits, the resources are not charged against your free trial credits, nor to your Cloud Billing account’s payment method after the trial ends.
Azure gives new customers $200 in credits to spend in their first 30 days, plus 12 months of free monthly amounts of popular networking, compute, storage, database, and integration services, as well as AI and machine learning services.
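Whichever provider you use, it pays to watch the bill programmatically instead of waiting for the monthly statement. Below is a rough sketch that reads AWS’s month-to-date estimated charges from CloudWatch via boto3. It assumes credentials are configured and that billing metric publishing is enabled on the account; billing metrics are only published in the us-east-1 region.

```python
from datetime import datetime, timedelta, timezone
import boto3

# Billing metrics are only available in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=86400,                 # one data point per day
    Statistics=["Maximum"],
)

datapoints = resp.get("Datapoints", [])
if datapoints:
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    print(f"Estimated month-to-date charges: ${latest['Maximum']:.2f}")
else:
    print("No billing datapoints yet (is billing metric publishing enabled?)")
```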
Tip #3: Build your own machine
Building a state-of-the-art machine for big data processing and machine learning can pay for itself in about 4 months when compared to a similar cloud-based solution. A local machine can also outperform cloud services on speed, mainly because data transfer stays local.
Gaming PCs: Nowadays, you can easily build a gaming PC with a capable Graphics Processing Unit (GPU) for around $3,000. Compare this to roughly $900 per month, on average, for a sizeable machine learning project that needs comparable GPUs in the cloud (a quick break-even sketch follows below).
AI workstations: If you want a specialized, purpose-built machine for AI development, you might want to consider Exxact, which provides NVIDIA AI workstations.
There are downsides to this approach: many enterprises require AI/ML training to run in a secure environment, so hardware security controls would have to be in place to stay compliant.
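Here is the break-even arithmetic behind the "about 4 months" claim, using the figures above; the electricity cost is a made-up assumption purely for illustration.

```python
# Back-of-the-envelope break-even: one-off workstation vs. recurring cloud spend.
workstation_cost = 3000      # one-off hardware purchase (USD)
cloud_cost_per_month = 900   # comparable cloud GPU spend (USD/month)
power_cost_per_month = 30    # assumed electricity cost of running it (USD/month)

months_to_break_even = workstation_cost / (cloud_cost_per_month - power_cost_per_month)
print(f"Break-even after roughly {months_to_break_even:.1f} months")
# -> roughly 3.4 months, consistent with the ~4 months cited above
```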
Tip #4: Spot instances
Most cloud providers sell their excess compute capacity as “spot instances.” These instances are usually much cheaper and can be launched on demand, but they should only be used for applications that are fault-tolerant and do not require dedicated services running 24/7. Most data processing is done in batches, usually off-peak, so spot instances are a good fit. Just take into account that spot capacity can be reclaimed at any time and is only finitely available; on Google Cloud, for example, Spot VMs are excluded from the Compute Engine Service Level Agreement (SLA) and cannot use live migration.
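As a minimal sketch, here is how a batch worker could be launched on AWS as a Spot instance with boto3. The AMI ID, instance type, and price cap are placeholders, not recommendations, and the job itself must tolerate being interrupted when the capacity is reclaimed.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI with the batch job baked in
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.10",                  # cap the hourly price you are willing to pay
            "SpotInstanceType": "one-time",      # fine for batch: no need to persist the request
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print("Launched:", response["Instances"][0]["InstanceId"])
```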
Tip #5: Write performant code
Nothing is more exciting than a data engineer’s pipelines becoming useful and consumption of their output tables skyrocketing. The celebration might be short-lived, though, once the cloud bill arrives. Professional data engineers know that the way they design data processing code and systems has a direct impact on cloud service consumption: badly written code tends to be slow and hogs compute when executed. When an application runs in a split second locally, it is tempting to stop optimizing, but the same code run thousands of times in the cloud can rack up a substantial hosting bill, so test carefully and keep looking for slow paths and executions. Checking RAM consumption is a good place to start, since memory has a direct impact on cloud costs. If there is no reason to keep data in memory, dump it.
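A small illustration of the memory point, assuming pandas and a hypothetical events.csv with customer_id and amount columns: instead of loading the entire file and keeping intermediate copies in RAM, stream it in chunks and keep only the running aggregate.

```python
import pandas as pd

# Memory-hungry version (keeps the full dataset plus intermediates in RAM):
# df = pd.read_csv("events.csv")
# totals = df.groupby("customer_id")["amount"].sum()

# Leaner version: process the file in chunks and keep only the aggregate.
totals = None
for chunk in pd.read_csv("events.csv", chunksize=500_000):
    partial = chunk.groupby("customer_id")["amount"].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)
    del chunk, partial          # free intermediates as soon as they are no longer needed

print(totals.sort_values(ascending=False).head())
```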
Tip #6: Store data in low resolution
Storing data in low resolution is another way to save money. If you do not need the raw data and aggregates are enough, move the raw data to low-cost storage if you think it might be needed in the future, or drop it altogether if it is not. Using lower-resolution images and videos also helps manage cloud costs; some applications that upload images generate lower-resolution thumbnails for exactly this reason.
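A hedged sketch of the idea for event data: roll raw, per-event records up to hourly aggregates before they hit long-term storage, then archive or drop the raw file once the aggregate is written. The paths and column names here are made up for illustration.

```python
import pandas as pd

# Hypothetical raw extract with event_time, event_type, and value columns.
raw = pd.read_parquet("raw/events_2023-01-01.parquet")
raw["event_time"] = pd.to_datetime(raw["event_time"])

# Aggregate to one row per hour per event type.
hourly = (
    raw.groupby([pd.Grouper(key="event_time", freq="1H"), "event_type"])["value"]
       .agg(["count", "sum", "mean"])
       .reset_index()
)

# The hourly table is typically orders of magnitude smaller than the raw events.
hourly.to_parquet("curated/events_hourly_2023-01-01.parquet", compression="gzip")
```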
Tip #7: Vacuum your data
Removing unnecessary data is also good practice: vacuum your databases regularly and release storage when it is no longer needed. Some developers like keeping data in logs or databases just in case they need it, but this habit can result in a substantial cloud bill. Deleting data that is no longer needed or relevant and cleaning up logs are good habits.
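As a minimal housekeeping sketch, assuming a PostgreSQL database reachable via psycopg2 and a hypothetical app_logs table with a created_at column: trim rows past a retention window, then run VACUUM so the space can actually be reused. The connection string, table name, and 90-day retention period are assumptions, not recommendations.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics user=etl")  # placeholder connection string
conn.autocommit = True   # VACUUM cannot run inside a transaction block
with conn.cursor() as cur:
    # Drop log rows older than the retention window, then reclaim the space.
    cur.execute("DELETE FROM app_logs WHERE created_at < now() - interval '90 days';")
    cur.execute("VACUUM (ANALYZE) app_logs;")
conn.close()
```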
Tip #8: Go Serverless
It is always worth analyzing your application’s usage patterns; you may be surprised how much idle time exists even in heavily used applications. A serverless architecture means data engineers are only billed while the code is actually running, which is especially attractive for intermittent jobs such as machine learning training runs. This approach suits projects in the experimental phase and can always be redesigned later if more scalability is needed.
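A bare-bones sketch of the serverless idea, written as an AWS Lambda handler in Python: the function only runs, and only bills, when an event arrives, for example when a new file lands in S3. The event parsing below assumes an S3 trigger; the processing step itself is left as a placeholder.

```python
import json

def handler(event, context):
    # For an S3-triggered Lambda, each record describes one newly created object.
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # ... run the intermittent batch/ML step against s3://bucket/key here ...
        print(f"Processing s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}
```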
Tip #9: Use open source software
Open source software is particularly well-suited to big data processing, and there is a mature project for almost every layer of the stack. Some of the most widely used are:
Hadoop: Perhaps the most well-known big data solution, Hadoop is an open source framework for processing and analyzing large data sets. It is designed to scale from a single server to clusters of thousands of machines and is used by many large organizations, including Facebook, Yahoo, and eBay.
Apache Spark: Spark is a fast, general-purpose cluster computing system that is often used for big data processing and analysis. It handles a variety of workloads, including machine learning, streaming, and SQL, and includes a general execution engine, a SQL query engine, and a machine learning library. Spark is used by organizations such as Netflix, Uber, and Airbnb; a short PySpark sketch follows this list.
Apache Cassandra: Cassandra is a NoSQL database that is often used for storing large amounts of data. It is highly scalable and can be easily deployed across a cluster of servers. Cassandra is capable of handling multiple concurrent users across instances.
MongoDB: MongoDB is another popular NoSQL database often used for big data applications, known for its ease of use and scalability. It stores data types such as integers, strings, arrays, objects, booleans, and dates, and it can be deployed and partitioned across cloud infrastructure. Its main feature is dynamic schemas, which let users prepare data on the fly; that flexibility has cost-saving benefits.
Neo4j: Neo4j is the world’s leading open source graph database, developed in Java. It is highly scalable, schema-free (NoSQL), and supports ACID transactions, with a flexible data model and real-time, highly available data. Its main strength is the graph model itself: connected, semi-structured data can be queried with a declarative language (Cypher) without the complex joins otherwise needed to retrieve related data.
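Here is the PySpark sketch referenced in the Spark entry above, just to make it concrete: read a Parquet dataset, aggregate it, and write the much smaller result back out. The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

# Placeholder input path; any Parquet dataset with order_ts, amount, customer_id works.
orders = spark.read.parquet("s3://my-bucket/raw/orders/")

daily_revenue = (
    orders.groupBy(F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"),
               F.countDistinct("customer_id").alias("customers"))
)

# Persist only the compact aggregate for downstream consumers.
daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_revenue/")
spark.stop()
```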
Tip #10: Cost efficiency as a KPI
If we make cost efficiency one of the success measures of a team’s performance, we can expect fewer cloud bill surprises. This principle also reinforces best practices and helps your organization flourish in a cloud-based setup. Data engineers can regularly evaluate their data pipelines and delete tables that are no longer being consumed or that only served a tactical purpose.
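One deliberately simple way to make that measurable: scan a query-audit export and flag tables nobody has read recently as candidates for archiving or deletion. The CSV layout below (table_name, queried_at columns) is a made-up assumption; most warehouses expose similar information through their own audit logs or INFORMATION_SCHEMA views, and tables that never appear in the log at all are the strongest candidates.

```python
from datetime import datetime, timedelta
import csv

CUTOFF = datetime.utcnow() - timedelta(days=90)
last_read = {}

# Hypothetical audit export with columns: table_name, queried_at (naive ISO timestamps).
with open("query_audit.csv", newline="") as f:
    for row in csv.DictReader(f):
        name = row["table_name"]
        ts = datetime.fromisoformat(row["queried_at"])
        last_read[name] = max(ts, last_read.get(name, ts))

stale = sorted(name for name, ts in last_read.items() if ts < CUTOFF)
print("Candidates for archiving or deletion:", stale)
```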
Big data should not break the bank if done properly. The most important thing is that if your big data project is clearly adding value and you can quantify its impact on the organization’s bottom line, then the costs should be commensurate with the results. Happy engineering!