- In these posts I have selected and compiled some Best Practices and Tips for Ruby users that I have learned up till now. They are the result of a year and a half of experience working on a Business Intelligence and Big Data project.
I believe the development would have been more effective (faster and with higher quality) if I had followed these tips from the beginning.
There are 30 tips. To keep the post from getting too long, I split the list into two parts.
This is the first part, focused on the optimization of tests (specs) and database calls (queries).
In the second part, I talk about data treatment and consistency, plus tips and tricks for dealing with import and export (CSV) files.
Testing (specs)
1. Use FactoryGirl
describe Car do
  subject(:car) { FactoryGirl.build(:car) }
end
FactoryGirl has an excellent syntax and lets you write cleaner, more readable tests. In addition, it cuts down the time it takes to write specs, centralizes your modeling code, is flexible, and enables advanced customization. In other words, it is a great help when optimizing build time and dealing with more complex models.
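To illustrate, a minimal sketch of a factory for the Car model (the attributes and their defaults here are hypothetical):

FactoryGirl.define do
  factory :car do
    color { :black } # default values, overridable per test
    year  { 2017 }
    fuel  { 50 }
  end
end

Any attribute can then be overridden at build time, e.g. FactoryGirl.build(:car, color: :red).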
2. Use build, do not create
FactoryGirl.build(:car)  # [Fast] Builds the object in memory
FactoryGirl.create(:car) # [Slow] Saves the record in the database and runs all validations and callbacks (e.g. after_create)
It is good practice to use the same type of database in production and test environments. This avoids surprises in production, even if the tests are passing.
One of the consequences of this, especially when using ActiveRecord, is that the specs become slower because of too many database calls in the tests.
To reduce this impact, use build whenever you can.
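For example, a validation spec runs entirely in memory, so build is enough; create only pays off when the test really depends on persisted data:

describe Car do
  subject(:car) { FactoryGirl.build(:car) }

  # No database call is needed to exercise the validations
  it { is_expected.to be_valid }
end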
3. Start with the exception
describe Car do
  subject(:car) { FactoryGirl.build(:car, color: color) }

  context 'when no color is given' do
    let(:color) { nil }

    it { is_expected.not_to be_valid }
  end
end
Many times we test only the happy, error-free path. The problem is that we leave the exceptions for the end, as an afterthought, without considering all the test scenarios.
My tip is to start with the exceptions, thinking first about the most unlikely paths.
4. Describe the behavior
describe Car do
  subject(:car) { FactoryGirl.build(:car, fuel: fuel) }

  describe '#drive' do
    subject(:drive) { car.drive_for(distance) }

    context 'when driving a positive distance' do
      let(:distance) { 100 }

      context 'and there is not enough fuel' do
        let(:fuel) { 10 }

        it 'drives less than the wanted distance' do
          drive
          expect(car.walked_distance).to be < distance
        end

        it 'consumes all fuel' do
          drive
          expect(car.fuel).to be 0
        end
      end
    end
  end
end
Ideally, to understand how a model works, you should only need to read the descriptions of its specs. But it is not always so. It is quite common to see tests that do not describe the exact behavior of a model.
In the example above, everything is clearly described. I know that when I ask the car to drive a certain distance and it does not have enough fuel, it does not travel the whole way: it stops halfway, when the fuel runs out.
5. Test the functionality, not the implementation
def drive_for(distance)
  while fuel > 0 && distance > 0
    self.fuel.subtract(1)
    self.walked_distance += distance_per_liter
    distance -= distance_per_liter
  end
end

drive
expect(car.fuel).to eq 2 # [Good] Tests the functionality
expect(fuel).to receive(:subtract).with(1).exactly(5).times # [Bad] Tests the implementation
If I change the logic of the drive_for method to the one below, the first test keeps passing while the second one fails, even though the logic is correct.
def drive_for(distance)
  needed_fuel = distance.to_f / distance_per_liter
  spent_fuel = [self.fuel, needed_fuel].min
  self.fuel.subtract(spent_fuel)
  self.walked_distance += spent_fuel * distance_per_liter
end
It is very rare for anyone to worry about testing the functionality when creating tests, or to check for this in code review. Having to rewrite the tests every time you refactor or change the implementation is a waste of time.
6. rspec --profile
Top 20 slowest examples (8.79 seconds, 48.1% of total time):
  Lead stubed notify_lead_update #as_indexed_json #mailing events has mailing_events
    1.32 seconds ./spec/models/lead_spec.rb:209
  Lead stubed notify_lead_update .tags #untag_me untags the leads
    0.80171 seconds ./spec/models/lead_spec.rb:545
  Lead stubed notify_lead_update .tags #tag_me tags the leads
    0.778 seconds ./spec/models/lead_spec.rb:526
  Lead stubed notify_lead_update .tags #tag_me tags the leads
    0.75545 seconds ./spec/models/lead_spec.rb:531
This option allows us to have instant feedback on spec time, making it easier to optimize the time for each test.
A good unit test should take less than 0.02 seconds. Functional ones usually take longer.
So you do not have to type the option on every run, you can add the line --profile to the .rspec file at the root of your project.
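A typical .rspec file would then look like this (the first two flags are common defaults, not requirements):

--require spec_helper
--color
--profile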
Database calls (queries)
7. Use find_each, not each
Car.each      # [Bad] Loads all records into memory
Car.find_each # [Good] Loads only the records of the current batch (1,000 by default)
With each, memory usage grows along with the size of the table, since it loads the entire table in one go.
find_each, on the other hand, has a fixed memory consumption. It is influenced only by the size of the batch, which can easily be configured with the batch_size option.
In terms of implementation, nothing else changes, so use find_each at will.
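For example (the batch size and the per-record work are illustrative):

Car.find_each(batch_size: 500) do |car|
  car.touch # only 500 cars are held in memory at a time
end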
8. Use pluck, not map
Car.where(color: :black).map { |car| car.year }
# [SQL] SELECT "cars".* FROM "cars" WHERE "cars"."color" = 'black'
# [Bad] Loads all attributes of the cars and uses only one

Car.where(color: :black).map(&:year)
# [Bad] Same as the previous example, only with shorter syntax

Car.where(color: :black).select(:year).map(&:year)
# [SQL] SELECT "cars"."year" FROM "cars" WHERE "cars"."color" = 'black'
# [Good] Loads only the year attribute of the cars

Car.where(color: :black).pluck(:year)
# [SQL] SELECT "cars"."year" FROM "cars" WHERE "cars"."color" = 'black'
# [Good] Same as the previous example, with smaller and clearer syntax
9. Use select, not pluck, when chaining queries
owner_ids = Car.where(color: :black).pluck(:owner_id)
# [SQL] SELECT "cars"."owner_id" FROM "cars" WHERE "cars"."color" = 'black'
# owner_ids = [1, 2, 3...]
owners = Owner.where(id: owner_ids).to_a
# [SQL] SELECT "owners".* FROM "owners" WHERE "owners"."id" IN (1, 2, 3...)
# [Bad] Executes 2 queries

owner_ids = Car.where(color: :black).select(:owner_id)
# owner_ids = #<ActiveRecord::Relation [...]>
owners = Owner.where(id: owner_ids).to_a
# [SQL] SELECT "owners".* FROM "owners" WHERE "owners"."id" IN (SELECT "cars"."owner_id" FROM "cars" WHERE "cars"."color" = 'black')
# [Good] Executes only 1 query, with a subselect
When you use pluck, it executes the query and loads the entire list of results into memory. Then another query is performed with the result obtained previously.
With select, ActiveRecord stores only a Relation and joins the two queries into one.
This saves the memory that would be required to store the results of the first query. In addition, it eliminates the overhead of one more round trip to the database.
Databases have been evolving for a long time; if there is any optimization that can be done in this query, the database will do it better than ActiveRecord and Ruby.
10. Use exists?, not any?
Car.any?
# [SQL] SELECT COUNT(*) FROM "cars"
# [Bad] Counts the whole table
Car.exists?
# [SQL] SELECT 1 AS one FROM "cars" LIMIT 1
# [Good] Just checks for one row, with LIMIT 1
The runtime of any? grows with the size of the table. This is because it performs a count over the whole table and then checks whether the result is zero. exists?, on the other hand, adds a LIMIT 1 to the query and always takes the same time, regardless of the size of the table.
11. Load only what you will use
header = %i(id color owner_id)
CSV.generate do |csv|
  csv << header
  Car.select(header).find_each do |car|
    csv << car.values_at(*header)
  end
end
We often iterate over records and use only a few fields. This ends up wasting time and memory, because we load all the other fields that will never be used.
In my project, for example, I reduced memory usage by more than 88%. By reviewing the queries and adding select to all of them, I was able to increase concurrency on the same machine, and execution became 12 times faster.
All of this optimization took less than 1 hour of work. And if this concern is already in your head at creation time, the additional cost is virtually zero.
Now we will focus on data treatment and consistency: tips and tricks when dealing with import and export (CSV) files.
Consistency of data
12. Set the time
Car.where(created_at: 1.day.ago..Time.current)
# [Bad] The result may change depending on the time it is run

selected_day = 1.day.ago
Car.where(created_at: selected_day.beginning_of_day..selected_day.end_of_day)
# [Acceptable] The result may still change, but controlling this change is relatively easy

selected_day = Time.zone.parse("2018/02/01")
Car.where(created_at: selected_day.beginning_of_day..selected_day.end_of_day)
# [Good] The result does not change
Let’s say we have a routine that runs daily at midnight to generate a report from the previous day.
The first implementation may generate wrong results if it is not run at exactly that time.
In the second, you just need to ensure that it runs on the correct day, and the result will be as expected.
In the third, it can be executed on any day; you just need to inform the correct target date.
My tip for periodic routines is to make the target date a parameter whose default is the previous period.
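A minimal sketch of that idea (the report class and its body are hypothetical):

class DailyCarReport
  # The target date defaults to the previous period, but any day can be re-run
  def self.generate(target_day: Date.yesterday)
    range = target_day.beginning_of_day..target_day.end_of_day
    Car.where(created_at: range).find_each do |car|
      # ... build the report row for this car
    end
  end
end

Running DailyCarReport.generate at midnight covers yesterday, and DailyCarReport.generate(target_day: Date.new(2018, 2, 1)) re-generates any past day.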
13. Ensure ordering
Car.order(:created_at)
# [Bad] Can return in a different order if there is more than one record with the same created_at
Car.order(:created_at, :id)
# [Good] Even with repeated created_at values the id is unique, so the returned order will always be the same
Car.order(:license_plate, :id)
# The :id here is not necessary, because license_plate is already unique
In situations where order is important, always use a unique attribute as the tie-breaking criterion.
When we are not careful about this, we usually only find the error in production, because tests use dummy data and rarely cover cases like this.
My tip for this item is to invest in prevention.
14. Beware of where with :updated_at and batches
Car.where(updated_at: 1.day.ago..Time.current).find_each(batch_size: 10)
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646') ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646') AND ("cars"."id" > 3580987) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646') AND ("cars"."id" > 21971397) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] ...
#
# [Bad] Records may be missed

Car.where('updated_at > ?', 1.day.ago).to_a
# [SQL] SELECT * FROM "cars" WHERE (updated_at > '2017-02-19 11:48:51.582646')
#
# [Less bad] There are no batches, so the records are correct; however, it can consume a lot of memory

ids = Car.where('updated_at > ?', 1.day.ago).pluck(:id)
Car.where(id: ids).find_each(batch_size: 10)
# [SQL] SELECT id FROM "cars" WHERE (updated_at > '2017-02-19 11:48:51.582646')
# [SQL] SELECT * FROM "cars" WHERE (id IN (31122, 918723, ...)) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] SELECT * FROM "cars" WHERE (id IN (31122, 918723, ...)) AND ("cars"."id" > 3580987) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] ...
#
# [Best] Only the ids of the records are preloaded, and all records are processed correctly in batches, although it can consume a lot of memory if the table is GIANT
The :updated_at attribute changes very frequently, and between the SQL of one batch and the next a record can be updated, making you process repeated records or fail to process others.
Unfortunately, I have not found an optimal solution for this case, but pre-selecting the ids and then iterating in batches over them (which are fixed) ensures that every record is processed correctly, exactly once.
Trying to understand all the workings and possible exceptions of these queries is very difficult, so, when in doubt, avoid batches with updated_at.
Date and time
15. TimeZone
Time.parse("2018/02/01")
Time.now
# [Bad] Do not consider the time zone

Time.zone.parse("2017/02/01")
Time.zone.now
Time.current # same as the line above
# [Good] Consider the time zone
Did you think time was not going to give you any trouble?
The most common mistake is not considering the time zone in operations with dates and times.
Even if your system does not need to handle multiple time zones, always use the time-zone-aware methods, so you do not waste time later fixing everything.
16. Be very careful with time zones in queries
Car.where("created_at > '2018/02/01'")
# [Bad] Does not consider the time zone
Car.where('created_at > ?', Time.zone.parse("2018/02/01"))
# [Good] Considers the time zone
When you build queries through ActiveRecord, it handles the time zone correctly, so there is no problem.
But sometimes we need to write more "manual" queries, and in those cases the time zone care is all yours.
Use parameterized queries and Time.zone to be safe.
17. DB always works with UTC
sql = <<-SQL
  INSERT INTO cars (id, created_at, updated_at)
  VALUES (#{id}, '#{created_at.utc}', #{Sequel::CURRENT_TIMESTAMP})
SQL
The important thing here is that when you generate SQL by hand, you need to leave all dates in UTC.
18. Propagate the time
class CarController
  def create
    CarCreateJob.perform_async(car_params.merge(created_at: Time.current))
    render :ok
  end
end
Often we receive an API call and enqueue a background job to complete the action.
When it is an action that creates something, it is very important to record the date/time at which the API was called.
This ensures that the object is saved with the correct date even if the execution is delayed because of queues or slowness on the server.
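On the other side, the job just uses the timestamp it was given. A minimal sketch, assuming Sidekiq (the job and model come from the example above):

class CarCreateJob
  include Sidekiq::Worker

  def perform(params)
    # Uses the created_at captured when the API was called, not Time.current here
    Car.create!(params)
  end
end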
Import and export
Transporting data between systems is always a challenge.
We have invented APIs, services, and other mechanisms, but we still fall back on the good old data transport: simple tables with comma-separated values (.csv).
So, when dealing with transport via CSV, remember:
19. Do not let one case interrupt the whole process
# Ignore invalid rows on export
Car.find_each do |car|
  row = to_csv_row(car)
  if valid_row?(row)
    csv << row
  else
    notify_me_the_error_so_i_can_fix_it(row)
  end
end

# Begin-rescue to ensure creation on import
CSV.parse(file) do |row|
  begin
    Car.create!(row)
  rescue => e
    errors.add(row, e)
  end
end

# Background jobs on import treat each row separately
# The code is very clean, uses little memory, and performance is better
CSV.parse(file) do |row|
  CarCreateJob.perform_async(row)
end
Have you ever been running that migration script and it stopped in the middle because an exception popped up, and you had to process everything all over again?
Avoid this by ensuring that even if one record fails, the rest runs smoothly.
It is better to have a few unprocessed records than to be missing everything that came after the failure.
Also put mechanisms in place so that these exceptions are reported to you and can be solved in some way.
Do not just write to the logs; something has to alert you, be it an email, an alert on your dashboard, anything.
If it is a very specific case, it may be easier to simply handle it by hand, but if it is something that can affect more cases, hurry up and fix it!
20. Use “tab” as a separator in CSV
CSV.generate(col_sep: TAB) do |csv|
  ...
  csv << values.map { |v| v.gsub(TAB, SPACE) }
end
For one reason or another, one day we have to import or export data in CSV format.
On export, if we choose the comma as separator and some user-entered value contains a comma, we have to handle it so as not to break the CSV.
The most common approach is to wrap such fields in quotation marks ("); then, if the value itself contains quotation marks, we have to handle that too, and so on.
The code for all this gets somewhat complex, increasing the chance of bugs and unforeseen cases, and possibly degrading performance.
If we use the "tab" character as separator, the scenario changes.
Replacing a user-entered "tab" with a "space" is practically imperceptible in most cases, and we do not have to worry about any other character; the generated "CSV" is very clean and readable.
Of course, there are some cases in which the user's "tab" is important, so we always have to think before choosing.
Trust me, the “tab” is your friend.
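Reading the file back is symmetric. A quick sketch (TAB and SPACE are the constants assumed in the snippet above; process is a hypothetical handler):

TAB = "\t".freeze
SPACE = ' '.freeze

CSV.parse(file, col_sep: TAB) do |row|
  # No quoting or escaping gymnastics: tabs were replaced by spaces on export
  process(row)
end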
21. Treat the data
date.strftime("%Y/%m/%d")

string.strip.delete("\0")

tag_string.parameterize

ANYTHING_BETWEEN_PLUS_AND_AT_INCLUSIVELY = /\+.*@/
email.downcase.delete(" ").gsub(ANYTHING_BETWEEN_PLUS_AND_AT_INCLUSIVELY, "@")

THREE_DIGITS_SEPARATOR_CAPTURING_FOLLOWING_THREE_NUMBERS = /[.,](\d{3})/
DECIMAL_SEPARATOR_CAPTURING_DECIMALS = /[.,](\d{1,2})/
number_string
  .gsub(THREE_DIGITS_SEPARATOR_CAPTURING_FOLLOWING_THREE_NUMBERS, '\1')
  .gsub(DECIMAL_SEPARATOR_CAPTURING_DECIMALS, '.\1')
Who has never suffered with "equal" texts being considered different because of uppercase and lowercase letters, spaces at the beginning and end, and so on?
In the United States, dates are in month/day/year format; in most other countries, it is day/month/year. When you simply read a date, you can easily end up swapping the day and the month.
And that user who fills out a form with "so-and-so+2@example.com", ending up as 2 different users for the same person?
There are many problems caused by this kind of nonsense, so before filling an export CSV or reading from an import, treat the data to ensure its integrity, and be happy.
22. Validate the data
time > MIN_DATE && time <= Time.current
object.present? && object.valid?
!option.blank? && VALID_OPTIONS.include?(option)
Even with everything treated as it should be, we sometimes receive dates from before Christ or in the future.
Null or invalid objects raise all sorts of exceptions.
Validating is this simple, so there is no excuse not to do it.
23. Encoding is Evil
I do not think I need to convince anyone that encoding is evil.
The charlock_holmes gem will help you in this battle.
require 'charlock_holmes'

contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => { encoding: 'UTF-8', confidence: 100, type: :text }
encoding = detection[:encoding]
CharlockHolmes::Converter.convert(contents, encoding, 'UTF-8')
Even so, it is not a silver bullet: if the confidence is not very high, it is worth telling the user that the file could not be read and asking them to convert it to UTF-8 or some other accepted format.
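Something along these lines (the threshold and the error class are arbitrary assumptions):

MIN_CONFIDENCE = 70 # arbitrary threshold; tune it for your data

detection = CharlockHolmes::EncodingDetector.detect(contents)
if detection.nil? || detection[:confidence] < MIN_CONFIDENCE
  raise UnreadableFileError, 'Please convert the file to UTF-8' # hypothetical error class
end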
24. CSV with headers: true
CSV.parse(file) do |row|
  # row => ['color', 'year', ...]
  next if header?(row)
  # row => ['black', '2017', ...]
  Car.new(color: row[0], year: row[1])
end
# [Bad] Needs to handle the first line (the header) by hand
# [Bad] Breaks if the CSV column order changes

options = { headers: true, header_converters: :symbol }
CSV.parse(file, options) do |row|
  # row => { color: 'black', year: '2017', ... }
  Car.new(row.to_h)
  OpenStruct.new(row.to_h).color # => 'black'
end
# [Good] CSV handles the header for you
# [Good] The implementation is independent of the CSV column order

values = CSV.parse(file, options).map(&:to_h)
Car.create(values) # Inserts all at once
When reading a CSV with a header, you would have to ignore the first line and trust the user to order the columns of the file the way you expect.
We know this does not happen; there will always be a CSV with the columns in the wrong order.
Because of this, we end up having to read the header and interpret each line accordingly.
There is a CSV parameter that already does this for you.
With the option headers: true, you can iterate over the CSV with an object similar to a hash.
It is free; enjoy and use it.
25. CSV Lint
dialect = {
  header: true,
  delimiter: TAB,
  skip_blanks: true,
}
Csvlint::Validator.new(StringIO.new(content), dialect)
The csvlint gem statically validates the CSV file.
It is so cool that it returns all the errors, indicating the reason and the line of each one.
Super easy and fast.
This eliminates much of the import errors.
26. Show user errors
Okay, you’ve protected your system well by validating encoding, checking for file syntax errors, and bypassing rows with errors.
Your system may be 100% OK, but the user still has not had 100% of their important data imported.
Whenever you detect and skip an error, let your dear user know in a very friendly way, giving tips on how they can fix it.
If possible, even generate a new CSV with only the lines with errors.
Help them and earn their loyalty.
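A sketch of that idea (error_rows is assumed to be a list of [row, message] pairs collected during the import):

CSV.generate(col_sep: "\t") do |csv|
  csv << ['error', 'original row']
  error_rows.each do |row, message|
    csv << [message, *row]
  end
end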
We still need to worry about files
27. Parameterize
path = 'path/to/Filé Name%_!@#2017.tsv'
extension = File.extname(path)
File.basename(path, extension).parameterize
# => 'file-name-_-2017'
Files are saved on all sorts of operating systems, by users and by servers.
Accents and strange characters, especially for us, can cause a lot of headaches.
Fortunately, parameterize is a good remedy for this and has no contraindications.
28. Avoid long file names
Depending on the file system or communication protocol, long file names can simply be cut off.
The FTP protocol, for example, does this.
I usually like to use meaningful names, and sometimes they end up getting long.
I once had to go back and shorten everything because I had problems with long names.
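A simple guard helps (the 100-character limit is an arbitrary assumption; check the real limits of the systems you target):

MAX_BASENAME_LENGTH = 100 # assumption: adjust to the strictest system involved

extension = File.extname(path)
basename = File.basename(path, extension).parameterize
safe_name = "#{basename[0, MAX_BASENAME_LENGTH]}#{extension}"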
29. Compress the CSV
CSV files tend to have many repeated characters, so when compressed their size shrinks dramatically.
I have seen 32 MB files reduced to 6 MB, so it is worth it.
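For example, with gzip from Ruby's standard library (the file names are illustrative):

require 'zlib'

Zlib::GzipWriter.open('cars.csv.gz') do |gz|
  gz.write(File.read('cars.csv'))
end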
30. zipRuby, not RubyZip
RubyZip is 2x slower than zipRuby and allocates 700x more objects in memory.
I do not think I need to say anything else.
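For reference, creating a zip with zipruby looks roughly like this (a sketch based on the gem's documented API; file names are illustrative):

require 'zipruby'

Zip::Archive.open('cars.zip', Zip::CREATE) do |archive|
  archive.add_file('cars.csv') # adds the file under its own name
end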
Got any questions?
Do you have killer tips on the subject?
Leave a message in the comments!