Sunday, December 26, 2010

Tuning HBase and Mapreduce for Mass Inserts

For several months my colleagues and I have been earning our battle scars as we run progressively larger bulk inserts into HBase. I'm going to try to recall as many of the lessons learned as possible. At the time of this writing HBase 20.6 is the latest production release:

  1. Use  HBase 20.6. We had been running 20.3 and it suffered from a hideous deadlock bug, that would cause clients to simply stall and do nothing, until they were killed by the Hadoop task tracker. Even worse, because they were killed and restarted, this caused a vicious cycle of reducers stalling, restarting, stalling, etc. All the while threads and client connections were eaten up in HBase.
  2. Use huge numbers of mappers, and huge numbers of reducers so that your data sets are broken into bite sized pieces. To do this, set mapred.tasktracker.[map|reduce].tasks.maximum (the max value that will be run by a single task tracker at once) to a value correlated to the number of cores on each box (we'll cal this "C") and the number of boxes "B". Then set set mapred.[map|reduce].tasks (the total number of mappers or reducers that will get run) equal to a*C*B. "a" is then a free parameter which controls how big your bite sized batches of data will be. As a rule of thumb, if you are reading data out of HBase via TableMapper, I try to make a*C*B between 1000 and 10000, but it all depends on how much processing you ned to do on each chunk.
  3. This is an addendum to rule (2). All sorts of bad things happen when your mappers or reducers bite off too much data, and run for too long. Just a few items on the buffet of errors and warnings that you'll chase if your mappers or reducers try to handle too much data:
    • HBase clients leases expiring
    • Chance of OoME
    • Failure to report status for more than 600 seconds causes task to get shot in the head
    • Hbase client slows down on writes(annecdotal. I can't prove this, but I know that if you want to avoid ptential slowdowns, then following rule 2, in my experience, helps a lot)
  4. Use fine grained status. Make calls to progress(), setStatus(msg), and use counters to track what your job is doing. Nothing is more annoying that not being able to tell what the hell your job is doing when it has been running for 5 hours. Rule 2 helps here too. The more mappers and reducers you have, the more continuous and fine grained will be your status reporting as each task completes with its data load into HBase.
  5. Don't screw around too much with the default settings. Ryan Rawson has put out a good set of defaults in this PPT. You can turn these dials ad infinitum but I can pretty much promise you that there is no magic set of settings. This can be a massive time sync. Stick with Ryan's recommendations.
  6. Watch out for problems with your hostnames. Let's say your hostname is "". If you execute "hostname" and all you get back is "s1" or "localhost" you are in for problems. Check your /etc/hosts and make sure that aliases like "s1" don't come before the fully qualified hostname.
  7. Watch out for msiconfigured NICs. Problems with vanishing packets sent me chasing my tail for weeks. This was the classic buffet of errors, with everything seemingly leading to no resolution.
  8. Disable the Write Ahead Log from your client code. Use an appropriately sized write buffer. There'sa point of diminishing returns here. Exceeding the size of the serverside buffers won't do you any good.
  9. Don't try to get too smart. Most bulk HBase data-producing M/R jobs are going to do some processing, then write the data from the reducer. Since the reducers all receive keys in the same order, this causes all the reducers to attack the same HBase region simultaneously. We had this "great idea" that if we reversed the keys that we wrote out of our mapper, then un-reversed them in the reducer, that our reducers would be randomly writing to different region servers, not hitting a single region in lock step. Now, I have some theories on why this seemingly innocuous approach repeatably destroyed our entire Hbase database. I won't wax philosophical here, but one thing is certain. Any table created via batch inserts of randomized keys got totally hosed. Scans became dirt slow and compactions ran constantly, even days after the table was created. None of these problems made a whole lot of sense, which is why it took 3-4 weeks of debugging for us to back this "key randomizing" out of our code. The hosed tables, actually had to be dropped for the problem, and ensuing chaos to totally abate.
  10. tail all your logs and scan them assiduously for errors and warnings. I use VNCserver with different desktops for tails of all the logs so that I never have to "re-setup" all my tailf commands when I log in and out of the server.
  11. Disable speculative execution. You don't want two reducers trying to insert the exact same content into the database.