Facebook C++ Conference

I managed to snatch tickets for the Facebook C++ Conference. I will be live blogging the event all day. Hope you enjoy!

20.49

Summary

Mainstream hardware is becoming permanently parallel, heterogeneous and distributed.
– Moore’s Mine is ending, but we know the answer: Open a new mine.
– As usual, the end of one wave overlaps the beginning of the next.
The long-term free lunch app requires lots of juicy latent parallelism expressed in a form that can be spread across a machine with a variable number of cores of different kinds – big/small/specialized cores, and local and distributed cores.
– The filet mignon of throughput gains is still on the menu, but now it costs extra effort – development, complexity, and testing.

20.12

Hardware is in motion! This talk is tougher to describe, as the presentation relies on many graphics.

20.00

We are observing the end of Moore’s Law. We can no longer rely on single-threaded performance increases, we need to go beyond multi-core to cloud-core and heterogeneous cores.

19.52

Trifecta
– Multicore (2005-) e.g. iPad 2, Kindle Fire, etc..
— Initial transition now complete in all mainstream form factors
– Heterogeneous cores (2009-) e.g. compute-capable GPUs.
— 100x and 1000x parallelism already available on home machines for apps that can harness the GPU.
– Elastic compute cloud cores (HaaS, 2010-) e.g. AWS, Azure
— Summer 2011

19.50

We are entering a Jungle! 2011-201x Put a heterogeneous supercomputer cluster on every desk, in every home and in every pocket.

19.47

Welcome to the Jungle by Herb Sutter

19.35

Short break before the next talk by Herb Sutter.

19.30

Summary
Performance and Efficiency

Critical for Technological Advancement
enables Fundamentally New Products and Experiences
Incremental Steps Add Up to Game-Changing Optimizations

19.28

Summary
AtomicHashMap

Atomic Operations are Slow
– Remove Ordering Constraints if Possible
Unique Set of Trade Offs
– Not Right For All Applications

19.18

Reference Implementation
Micro-Sharding

– Fragment Data to Reduce Contention
– Use Spin Locks Since Contention is Low

19.13

folly::AtomicHashMap
Benchmarking

– Simple
– Isolate Important Components

19.10

Projecting Hash Values

Hybrid Solution:
– Make Array Optimal Size
– Mask by Next Power of 2
– If Outside the Array, Use Modulus

19.09

Projecting Hash Values

Modulus is Slow
– Integer Division
Masking With Power of 2 Is Wasteful
– nextPowTwo(1000 Entries/ 80% load factor = 1250)

19.06

Approach

Increments Cached Per Thread
– Periodically Flushed Atomically
– Traverse All Threads for Exact Value

New ThreadLocal Class
– Fast
– Provides Iterator Across All Threads

19.01

folly::ThreadLocal
Implementation

Store Arrays of Pointers with __thread
Unique Index Per ThreadLocal Instance
User Mutex on ctor/dtor/iterate
– One per “Tag”
About 4x Faster than boost::thread_specific_ptr

18.58

Tracking Size

Obvious Solution : Atomic Increment
– But wait! 50M inc/sec < 100 insert/sec – Atomic Increment is a Bottleneck Try again: ThreadCachedInt Sharded Atomic Increment – Requires a Lot of Shards Thread Local Storage – Better with New ThreadLocal Class

18.56

Remaining Issues

How to Track Size?
Modulus is Slow

18.54

Erase

1. Do a Find
2. Miss -> Failed Erase
3. Found ->compxchg to Empty Sentinel Key

Empty -> Locked -> Valid -> Erased

18.50

Growth

1. Fill to Max Load Factor
– Can’t Rehash

2. Create Another SubMap

3. Re-Project and Continue Probing

18.43

Atomic Probing Algorithm

Always in a Consistent State
– find() has No Memory Barriers

insert() Uses Keys for Locking
– Very Low Contention
– No Additional Storage

18.41

Probing Algorithm
– Project Hash Onto Array
– Advance Until Equal or Empty
– Wrap Around

18.39

New Approach: AtomicHashMap

folly:AtomicHashMap
Result Preview
– Blistering Speed (1.3x-5x)
– Memory Efficient + Compact References
– std::unordered_map Interface

Except…
– Only Integer Keys
– Sensitive to Initialization Size
– Erased Entities Never Reclaimed

18.36

Graph Storage and Access
Possible Approaches

Micro-Sharding
Layered Maps
Intel tbb:concurrent_hash_map
Buil Something New

18.36

Graph Storage and Access
Considerations

– CPU Bottleneck
– Memory Sensitivity
– Many Cores Accessing the Same Data
– Updated in Real-Time

18.35

What’s it look like?

Huge trove of data, many users have 1M 2nd degree connections
Data and Access Highly Dynamic, live updates, new products
Bottom line: we need performance

18.33

Facebook: The Graph?

What’s it Good For?
News feed, what’s going on in the action graph.
Search, find people and things
Ads, Groups, etc.

18.32

What should we do about performance? Challenge assumptions, hardware tradeoffs are changing, talk and share pre.

18.30

At facebook: incremental changes add up, speed is critical for virality, help the environment -> performance is very important.

18.29

Massively Parallel Hashmaps: Spencer Ahrens

Why do we care about hashmap performance?

18.19

Short break between talks, will be back!

18.15

folly: https://github.com/facebook/folly

18.14

Folly components:

1. Containers
2. Parsing and formatting
3. Multithreading helpers
4. General utilities
5. More to come

18.13

fbstring will be open-sourced!

folly: Facebook’s Open-Source Llibrary

18.13

Andrei emphasizes that you should learn math, cs people really need to know math.

18.06

fbstring speedup – 2.53 times faster with ILP and loop unrolling

18.02

modulo subtraction is injective

17.55

ILP speedup to atoul gives a 193% speedup over atoul.

17.53

associative means parallelizable

17.53

Minimizing data dependencies

“5235” -> 5235! string to integral (ids are numbers stored as strings)

string atoui -> loops each char and converts to the appropriate digit… see atoui implementations

fbstring actually takes advantage of loop unrolling when converting strings to ints.

17.45

Better instruction level parallelism = fewer data dependencies

17.45

Instruction Level Parallelism

1. Pipelining
2. Superscalar execuion
3. OoO execution
4. Register renaming
5. Speculative execution

17.43

The Forgotten Parallelism

People forget about CPU parallelism which could be taken advantage of. Each core in facebook server, nehalem cpu, has 3 ALUs which are forgotten. You can do three simultaneous arithmetic calculations in a single thread! (SandyBridge has more than 3 ALUs). How do we use them?

17.39

Some Benchmarks

push_back speedup vs std::string is huge compared to the standard implementation. 400% speedup for 1 push_back call, 500% for 2 push back calls, less than 200% for 256k push_back calls. Plateau is 1.4 times faster than standard implementation.

17.34

The French Connection: jemalloc

Jason Evans wrote a great allocator for FreeBSD and it’s used everywhere at facebook.

jemalloc
1. Highly concurrent
2. Cache-friendly
3. Fast

Extended allocator interface
1. Allocate in allocator-sized slabs
2. In-place expansion where possible

fbstring solves fragmentation by knowing what jemalloc wants, the string will use what jemallocs wants to allocate its internal representation.

jemalloc allows reallocation in place – e.g. a large string built in a loop will be improved and occasional quadratic algorithms become linear! woot!

17.27

23 >> 15 (23 is the max chars in a fbstring, 15 is what other libraries use for string)

23, 15 is not a big deal… actually it is because at facebook they have many protocols that store ints as strings, many ints are smalls but not all! User ids at facebook are randomly generated, 64 bits. Almost all these ids don’t fit in 15 chars… thus will spill into allocated memory but with fbstring none will spill and thus be highly optimized.

17.24

fbstring has control. fbstring lets them take full control over where data goes, optimizes memory allocation. The performance impact is huge.

Andrei is making fun of Python/Java programmers for not knowing what happens in memory.

17.22

fbstring data layout

Small
hello, world! slash0 | 9 (9 is the number of bytes unused, allows them to append quickly and the last byte is used as a null terminator when there are no more bytes unused)

Medium/Large
char* data | size_t length | size_t capacity | FF
(if FF is all ones then we will consider the string in this format otherwise in the small format.

Large
atomic refCount | hello slash0

17.17

fbstring’s 3-tiered layout

1. In-place store for 0..23 chars
2. Eagerly copied values 24..255 chars
3. Copy-on-write 256 chars and above

17.16

We are live again! Andrei is speaking, the title of the talk is Sheer Folly.

C++, why do we care? Why do we still care about system languages.
1. Data Layout Control – How data sits in memory? C++ allows you to literally tell where ever byte goes. Example: fbstring. Hang on, let’s not invent the wheel here but we did… demotivation for std string : 1 – std::string as a container is fail – 2 – std::allocator is fail – 3 – std::char_traits is fail – 4 – algorithm integration is fail – 5 – copy-on-write became fail… motivation: 1 – because its there which lets us take existing pre and link against fbstring

2. Cooperation with the allocator

3. Copy on write optimization

16.06

Lunch Break! Following lunch, Andrei Alexandrescu will be making an announcement of interest to C++ programmers around the world. Stay tuned.

16.02

Slides from the talk will become available available after the event.

16.00

Summary

– Perfect forwarding fails for some kinds of arguments
– To “specialize” forwarding templates, use helper class templates or std:enable_if
– templates + pImple => explicitly force instantiations.

More information on comp.std.c++ discussion initiated 16 January 2010

15.58

VC11 compiler rejects the explicit instantiation. MS has confirmed two compiler bugs but a workaround exists.

The specialization of the size of the const char, we can fix that but it is left as homework to the attendee.

15.56

All you need to do is take a look at the g++ error message, copy it into a source file and this will explicitly instantiate the template but you have to double check the type deduced by the compiler because it may not be the same as the type passed into the function. But g++ is so self aware that it will actually compile/link/run. 😀

15.54

Explicit Instantiation

The templates again:

template&lt;typename T&gt; Widget::Widget(T&amp;&amp; param) // forwarding ctor

template&lt;typename T&gt; void Widget::setName(T&amp;&amp; param) // forwarding setting

The uses again:

Widget w(&quot;This is a test&quot;)

w.setName(std::string(&quot;This is another test&quot;));

w.setName(s);

Instantionals we need:

Widget ctor with T= const char(&)(15)
setName with T= std::sring and also T = std::string&

15.52

Forwarding and the pImpl Idiom. Pimplemization is straight forward and compiles without fuss on client pre, but the linker disagrees and can’t link! Ha! The usual template instantiation problem: compilers need template source for instantiation. Pimple precludes that.
Possible solutions:
– abandon pImpl, i.e. admit defeat (!)
– use different build chain -> some use link-time instantiation
– perform explicit instantiation

15.48

This is the first adventure in perfect forwarding, how do you deal with specializing for different types without messing with the universal binding, syntax T&&, forced on you.

15.46

Specializing for shared_ptr does not specialize for const shared_ptr. To catch both we can add another partial specializing.

15.45

Random comment from the crowd: I love being able to write >> in the new C++11 without the space. Scott Meyer is envious that someone can be so happy about that.

15.44

We definitely want to have a specialization for shared_ptrs.

15.44

Specializing for pointers unrealistic – rarely useful to distinguish pointer lvalues and rvalues.

Specializing for UDTS very realistic.
– Often useful to distinguish lvalues and rvales
– std::shared_ptr is such a UDT. Moving cheaper than copying.

15.42

Dispatching through a Class Template … enables use of class template partial specialization allows us to avoid the enable_if problem.

15.38

Too much pre on the slide but there are several drawbacks to enable_if, especially when we start using expressions such as const char*

15.37

One solution: overload via std::enable_if

Below pre is generalization…

template&lt;typename T&gt;
typename std::enable_if&lt;
!std::is_pointer&lt;
typename std::remove_reference&lt;T&gt;::type
&gt;::value
&gt;::type
fwd(T&amp;&amp; param)
{
f(std::forward&lt;T&gt;(param))
}
template&lt;typename T&gt;
typename std::enable_if&lt;
std::is_pointer&lt;
typename std::remove_reference&lt;T&gt;::type
&gt;::value
&gt;::type
fwd(T&amp;&amp; param)
{
f(std::forward&lt;T&gt;(param))
}

15.28

!!!
Function templates can’t be specialized, only overloaded. But the form of forwarding templates is constrained, the only parameter type that works is T&&!. Works = handles all combos of const/non-consts expressions.

15.27

“Specializng” forward templates

Forwarding to f is easy

template&lt;typename T&gt; void fwd(T&amp;&amp; param){
f(std::forward&lt;T&gt;(param));
}

but what if we want to specialize our template for pointers?

15.26

Several kinds of arguments can’t be perfect-forwarded.
* 0 as null pointer constant
* braced initializer lists
* integral consts…

15.25

std::make_shared is a function that will return a shared pointer and will typically more efficient than creating your object and than wrapping it in a shared pointer.

std::make_shared:

std::string s(&quot;PF example&quot;);
auto pw = std::make_shared&lt;Widget&gt;(s, std:vector&lt;int&gt;({1,2,3}));
template&lt;typename T, typename.. ParamTs&gt; shared_ptr&lt;T&gt; make_shared...

// in g++

this will go through several layers, and thus perfect forwarding becomes extremely desirable.

15.17

Perfect forwarding is our main interest. Perfect forwarding allows you to pass params straight through templates, we have a common use case, when we build constructors in templated classes. This is when we have problems with the new move construct… perfect forwarding is difficult because we will have to be careful with rvalues and rvalue references.

15.13

In order to support moves in C++11, we have a new construct called rvalue references. New syntax type&& will give you rvalue references. Rvalue references identify candidates for moving versus copying.

Must be careful, in function templates type&& are not really rvalue references. They mean “bind to anything” -> universal references.

Example:

template&lt;typename T&gt; void f(T&amp;&amp; param);
int x;
f(x); //lvalue: T is int&amp;, instantiates f(int&amp;)
f(factory()); // rvalue: T is std::shared_ptr&lt;Widget&gt;, instantiates f(std::shared_ptr&lt;Widget&gt;&amp;&amp;)

15.07

Scott will be speaking about C++11. Lvalues are generally expressions that take an address and they are important for taking copies, copy requests. In C++11, copy requests can often become move requests, the langauge is now taking more care into the difference between l and r values.

15.05

C++ is seeing a new growth because we can’t get faster processors very much anymore. Scott will now give his talk.

15.04

Opening remarks starting, Andrei Alexandrescu is our MC, hosting the first #fb #cpp #conf. There is a great demand and knowledge for C++ and facebook is very excited about this new growth in interest.

14.46

Slides are coming up: Adventures in Perfect Forwarding by Scott Meyers. Opening remarks are expected to be on time. #fb #cpp #conf

14.31

Blinds are going down here at the fb cpp conf. Stage is being prepared. 30 minutes to go before the opening remarks. Scott Meyer’s talk will follow.

14.13

48 minutes to go before opening remarks, the conference hall is slowly filling up.

13.59

The Facebook conference registration will be starting at 11. The’ve got sweet nametags.

[&amp;facebook]() {
return std::move(fast) &amp; break_things();
}

22 thoughts on “Facebook C++ Conference

  1. Thanks vm for the live comments.
    Someone did post it onto Y Hacker News, but, the hipsters there only seem to comment on Python, Lisp, and 1001 varieties of the same JS framework, and seem to ignore C++. Lest they try to understand it I suppose!

Leave a Reply

Your email address will not be published.