Handling Paginated Resources in Ruby
Wrap your paginated collections in Enumerable goodness
The thing with paginated data is we can't get it all at once.
Let's say we're using the Trello API. There are a number of Trello endpoints that return paginated data sets, such as boards, lists, cards, and actions (like comments, copies, moves, etc).
If we're querying for Trello cards marked as completed each month since last January, for example, we may need to request several pages of "cards".
In most cases, Trello will provide a default limit, typically 50, on the number of resources returned in a single request. But what if you need more than that? In this post, we'll examine a few ways to collect paginated results in Ruby.
Trello World
The Trello developer docs provide a quickstart in javascript - here's the unofficial Ruby version.
While logged into your Trello account (you'll need one first), retrieve your app key. We won't need the "secret" for this article.
Next, you'll generate an app token. Paste the following URL into your browser with your app key subsituted for the placeholder.
https://trello.com/1/authorize?expiration=never&scope=read,write,account&response_type=token&name=Trello%20World&key=YOUR_KEY
Now that we have an app key and token, we can make authenticated requests to the Trello API. As a quick test, paste the following url with your own key and token as pararameters into your web browser (or use curl
) to read your member data.
https://api.trello.com/1/members/me?key=YOUR_KEY&token=YOUR_TOKEN
You should see a JSON response with attributes like your Trello id, username, bio, etc.
Script mode
Now let's fetch some paginated data in Ruby. For the following examples, we'll be using Ruby 2.2.
To make HTTP requests, we'll also use the http.rb, but feel free to subsitute with your HTTP client of choice. Install the gem yourself with gem install http
or add it to your Gemfile
:
# Gemfile
gem "http"
To make things easier for the remainder, add the key and token as environment variables in your shell. For Mac/Linux users, something like this will work:
# command line
export TRELLO_APP_KEY=your-key
export TRELLO_APP_TOKEN=your-token
Now, let's run Ruby version of our Trello World test.
# trello_eager.rb
require "http"
def app_key
ENV.fetch("TRELLO_APP_KEY")
end
def app_token
ENV.fetch("TRELLO_APP_TOKEN")
end
url = "https://api.trello.com/1/members/me?key=#{app_key}&token=#{app_token}"
puts HTTP.get(url).parse
If it worked correctly, you should see the same result we saw in your browser earlier.
Let's extract a few helpers to build the url. We'll use Addressable::URI
, which is available as a dependency of the http.rb gem as of version 1.0.0.pre1
or otherwise may be installed as gem install addressable
or gem "addressable"
in your Gemfile
:
require "http"
require "addressable/uri"
def app_key
ENV.fetch("TRELLO_APP_KEY")
end
def app_token
ENV.fetch("TRELLO_APP_TOKEN")
end
def trello_url(path, params = {})
auth_params = { key: app_key, token: app_token }
Addressable::URI.new({
scheme: "https",
host: "api.trello.com",
path: File.join("1", path),
query_values: auth_params.merge(params)
})
end
def get(path)
HTTP.get(trello_url(path)).parse
end
Let's Paginate
Now we'll add an alternative method to #get
that can handle pagination.
MAX = 1000
def paginated_get(path, options = {})
params = options.dup
before = nil
max = params.delete(:max) { 1000 }
limit = params.delete(:limit) { 50 }
results = []
loop do
data = get(path, { before: before, limit: limit }.merge(params))
results += data
break if (data.empty? || results.length >= max)
before = data.last["id"]
end
results
end
Given a path and hash of parameter options, we'll build up an array of results by fetching the endpoint and requesting the next set of 50 before the last id of the previos set. Once either the max is reached or no more results are returned from the API, we'll exit the loop.
As a starting point, this works nicely. We can simply use paginated_get
to collect up to 1000 results for a given resource without the caller caring about pages. Here's how we can grab the all the comments we've added to Trello cards:
def comments(params = {})
paginated_get("members/me/actions", filter: "commentCard")
end
comments
#=> [{"id"=>"abcd", "idMemberCreator"=>"wxyz", "data"=> {...} ...}, ...]
The main problem with this approach is that it forces the results to be eager loaded. Unless a max is specified in the method call, we could be waiting for up to 1000 comments to load - 20 requests of 50 comments each - to execute before the results are returned.
Stop, enumerate, and listen
Next step is to refactor our paginated_get
method to take advantage of Ruby's Enumerator
. I previously described Enumerator and showed how it can be used to generate infinite sequences in Ruby, including Pascal's Triangle.
The main advantage of using Enumerator
will be to give callers flexibility to work with the results including filtering, searching, and lazy enumeration.
# trello_enumerator.rb
def paginated_get(path, options = {})
Enumerator.new do |y|
params = options.dup
before = nil
total = 0
limit = params.delete(:limit) { 50 }
loop do
data = get(path, { before: before, limit: limit }.merge(params))
total += data.length
data.each do |element|
y.yield element
end
break if (data.empty? || total >= MAX)
before = data.last["id"]
end
end
end
We've got a few similarities with our first implementation. We still loop over repeated requests for successive pages until either the max is reach or no data is returned from the API. There are a few big differences though.
First, you'll notice we've wrapped our expression in Enumerator
which will serve as the return value for #paginated_get
.
Using an enumerator may look strange but it offers a huge advantage over our first iteration. Enumerators allow callers to interact with data as it is generated. Conceptually, the enumerator represents the algorithm for retrieving or generating data in Enumerable
form.
An enumerator implements the Enumerable
module which means we can call familiar methods like #map
, #select
, #take
, and so on.
Instead of building up an internal array of results, enumerators provide a mechanism for yielding each element even though a block may not be given to the method (how mind blowing is that?).
Now we can use enumerator chains to doing something like the following, where we request comment data lazily, transform the API hash to comment text and select the first two addressed to a colleague.
comments.lazy.
map { |a| a["data"]["text"] }.
select { |t| t.start_with?("@personIWorkWith") }.
take(2).force
We may not need to load all 1000 results to because the enumerators chain is evaluated for each item as it is yielded. This technique provides the caller with a great deal of flexibility. Eager loading can be delayed or avoided altogther - a potential performance gain.
Here are magic lines from #paginated_get
:
data.each do |element|
y.yield element
end
The y.yield
is not the keyword yield
, but the invokation of the #yield
method of Enumerator::Yielder
, an object the enumerator uses internally to pass values through to the first block used in the enumerator chain. For a more detailed look at how enumerators work under the hood, read more about how Ruby works hard so you can be lazy.
A cursor-y example
Let's do one more iteration on our #paginated_get
refactoring. Up to this point, we've been using a "functional" approach; we've just been using a bunch of methods defined in the outermost lexical scope.
First, we'll extract a Client
responsible for sending requests to the Trello API and parsing the responses as JSON.
# trello_client.rb
require "http"
require "addressable/uri"
class Client
def initialize(opts = {})
@app_key = opts.fetch(:app_key, ENV.fetch("TRELLO_APP_KEY"))
@app_token = opts.fetch(:app_token, ENV.fetch("TRELLO_APP_TOKEN"))
end
def get(path, params = {})
HTTP.get(trello_url(path, params)).parse
end
private
def trello_url(path, params = {})
auth_params = { key: @app_key, token: @app_token }
Addressable::URI.new({
scheme: "https",
host: "api.trello.com",
path: File.join("1", path),
query_values: auth_params.merge(params)
})
end
end
Next, we'll provide a class to represent the paginated collection of results to replace our implementation of #paginated_get
.
The Twitter API uses cursors to navigate through pages, a concept similar to "next" and "previous" links on websites. Although Trello doesn't provide explicit cursors in their API, we can still wrap the paginated results in an enumerable class to get similar behavior.
# trello_cursor.rb
require_relative "./trello_client"
class Cursor
def initialize(path, options = {})
@path = path
@params = params
@collection = []
@before = params.fetch(:before, nil)
@limit = params.fetch(:limit, 50)
end
end
The Cursor
will be initialized with a path and params, like our paginated_get
. We'll also maintain an internal @collection
array to cache elements as they are returned from Trello.
class Cursor
private
def client
@client ||= Client.new
end
def fetch_next_page
response = client.get(@path, @params.merge(before: @before, limit: @limit))
@last_response_empty = response.empty?
@collection += response
@before = response.last["id"] unless last?
end
MAX = 1000
def last?
@last_response_empty || @collection.size >= MAX
end
end
We'll introduce a dependency on the Client
to interface with Trello through the private client method. We'll use our client to fetch the next page, append the latest results to our cached @collection
and increment the page number. Now for the key public method:
class Cursor
include Enumerable
def each(start = 0)
return to_enum(:each, start) unless block_given?
Array(@collection[start..-1]).each do |element|
yield(element)
end
unless last?
start = [@collection.size, start].max
fetch_next_page
each(start, &Proc.new)
end
end
end
We've chosen to have our Cursor
expose the Enumerable API by including the Enumerable
module and implementing #each
. This will give cursor instances enumerable behavior so we can simply replace our paginated_get definition to return a new Cursor
.
def paginated_get(path, params)
Cursor.new(path, param)
end
def comments(params = {})
paginated_get("members/me/actions", filter: "commentCard")
end
Let's break down Cursor#each
a bit further. The first line allows us retain the Enumerator
behavior before.
return to_enum(:each, start) unless block_given?
It invokes Kernel#to_enum
when no block is given to an each
method call. In this case, the method returns an Enumerator
that packages the behavior of #each
for an enumerator chain similar to before:
puts comments.each.lazy.
map { |axn| axn["data"]["text"] }.
select { |txt| txt.start_with?("@mgerrior") }.
take(2).force
For more info on using #to_enum
, check out Arkency's Stop including Enumerable, return Enumerator instead.
We also need to yield
each element in the @collection
to pass elements to callers of #each
Array(@collection[start..-1]).each do |element|
yield(element)
end
We iterate from the start of the collection to the end with Array(@collection[start..-1]).each
... but wait! when we start iterating, the @collection
is empty:
def initialize
# ...
@collection = []
end
Wat?
The key comes in the lines that follow in #each
:
unless last?
start = [@collection.size, start].max
fetch_next_page
each(start, &Proc.new)
end
Unless we've encountered the last page, we fetch the next page, which appends the latest results to the collection and we recursively invoke #each
with a starting point. This means #each
will be invoked again with new results until no new data is encountered. Sweet!
A neat trick is how we forward the block given to #each
. When we Proc.new
without explicitly passing a block or proc object, it will instantiate with the block given to its surrounding method if there is one. The behavior is similar to the following:
def each(start = 0, &block)
# ...
each(start, &block)
# ...
end
The main benefit being we don't needlessly invoke Proc.new
by omitting &block
in the arguments. For more on this, read up on Passing Blocks in Ruby without &block
"Recursive each" is a powerful technique for providing a seamless, enumerable interface to paginated or cursored results. I first encountered this approach in the sferik's Twitter gem - a great resource for those considering writing an API wrapper in Ruby.
On your own
Give it a shot! Pick out an API you like to use and play with techniques for modeling its collection resources. This is a great way to get more experience with Ruby's Enumerable. Consider one of these approaches when you need to traverse paginated or partitioned subsets of data in an external or internal API.
Think less about pages and more about data.
Changelog
2016-01-28
- Updated the examples to use the
:before
parameter instead of:page
for requests for successive "pages" - Posted the full source of the examples above on GitHub
Credits
Icons via the Noun Project: