Tue, 13 Jul 2010 14:52:29 GMT

Import CSV files in Rails 3

CSV is a ubiquitous importing and exporting format for tables of information conforming to row-column layout. Lines correspond to rows. On each line, commas separate values under each column; hence Comma-Separated Values. Values between commas define table-cell contents, strings or numbers. The format is relatively flexible, relatively easy for machines and even humans to read and write.

In terms of user stories, it answers the following requirement.

As a: database maintainer

I want to: import information from external files, particularly Comma-Separated Value files where rows appear on each line and columns within each row have comma delimiters

So that I can: select columns (and rows) from the CSV files and merge these to create other new pieces of information.

Question is, How do you import CSV files in Rails 3. This article attempts an answer, or at least part of an answer, enough perhaps to point you in the right direction. You can find the sources here at GitHub.

Prerequisites

  • Rails 3
  • FasterCSV gem (Ruby parlance for an installable package)
  • JQuery

Why?

That is an important question. Has this sort of thing not been done before, many times? Well, yes. This is nothing new. But it does offer a fresh look at the subject, especially in the light of the upcoming new version of Rails and with preference for jQuery over Prototype for Rails’ JavaScript library. If you like, consider this a scenic voyage into Rails 3 and jQuery. The seas are fairly calm even though only at Beta.

Check your versions

Entering rails -v on the command line reports,

Rails 3.0.0.beta4

At the time of writing this version is the latest prerelease. Install it using sudo gem install rails --pre; option pre allows installation of prerelease versions, i.e. betas.

Doing similarly for RubyGems using gem -v gives,

1.3.7

Test-driven approach

Begin with a new Rails application.

The command to create a new application has changed since Rails 2. The Rails project now brings together the various disparate scripts and generators under one umbrella, so to speak. New applications now appear when you ask,

rails new APP_PATH

where APP_PATH specifies where your Rails application tree will reside on the local file system. The exact name does not actually matter too much. Although the name does become part of the application proper: it names the application module, and appears in the cookie store key; but you can change these later if you wish.

I will give it a long name. The following command creates a new folder called import-csv-files-in-rails-3. Rails fills it with basic structure and content, enough to get you started.

rails new import-csv-files-in-rails-3

The application is ready to run already. It does not do anything yet, of course. However, you can launch the server and access the application at localhost:3000. Note, syntax for launching a local server has also changed in Rails 3. You no longer enter script/server; it is now in the slightly-less wordy form,

rails s

s for server, naturally.

Test bed

As a simple starting point, imagine a Rails application comprising people and places. People have a first and last name; places have an address. Let the application’s database store these two entities, along with these attributes. Person entity with first and last name attributes, both strings. Place entity with address attribute, a simple string for now.

Entity-Relationship diagram

Generating the basic Rails scaffolding is very easy. Just two commands gives us the entities and their properties, together with corresponding controllers and views to access them via HTML and RESTful XML. g is short for generate in the following two commands.

rails g scaffold person first_name:string last_name:string
rails g scaffold place address:string

Migrate the database using rake db:migrate to update the database schema. By default Rails will use SQLite. Good enough for this exercise. You can then launch a local server to access people at localhost:3000/people and places at localhost:3000/places.

Once set up, the goal will become to import people and place information using CSV files. This is the real challenge.

Test import

One more thing. You may not have a list of people and places handy. But a CSV file or two will definitely be very useful for testing. So just for testing purposes, let us write a script to generate a file.

See below. Copy this and run it to generate some CSV files comprised of random words using Dan Croak’s Webster gem, a package for randomly generating short non-offensive words. You will need to install the gem using sudo gem install webster. Do the same for fastercsv if you haven’t already.

#!/usr/bin/env ruby

require 'optparse'
require 'ostruct'

options = OpenStruct.new
options.rows = 10
options.columns = 10
OptionParser.new do |opts|
  opts.on('-r', '--rows [INTEGER]', 'Number of rows', Integer) do |x|
    options.rows = x
  end
  opts.on('-c', '--columns [INTEGER]', 'Number of columns', Integer) do |x|
    options.columns = x
  end
end.parse!

require 'rubygems'
require 'fastercsv'
require 'webster'

webster = Webster.new

ARGV << '/dev/stdout' if ARGV.empty?
ARGV.each do |file_name|
  FasterCSV.open(file_name, 'w') do |csv|
    options.rows.times do |row_index|
      row = []
      options.columns.times do |column_index|
        row << webster.random_word
      end
      csv << row
    end
  end
end

You will find this script here as a Gist. Download it, make it executable and run using,

./gen-webster-csv.rb test.csv

Two-stage process: upload and merge

For simplicity, the application design will take a two-stage approach to importing.

  • Stage 1: upload
  • Stage 2: merge

Here, upload refers to the action of receiving a raw CSV file from a browser and storing it somewhere within the application’s database or otherwise readily-accessible place. Merge refers to the action of moving selected columns to the database proper. In a real application, you might choose to fold both stages into one action. But for this exercise, we make them distinct. The first stage takes a CSV file and stores it. The second stage lets the user pick out one or more columns and choose where to move the data: to people or places within the database. This second stage does have an advantage: it allows the user to see the imported information and select and deselect columns. Only selected columns will merge and become useful to the app.

Uploading, stage 1

First upload the CSV file, then store it.

Uploading files

Uploading a file is straightforward. Just add a form with a ‘file field.’ When submitted, the form will send a POST request with the file contents attached.

Create a CSV controller with an import action.

$ rails g controller csv import
      create  app/controllers/csv_controller.rb
       route  get "csv/import"
      invoke  erb
      create    app/views/csv
      create    app/views/csv/import.html.erb
      invoke  test_unit
      create    test/functional/csv_controller_test.rb
      invoke  helper
      create    app/helpers/csv_helper.rb
      invoke    test_unit
      create      test/unit/helpers/csv_helper_test.rb

Notice, Rails creates a default controller, a GET-request route for the action, an ERB-based view for the action along with test and helper stubs. Hence, when an external browser sends the application a GET request at csv/import, Rails will match the route, find the ERB template, render the HTML response and reply. The ‘view’ only needs to incorporate a standard HTML form which the browser will promptly display.

Navigate to the import view, app/views/csv/import.html.erb and add the HTML form, as follows.

<%= form_for :upload, :html => { :multipart => true } do |f| -%>
  <%= f.file_field :csv %>
  <%= f.submit "Upload", :disable_with => 'Uploading...' %>
<% end -%>

TextMate makes this very easy to type. Type ff then press Tab to lay out the basic form_for skeleton. Use : then Tab to add a key-value pair of the form :key => "value"; type the key, press Tab and type the value. Tab again moves you to the comma delimiter, and so on.

The most important part of the form is the :html => { :multipart => true } option. This ensures that the POST-request parameters will contain the actual file contents. The request will contain a parameter called upload (as defined by the first argument of the form_for) which maps to a dictionary with key csv mapping to a File object which the application can parse to extract the uploaded CSV. Within the Rails controller for handling the POST request, the File object will be found by accessing params[:upload][:csv].

When the user fills in the form and clicks Upload, the browser will send a POST request containing the CSV-file contents. The Rails application needs to respond. Hence it needs a route describing how the request and URL path translates to controller and action. Under Rails 3, this becomes very easy. Just add the following line to your config/routes.rb. Here, csv/import specifies the URL path; while csv#upload specifies controller and action respectively. When Rails routing sees a POST at that URL path, it will invoke method upload of the CSV controller, class CsvController.

  post "csv/import" => 'csv#upload'

This route needs a corresponding action for the CSV controller. Add this method to your CsvController class, source at app/controllers/csv_controller.rb under the application’s source tree.

  def upload
    # While under development, just respond by rendering some in-line text.
    # Send back the request parameters in JSON (JavaScript Object Notation)
    # format, i.e. something reasonably easy to parse with the human eye.
    render :text => params.to_json
  end

Store the CSV contents

Where to store the imported CSV information? We could store the CSV file verbatim somewhere on a local file system; that would be typical. But let us assume that there is no local file system such as for web applications running in the cloud.

Storing the CSV temporarily in some kind of document-based key-value store like MongoDB or CouchDB would be another alternative. It would live outside the database. The application database would contain a link to the external ‘document’ and read it when required. But no. For now, let us keep it fairly simple. Instead, we will conceptually map the CSV to a simple data model and store the CSV information within the database itself. See data model below.

Import tables and cells

ImportTables have many ImportCells; ImportCells have row and column indices, string contents and belong to one particular ImportTable. The database will index the cells by their import_table_id (back-reference to the owning import table) in order to pull all cells belonging to a table quickly from the database into main memory.

Implementing this model in Rails proves relatively easy.

$ rails g scaffold import_table original_path:string
      invoke  active_record
      create    db/migrate/20100715123031_create_import_tables.rb
      create    app/models/import_table.rb
      invoke    test_unit
      create      test/unit/import_table_test.rb
      create      test/fixtures/import_tables.yml
       route  resources :import_tables
      invoke  scaffold_controller
      create    app/controllers/import_tables_controller.rb
      invoke    erb
      create      app/views/import_tables
      create      app/views/import_tables/index.html.erb
      create      app/views/import_tables/edit.html.erb
      create      app/views/import_tables/show.html.erb
      create      app/views/import_tables/new.html.erb
      create      app/views/import_tables/_form.html.erb
      invoke    test_unit
      create      test/functional/import_tables_controller_test.rb
      invoke    helper
      create      app/helpers/import_tables_helper.rb
      invoke      test_unit
      create        test/unit/helpers/import_tables_helper_test.rb
      invoke  stylesheets
   identical    public/stylesheets/scaffold.css
$ rails g model import_cell import_table_id:integer row_index:integer column_index:integer contents:string
      invoke  active_record
      create    db/migrate/20100715123207_create_import_cells.rb
      create    app/models/import_cell.rb
      invoke    test_unit
      create      test/unit/import_cell_test.rb
      create      test/fixtures/import_cells.yml

To establish the one-to-many association between import table and import cell, add two new lines: a has_many and a belongs_to. Your app/models/import_table.rb and app/models/import_cell.rb should look like this. The import table’s :dependent => :destroy option instructs Rails to delete its cells when you delete a table.

class ImportTable < ActiveRecord::Base
  has_many :import_cells, :dependent => :destroy
end
class ImportCell < ActiveRecord::Base
  belongs_to :import_table
end

For speedy performance, you also need to add an index for the import cells. Generate a migration.

$ rails g migration add_index_for_import_cells
      invoke  active_record
      create    db/migrate/20100715124738_add_index_for_import_cells.rb

Then edit the newly-created migration code, adding instructions to add and remove an index. Without this, whenever you ask Rails for a table’s cells using table.import_cells where table references an instance of ImportTable, the answer will arrive but slowly because the database cannot quickly lookup the cells given a particular import_table_id which uniquely identifies the import table. Note also, the index is non-unique. The answers to table.import_cells which navigate the one-to-many table-to-cell association is an array of many ImportCell instances.

class AddIndexForImportCells < ActiveRecord::Migration
  def self.up
    add_index :import_cells, [:import_table_id]
  end

  def self.down
    remove_index :import_cells, :column => [:import_table_id]
  end
end

Finally, do not forget to run the ‘up’ migrations to upgrade your database schema. Run rake db:migrate.

Upload action

Thus far, we have the means to store CSV table information, but do not yet have the code in place to parse the uploaded CSV into import tables. The CsvController method below uses the FasterCSV gem to parse the incoming file. Edit your app/controllers/csv_controller.rb source to look like this.

require 'fastercsv'

class CsvController < ApplicationController
  def import
  end

  def upload
    table = ImportTable.new :original_path => params[:upload][:csv].original_path
    row_index = 0
    FasterCSV.parse(params[:upload][:csv]) do |cells|
      column_index = 0
      cells.each do |cell|
        table.import_cells.build :column_index => column_index, :row_index => row_index, :contents => cell
        column_index += 1
      end
      row_index += 1
    end
    table.save
    redirect_to import_table_path(table)
  end
end

The code constructs a new import table passing the original file’s path; original_path is a column in the ImportTable database table. The upload method uses the build method on the to-many association between table and cell in order to build new cells and attach them to the table. After parsing, the method saves the table then redirects the user to its ‘show’ page so that the user can see the newly-imported data.

Add the following to Gemfile for Rails 3. This tells Rails to prepare the package for import. Without this you cannot do a require 'fastercsv'. Gemfile appears in the root directory of the application folder. Although, if you are using TextMate on a Mac, locating any file no matter where it lives is very quick using Command+T then start typing the name. A list of matches starts to appear. Just pick the one you want.

gem 'fastercsv'

Then to view the uploaded table, add these lines to your ImportTablesController’s show method.

    @import_cells = @import_table.import_cells
    @row_index_max = @import_cells.map { |cell| cell.row_index }.max
    @column_index_max = @import_cells.map { |cell| cell.column_index }.max

Add the lines above after finding the import table, i.e. after @import_table = ImportTable.find(params[:id]). Then display cell contents in app/views/import_tables/show.html.erb by adding this piece of Ruby-embedded HTML.

<table border="1" cellspacing="1" cellpadding="1">
  <% 0.upto(@row_index_max) do |row_index| %>
    <% row = @import_cells.select { |cell| cell.row_index == row_index } %>
    <tr>
      <% 0.upto(@column_index_max) do |column_index| %>
        <td>
          <%= row.select { |cell| cell.column_index == column_index }[0].contents %>
        </td>
      <% end %>
    </tr>
  <% end %>
</table>

Given the import cells collected by the CSV controller together with the index maxima, this view code builds a standard HTML table comprised of tr and td (table row and table data) tags. The row = line (line 3) optimises the iteration somewhat by first collecting together all cells matching the current row index. This is a smaller number of cells from which to select individual cells matching the column index within the inner loop.

Merging, stage 2

At this point, you can navigate to http://localhost:3000/csv/import. You can generate and drop a CSV file from Finder onto the Choose File button (a Mac shortcut for choosing a file) and click Upload. The Rails application then uploads and builds an import table of cells. By redirection, the app then displays the table. So far so good.

However, this is only half the story. What the user actually wants to do is to create People and Places. The CSV information is just the raw data. Some of the columns will contain names of people (first name, last name) while some columns may contain addresses. The columns just contain random words of course, unless you have some real data to play with. But you get the idea. The data needs merging with the other tables to create workable records.

Sketch below outlines the design for merging CSV import tables.

Sketch

We will add a row of table headers to the import table viewing page. Initially they will contain empty ‘select’ buttons. The user starts by selecting which table to import into. After choosing, the page will automatically update the column headers with the appropriate choices for input field mapping. If the chosen table is Person, column choices switch to none, first name or last name. For Places, only two choices: none or address. The user can then assign columns to fields, clicking Merge when ready.

We therefore want the import-table show page to become dynamic. When the user selects the table, the selectors for column field must change appropriately. To make this happen, we take a page out of Ryan Bates’ book. Since the database records are static, we can set up a JavaScript array of mappings between tables and columns. When the user selects the table to import into, person or place, JavaScript within the page can respond by updating the column selectors.

Converting to jQuery with UJS

We plan to use jQuery rather than Prototype as our JavaScript framework. By default however, Rails has set up our application to include Prototype and script.aculo.us. Remove the following list of files.

  • public/javascripts/controls.js
  • public/javascripts/dragdrop.js
  • public/javascripts/effects.js
  • public/javascripts/prototype.js

Replace the rails.js UJS script found in public/javascripts with the jQuery-based version found at GitHub. UJS stands for Unobtrusive JavaScript, new in Rails 3. By making use of HTML 5 enhancements, the UJS feature avoids littering HTML pages with embedded JavaScript. Download and install jQuery 1.4.1 in the same place, at public/javascripts.

There is one more thing to do before jQuery becomes the new default JavaScript framework for our application: create a new ‘initialiser’ to override Prototype and script.aculo.us as the default framework. Create a new Ruby script, config/initializers/jquery.rb; exact name is not critical. This will run when the Rails server initialises. Add the following code.

module ActionView::Helpers::AssetTagHelper
  remove_const :JAVASCRIPT_DEFAULT_SOURCES
  JAVASCRIPT_DEFAULT_SOURCES = %w(jquery-1.4.1.min rails)
  reset_javascript_include_default
end

This will override the default JavaScript sources with jQuery 1.4.1 and its UJS helper. jQuery ready!

Table headers, table selection, Merge button

Next, we need to add the import table headers, a table selection and a Merge button to the current import-table viewing code. Add the following. I’ve commented out where the table contents go, just to make it a little clearer.

<% form_for :merge do |f| -%>
  <table border="1" cellspacing="1" cellpadding="1">
    <tr>
      <% 0.upto(@column_index_max) do |column_index| %>
        <th>
          <%= f.select(column_index, []) %>
        </th>
      <% end %>
    </tr>
    # same as before goes here
  </table>
  <label><%= 'Select a table for merging:' %></label>
  <%= f.select(:table, @tables.collect { |t| [t, t] }, { :include_blank => true }, { :onchange => "mergeTableSelected();" }) %>
  <%= f.submit "Merge", :disable_with => 'Merging...' %>
<% end -%>

Some things to note about this code:

  • It encloses the entire table within a form. That means that the form will send a POST request with the form parameters as fields. Within the receiving controller method, expression params[:merge] will answer a Hash of values: one for each of the columns, and one for the selected table.
  • Initially, the table headers are null selections. It will produce select tags with no options. We will fill this with content using JavaScript when the user chooses which table (people or places) for merging into.
  • Whenever the user changes the table selection, the browser will invoke a JavaScript function named mergeTableSelected.

The code accesses an instance variable @tables. This is not yet defined. But we will now define it in the import tables controller as follows.

@tables = ActiveRecord::Base.connection.tables.select { |t| t != 'schema_migrations' }

This code performs a little bit of Active Record introspection. It asks Rails for the tables associated with its database connection. The answer is an array of table names, strings. The Array select method de-selects the schema_migrations table, an internal table used by Rails. We want our app to ignore this one, in future possibly others too.

Merge table selected

Next, add the following JavaScript code to the show.html.erb amoung the import tables views. It injects a blob of JavaScript into the show page.

<%= javascript_tag do -%>
  var columns = new Array();
  <% for table in @tables -%>
    <% for column in ActiveRecord::Base.const_get(ActiveRecord::Base.class_name(table)).columns -%>
      columns.push(new Array('<%= table %>', '<%= column.name %>'));
    <% end -%>
  <% end -%>
  function mergeTableSelected() {
    var o = $('select#merge_table option:selected');
    $('th select option').remove();
    $('th select').each(function(i, select) {
      select.options.add(new Option());
      $.each(columns, function(j, column) {
        if (o.text() == column[0]) {
          select.options.add(new Option(column[1]));
        }
      });
    });
  }
<% end -%>

When Rails renders the page, it iterates the introspected @tables, building an array of arrays called columns. Each element contains a table name and a column name. When the user picks a table, the mergeTableSelected() function accesses the selection, removes all existing options from the table header select tags and fills them with new options based on the column names matching the newly selected table name. Does that make sense? The JavaScript relies on jQuery of course. See jQuery documentation.

The merge

Finally, the actual merge operation. When the user clicks Merge, their browser posts the parameters to import_tables/:id where :id stands for the unique identifier of the relevant import table. We need to match this to a method on the import tables controller. Add the following route to routes.rb.

  match 'import_tables/:id' => 'import_tables#merge'

And add the merge method implementation.

  def merge
    # Uncomment the following line if you want to debug this method. However,
    # do not forget to uncomment the gem 'ruby-debug' in Gemfile also; this
    # assumes you have the ruby-debug gem installed too, and do the usual
    # thing if not, i.e. sudo gem install ruby-debug. Calling "debugger"
    # inserts a break point when the server hits this point in the
    # application. From that point you can interrogate the Rails state, single
    # step over or into message-sends and so forth. Very useful for debugging!
    #
    # debugger
    
    import_table = ImportTable.find(params[:id])
    import_cells = import_table.import_cells
    row_index_max = import_cells.map { |cell| cell.row_index }.max
    column_index_max = import_cells.map { |cell| cell.column_index }.max
    
    # Pull the merge parameters from the POST request. The form sets up the
    # column mappings and merge table choice. Then use a little bit of
    # ActiveRecord introspection to derive the table's class name followed by
    # the class object. We will use this latter object to instantiate the
    # merged records.
    merge = params[:merge]
    merge_table = merge[:table]
    klass = ActiveRecord::Base.const_get(ActiveRecord::Base.class_name(merge_table))
    
    # Determine which columns have been mapped. Ignore the rest. Intersect the
    # requested column names with actual column names. Perhaps we should abort
    # and display some error message if the intersection proves empty because
    # the user did not select any columns.
    inverted_merge = merge.invert
    column_names = inverted_merge.keys & klass.column_names
    
    # Finally, create new instances, one per row. Iterate the rows, then for
    # each row, iterate the mapped columns. Select the matching cell and
    # update the record's corresponding column. Redirect to the merged table
    # when done.
    0.upto(row_index_max) do |row_index|
      row = import_cells.select { |cell| cell.row_index == row_index }
      instance = klass.new
      column_names.each do |column_name|
        column_index = inverted_merge[column_name].to_i
        contents = row.select { |cell| cell.column_index == column_index }[0].contents
        instance[column_name] = contents
      end
      instance.save
    end
    eval "redirect_to #{merge_table}_path"
  end

Hopefully, the in-line comments will help to elucidate what this method is doing and why.

Conclusions

What have we learned? Importing CSV is relatively straightforward using Rails 3 and jQuery. But the exercise has raised some outstanding questions and pointed forward to some interesting future directions.

Outstanding questions

  • Uploading of CSV files does not deal with encoding issues. For the English language that might not be an enormous problem. But supporting other languages might begin to throw up bigger and bigger issues depending on how exotic the language needing conversion to Unicode (the internal encoding). There may also be number-formatting issues, e.g. some languages use commas for full stops (periods) within numbers.
  • Column typing has not been considered. Some CSV columns may contain strings, others numbers. How these convert from CSV to real data within the application will depend on the conversion path between uploading and merging of data.

Future directions

  • Some CSV files may have headings in the first row. Import merging might be friendlier if the user could ignore the first row by checking a box entitled, “Ignore first row.”
  • The software does not yet support any form of row selection and de-selection. It merges all rows. A real application would need to trim off the fat: remove duplicates, sort by columns, combine rows by dragging and dropping cell contents in-between rows, etc.

Trackbacks

Use the following link to trackback from your own site:
http://blog.pioneeringsoftware.co.uk/trackbacks?article_id=25

Comments

  • Arwed says

    Great job! Pitty it doesn’t run on Rails 3.0.0.rc

  • Pioneer says

    Good point, Arwed. Thanks. I noticed that too when upgrading from beta4 to rc. I’ve updated the code to support 3.0.0.rc at GitHub.

Leave a comment

(never displayed)

Markdown enabled