New field in Message Archive Management MUC entries: Sender ID

As a part of ensuring GDPR compliance, it is essential to be able to efficiently query MAM MUC data via sender ID (to retrieve user's personal data). Originally, the sender JID could be found only as a part of an encoded XML message element, so finding all items sent by a certain user would be extremely inefficient (or rather: anti-efficient). MongooseIM 3.4.0 uses a modified schema for MAM MUC backends which enables a more efficient extraction.

Below you may find migration instructions specific to your MAM backend.

RDBMS

Step 1

Please execute the following SQL statements on your MIM database:

MySQL

ALTER TABLE mam_muc_message ADD COLUMN sender_id INT UNSIGNED;
CREATE INDEX i_mam_muc_message_sender_id USING BTREE ON mam_muc_message(sender_id);

PostgreSQL

ALTER TABLE mam_muc_message ADD COLUMN sender_id INT;
CREATE INDEX i_mam_muc_message_sender_id ON mam_muc_message USING BTREE (sender_id);

MSSQL

ALTER TABLE [dbo].[mam_muc_message] ADD sender_id bigint;
CREATE INDEX i_mam_muc_message_sender_id ON mam_muc_message(sender_id);

Step 2

Now you have a schema that is compatible with MIM 3.4.0 but isn't GDPR-compliant yet because the new column has no meaningful data.

Please pick your favourite scripting/programming language and populate the new column with the help of a dedicated script. You'll need to iterate over the whole mam_muc_message table with the following algorithm:

  1. Provide message column content to the script.
  2. The script returns sender's JID as username@server string. You need to split it to get a separate username and server.
  3. Select ID from mam_server_user by the username and server. If it doesn't exist, insert a new one (id column is automatically incremented).
  4. Update the sender_id column in mam_muc_message with the retrieved ID.

Cassandra

Step 1

Please execute the following CQL statements on your MIM database:

USE mongooseim;
ALTER TABLE mam_muc_message ADD from_jid varchar;
CREATE INDEX ON mam_muc_message (from_jid);
DESC mam_muc_message;

Step 2

Now you have a schema that is compatible with MIM 3.4.0 but isn't GDPR-compliant yet because the new column has no meaningful data.

Please pick your favourite scripting/programming language and populate the new column with the help of a dedicated script. You'll need to iterate over the whole mam_muc_message table with the following algorithm:

  1. Extract the whole mam_muc_message table. Please make sure to use the paging feature of your Cassandra client, as the MAM tables tend to be very large.
SELECT * FROM mam_muc_message;
  1. To make data extraction faster, MongooseIM stores 2 copies of the message in the table:
cqlsh:mongooseim> SELECT * FROM mam_muc_message WHERE id = 399582233150625537 ALLOW FILTERING;

 room_jid                      | with_nick | id                 | from_jid | message                        | nick_name
-------------------------------+-----------+--------------------+----------+--------------------------------+-----------
 room-ad1d999b9e@muc.localhost |           | 399582233150625537 |     null | 0x8350000001...998de2fa8426837 |       Sid
 room-ad1d999b9e@muc.localhost |       Sid | 399582233150625537 |     null | 0x8350000001...998de2fa8426837 |       Sid
  1. The copy with an empty with_nick column must be updated.
  2. Extract the sender's JID from the message column in the same way as described in the RDBMS migration section. By default cassandra backend uses the eterm format.
  3. Update the from_jid column with the value of the extracted sender's JID:
cqlsh:mongooseim> UPDATE mam_muc_message SET from_jid = 'username@server' WHERE id = 399582233150625537  AND with_nick = '' AND room_jid = 'room-ad1d999b9e@muc.localhost';
cqlsh:mongooseim> SELECT * FROM mam_muc_message WHERE id = 399582233150625537 ALLOW FILTERING;

 room_jid                      | with_nick | id                 | from_jid        | message                        | nick_name
-------------------------------+-----------+--------------------+-----------------+--------------------------------+-----------
 room-ad1d999b9e@muc.localhost |           | 399582233150625537 | username@server | 0x8350000001...998de2fa8426837 |       Sid
 room-ad1d999b9e@muc.localhost |       Sid | 399582233150625537 |            null | 0x8350000001...998de2fa8426837 |       Sid

Riak

Changes to Riak schema are backward compatible with the current MongooseIM release. This means that skipping the migration will cause only some of the new features (namely GDPR data retrival) to not work correctly.

Step 1

Please update the Riak schema:

# Set the RIAK_HOST to your Riak HTTP endpoint
# Set the RIAK_MAM_SCHEMA_PATH to point to new schema path, which
# by default is: RIAK_MAM_SCHEMA_PATH=tools/mam_search_schema.xml
curl -v -XPUT $RIAK_HOST/search/schema/mam \
    -H 'Content-Type:application/xml' \
    --data-binary @${RIAK_MAM_SCHEMA_PATH}

After that we need to either reload all Riak nodes (restart them) or manually reload the schema on live nodes. Reloading the schema on live nodes requires access to Erlang Shell of one of the Riak nodes (any of them). The instruction on how to get to Riak's Erlang shell is beyond this guide, but if you manage to get to it, just call:

yz_index:reload(<<"mam">>).

Step 2

After the schema is posted and reloaded, all "new" objects will be indexed properly as long they contain 2 new fields: msg_owner_jid and mam_type. The new MongooseIM code will insert both of them for all new MAM entires, but for all existing ones need to have the fields added. In order to do that, we need to create a migration script (just pick your favourite scripting/programming language) that will do the following for each object in each bucket of type mam_yz (the object will be referred as obj):

  • Use this dedicated script to convert the obj.packet_register field value into a so called $SENDER_JID.
  • If the script returns $SENDER_JID correctly:
  • set obj.mam_type = 'muc'
  • set obj.msg_owner_jid = $SENDER_JID
  • If the script returns error code -2
  • set obj.mam_type = 'pm'
  • based on obj_yz_rk formatted as $LOCAL_JID/$REMOTE_JID/$MSG_ID, set obj.msg_owner_jid = $LOCAL_JID
  • Save the modified obj

ElasticSearch

Step 1

Please update the mapping for muc_messages:

PUT muc_messages/_mapping/muc
{
  "properties": {
    "mam_id": {
      "type": "long"
    },
    "room": {
      "type": "keyword"
    },
    "from_jid" : {
      "type": "keyword"
    },
    "source_jid": {
      "type": "keyword"
    },
    "message": {
      "type": "text",
      "index": false
    },
    "body": {
      "type": "text",
      "analyzer": "english"
    }
  }
}

Step 2

Now you have a schema that is compatible with MIM 3.4.0 but isn't GDPR-compliant yet because the new field has no meaningful data.

Please pick your favourite scripting/programming language and populate the new column with the help of a dedicated script. You'll need to iterate over the all muc_messages documents with the following algorithm:

  1. Extract some documents (notice the size parameter) for conversion:
GET muc_messages/_search/?size=100&q=!_exists_:from_jid
  1. Extract the sender's JID from the message field in the same way as described in the RDBMS migration section. Elasticsearch backend uses exclusively the xml format.
  2. Update the from_jid column with the value of the extracted sender's JID:
POST localhost:9200/muc_messages/muc/%_id%/_update
{
  "doc": {
    "from_jid" : "%sender's jid%"
  }
}
  1. Repeat all the actions until the full conversion of the database is done.