Seamless Rich-Text Content Migration: Transforming Inline Images into Kontent.ai Assets

Sep 20, 2023
Kontent.ai

Are you facing the daunting task of migrating your website's rich-text content from an existing CMS to Kontent.ai? Look no further! In this post, I'll show you a solution that simplifies the process, focusing on the intricate task of moving inline images into the Kontent.ai Asset library.

Every new project presents new challenges when content migration is considered. A common consideration is the migration of media assets. Establishing a reliable and repeatable way to extract media from the source platform and maintain references to it in rich-text content is a typical scenario that needs to be covered.

The Problem

In a platform I've been working with, the data for the migration is a JSON file with rich-text content represented as HTML. For each language version, we have something similar to the JSON shown below:

{
  "Language": "en",
  "Headline": "Lorem Ipsum",
  "Body": "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quo modo autem philosophus loquitur? Minime vero, inquit ille, consentit. Quare conare, quaeso. Gloriosa ostentatio in constituendo summo bono. Nec vero alia sunt quaerenda contra Carneadeam illam sententiam. </p><p><img src=\"https://media.example.com/images/035167-khvqdy5a4u5.jpg\" alt=\"Smashing Image\" /></p><p>Res enim concurrent contrariae. Duo Reges: constructio interrete. Itaque ad tempus ad Pisonem omnes. Quodsi ipsam honestatem undique pertectam atque absolutam. Avaritiamne minuis? </p>",
  "LastChanged": "2023-08-30T14:36:02.8498993+00:00"
}

We can see that the image is inline and inside a <p> tag. There are three issues with this:

  1. The allowed elements in rich text for Kontent.ai don't allow for an <img> tag inside of a <p> tag.
  2. Images in rich text should be represented with a <figure> tag, referencing an item in Kontent's Assets.
  3. The image isn't in Kontent's Assets.

So this leaves us with two tasks that we need to solve in this order: upload the image to Kontent, and modify the markup so that it can be inserted into the rich text field.

Uploading the File to Kontent.ai

Identifying Inline Images

With the above markup, we have the following image: <img src="https://media.example.com/images/035167-khvqdy5a4u5.jpg" alt="Smashing Image" />. To identify this, we're going to use AngleSharp and a record named InlineImage to store information about each image. The record looks like this:

// Inline image record
public record InlineImage
{
    public string Url { get; init; }
    public string Id { get; init; }
    public string AltText { get; init; }
}

You'll notice an ID field here; we will use that as the external ID in Kontent. If this isn't something that you can get from the URL, you'll likely need to try something else to prevent images from being duplicated.
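If your URLs don't carry a usable ID, one option is to derive a stable one from the URL itself. The sketch below is only an illustration of that idea, not part of the original solution: ExternalIdHelper and the "img-" prefix are hypothetical names, and it assumes that the same image is always referenced by the same URL (two different URLs for one file would still produce two assets).

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ExternalIdHelper
{
    // Hypothetical helper: derives a deterministic external ID from an image URL,
    // so repeated migration runs map the same URL to the same asset.
    public static string FromUrl(string url)
    {
        // Trim only; URL paths can be case-sensitive, so the casing is left untouched.
        var normalised = url.Trim();
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(normalised));

        // The first 16 bytes of the hash as lowercase hex (32 characters) is plenty here.
        return "img-" + Convert.ToHexString(hash, 0, 16).ToLowerInvariant();
    }
}
```

Because the ID depends only on the URL, running the import twice resolves to the same external ID rather than creating a duplicate.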

Using AngleSharp, we have a simple piece of code to return all inline images that we find in our markup:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using AngleSharp.Html.Parser;

// Find all of the images in the markup
public IEnumerable<InlineImage> ExtractInlineImages(string html)
{
  var images = new List<InlineImage>();
  var parser = new HtmlParser();
  var document = parser.ParseDocument(html);

  // The numeric prefix of the file name becomes the external ID
  var r = new Regex(@".*\/images\/(?<id>[0-9]{1,6})-.*");

  foreach (var element in document.QuerySelectorAll("p img"))
  {
    var url = element.GetAttribute("src") ?? string.Empty;
    var match = r.Match(url);

    // Skip images whose URL doesn't contain an ID we can use
    if (!match.Success) continue;

    images.Add(new InlineImage
    {
      Id = match.Groups["id"].Value,
      Url = url,
      AltText = element.GetAttribute("data-caption") ?? element.GetAttribute("alt") ?? string.Empty
    });
  }
  return images;
}
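To make the ID extraction easier to reason about, here is the same pattern pulled out into a small stand-alone helper. ImageIdDemo and TryGetId are names I've made up for this sketch; the regular expression is the one used above.

```csharp
using System.Text.RegularExpressions;

public static class ImageIdDemo
{
    // Same pattern as above: up to six digits directly after "/images/", followed by a hyphen.
    private static readonly Regex IdPattern = new(@".*\/images\/(?<id>[0-9]{1,6})-.*");

    // Returns true and sets id when the URL contains a numeric prefix we can use.
    public static bool TryGetId(string url, out string id)
    {
        var match = IdPattern.Match(url);
        id = match.Success ? match.Groups["id"].Value : string.Empty;
        return match.Success;
    }
}
```

For the sample URL https://media.example.com/images/035167-khvqdy5a4u5.jpg this yields "035167". A URL such as https://media.example.com/images/logo.png has no numeric prefix, so it doesn't match and the image is skipped.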

Retrieving and Handling Image Files

Now that we know some basic information about the file, we can load it from the source. Here we also read the content type header. This will be used when the file is uploaded.

// Get the image and its metadata
var httpClient = new HttpClient();
var response = await httpClient.GetAsync(image.Url);
if (!response.IsSuccessStatusCode)
{
  // TODO: Handle the failed download however you see fit (log it, skip the image, or throw)
}

// Retrieve the Content-Type header
if (!response.Content.Headers.TryGetValues("Content-Type", out var contentTypes))
{
  throw new Exception("Failed to retrieve Content-Type header.");
}

var contentType = contentTypes.FirstOrDefault();
var imageBytes = await response.Content.ReadAsByteArrayAsync();
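If you're migrating a lot of images, it may be worth wrapping the download in a helper that reuses one HttpClient. The following is a sketch of how that could look rather than the code from my project: ImageDownloader, DownloadAsync and ReadImageAsync are hypothetical names, and throwing on a failed download is just one possible choice.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ImageDownloader
{
    // Downloads the image with a caller-supplied HttpClient so one client can be shared.
    public static async Task<(byte[] Bytes, string ContentType)> DownloadAsync(HttpClient http, string url)
    {
        using var response = await http.GetAsync(url);
        return await ReadImageAsync(response);
    }

    // Reads the bytes and media type from a response; throws if either is unavailable.
    public static async Task<(byte[] Bytes, string ContentType)> ReadImageAsync(HttpResponseMessage response)
    {
        // Assumption: a failed download should stop processing of this image.
        response.EnsureSuccessStatusCode();

        // MediaType is the bare type (for example "image/jpeg") without parameters.
        var contentType = response.Content.Headers.ContentType?.MediaType
            ?? throw new InvalidOperationException("Failed to retrieve Content-Type header.");

        var bytes = await response.Content.ReadAsByteArrayAsync();
        return (bytes, contentType);
    }
}
```

Calling it is then a one-liner: var (imageBytes, contentType) = await ImageDownloader.DownloadAsync(httpClient, image.Url);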

Uploading the Images to Kontent.ai

We can upload the image using the Kontent.ai Management API. First, we need to check that the asset does not already exist. This is done by trying to retrieve the asset using its external ID. We need to wrap a try-catch block around the call, as the .NET Management SDK will throw an exception if there is no asset with the given external ID.

// Check if the image exists
var imageExternalId = Reference.ByExternalId(image.Id);

try
{
  var kontentImage = await _managementClient.GetAssetAsync(imageExternalId);
  // Update the item here, if we need to.
}
catch (ManagementException)
{
  // Insert the new image here.
}

To perform the upload and add the item to Kontent.ai's Asset library, we first need to create a file reference. Creating it uploads the file to Kontent, and the returned metadata can then be used to build an AssetUpsertModel. We then use that model to create the asset in Kontent.

// Upload the image to Kontent.ai
var fileReference = await _managementClient.UploadFileAsync(
                      new FileContentSource(
                        imageBytes, 
                        Path.GetFileName(image.Url), 
                        contentType
                      ));

var asset = new AssetUpsertModel
{
  Title = image.AltText,
  FileReference = fileReference,
  Descriptions = new[]
  {
    new AssetDescription
    {
      Description = image.AltText,
      Language = Reference.ByCodename(importLanguage)
    }
  }
};

await _managementClient.UpsertAssetAsync(imageExternalId, asset);

That deals with importing the images from their external source into the Kontent.ai asset library. What we need to do next is refer to those images in the content itself.

Modifying the Markup for Kontent.ai

In my solution, I've decoupled the storage of the images in Kontent from altering the markup to refer to the new assets. For me, this seemed cleaner, but you can do this all at the same time if desired.

Kontent.ai uses the figure tag to embed images from its asset library, rather than the bare img tag used in the source markup. What we're looking to have is something like this: <figure data-asset-external-id="OUR-ASSET-ID"><img src="#" data-asset-external-id="OUR-ASSET-ID"></figure>.

We'll be using the external ID to add the image, as in our case, we can easily determine it from the image URL. 

The code below is all that is needed to update the documents. Using the same method as before, we select all of the image elements in our markup. For each image we find, we create a new figure element and insert it before the paragraph that contained the image, then remove the old image tag.

// Replace `img` with `figure`
// Note: the pattern is case-sensitive, so it must match the lowercase "images" segment in the URL
var r = new Regex(@".*\/images\/(?<id>[0-9]{1,6})-.*");

foreach (var element in document.QuerySelectorAll("p img"))
{
  var url = element.GetAttribute("src") ?? string.Empty;
  var match = r.Match(url);

  if (!match.Success)
  {
    // No usable ID in the URL, so there is no matching asset: drop the image
    element.Parent.RemoveChild(element);
  }
  else
  {
    // Create a new 'figure' element that references the asset by its external ID
    var assetExternalId = match.Groups["id"].Value;

    var imgElement = document.CreateElement("img");
    imgElement.SetAttribute("src", "#");
    imgElement.SetAttribute("data-asset-external-id", assetExternalId);

    var figElement = document.CreateElement("figure");
    figElement.SetAttribute("data-asset-external-id", assetExternalId);
    figElement.AppendChild(imgElement);

    // Insert the figure before the paragraph that held the image, then remove the image
    element.Parent.Parent.InsertBefore(figElement, element.Parent);
    element.Parent.RemoveChild(element);
  }
}
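The loop above works on an already parsed document, so you still need to parse the HTML first and serialise it again afterwards. Here is a rough sketch of how the whole step could be wrapped into one method. RichTextConverter and ReplaceInlineImages are hypothetical names, the pattern assumes the lowercase /images/ path from the sample URL, and removing a paragraph that ends up empty is my own assumption about what you'd want in the target field.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using AngleSharp.Html.Parser;

public static class RichTextConverter
{
    private static readonly Regex IdPattern = new(@".*\/images\/(?<id>[0-9]{1,6})-.*");

    // Parses the HTML fragment, swaps inline images for figures, and returns the new markup.
    public static string ReplaceInlineImages(string html)
    {
        var document = new HtmlParser().ParseDocument(html);

        // ToList() gives us a snapshot, so changing the tree doesn't affect the iteration.
        foreach (var img in document.QuerySelectorAll("p img").ToList())
        {
            var paragraph = img.Closest("p");
            var match = IdPattern.Match(img.GetAttribute("src") ?? string.Empty);

            if (match.Success && paragraph?.Parent is not null)
            {
                var assetExternalId = match.Groups["id"].Value;

                var newImg = document.CreateElement("img");
                newImg.SetAttribute("src", "#");
                newImg.SetAttribute("data-asset-external-id", assetExternalId);

                var figure = document.CreateElement("figure");
                figure.SetAttribute("data-asset-external-id", assetExternalId);
                figure.AppendChild(newImg);

                // The figure becomes a sibling placed directly before the paragraph.
                paragraph.Parent.InsertBefore(figure, paragraph);
            }

            img.Remove();

            // Assumption: a paragraph that only held the image is no longer needed.
            if (paragraph is not null && !paragraph.HasChildNodes) paragraph.Remove();
        }

        // The parser wraps the fragment in html/body, so return only the body content.
        return document.Body?.InnerHtml ?? string.Empty;
    }
}
```

The returned string is what you would then set as the value of the rich text element when importing the content item.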

Conclusion

Migrating rich-text content with inline images to Kontent.ai doesn't have to be an insurmountable challenge. With the techniques outlined in this article, you can identify inline images, upload them to the Asset library under a stable external ID, and rewrite the markup so that your content refers to those assets.

This journey has highlighted the importance of understanding the intricacies of your content's structure and the technical nuances of both your source and destination platforms. By taking a systematic approach, from extracting inline images to uploading them as Kontent.ai assets, you're not only saving valuable time but also ensuring your content maintains its integrity during migration.

As you embark on your CMS migration journey, remember that each project is unique, and adaptability is key. With the newfound knowledge in your toolkit, you're well-prepared to face the challenges and complexities of content migration head-on, all while harnessing the power of Kontent.ai to elevate your online presence and digital experiences.